On Oct 26, 2006, at 1:50 PM, Robert Baruch wrote:
> I've had an idea for a few months now, that I've been letting
> percolate in the back half of my brain. If I understand the way the
> current master-slave processing works, the master starts up, and
> uniformly queues jobs to slaves. Each job consists of one or more
> individuals (depending on whether we're running a SimpleProblemForm
> or a GroupedProblemForm).
> For each job on the queue, the master sends the appropriate
> individuals to the slave, the slave evaluates them, and then sends
> the results to the master.
> So for a SimpleProblemForm with, say, 1024 individuals and 4
> slaves, there are 256 cycles of send-wait-result per slave.
> My concerns are:
> 1. The algorithm doesn't address the relative processing power of
> each slave in any way.
> 2. As the comments in the API state, master-slave only makes sense
> if the communication time for a single job is a fraction of the
> evaluation time.
If the documentation says something along those lines, we need to
change it. The mechanism actually works somewhat differently.
When a slave comes online, it joins a pool of available slaves. When
the system must evaluate N individuals, say, a population's worth, it
goes through the following loop:
    while there are still individuals to be processed
        pick an available slave arbitrarily
        give that slave some (small number) M individuals
In the background, the system waits for slaves to report back. When
a slave finishes, its individuals are marked as done and the slave
rejoins the available pool. If a slave goes down while processing, its
unprocessed individuals go back on the need-to-be-processed list.
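The loop and the background bookkeeping described above can be sketched in Java roughly as follows. This is not ECJ's actual code: the names `DispatchSketch`, `Report`, `evaluate`, and the `crashing` set are hypothetical, slaves are simulated with threads, and a slave listed in `crashing` is assumed to die on its first job.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

final class DispatchSketch {
    /** What a simulated slave reports back to the master (hypothetical type). */
    static final class Report {
        final int slave;
        final List<Integer> batch;
        final boolean ok;
        Report(int slave, List<Integer> batch, boolean ok) {
            this.slave = slave;
            this.batch = batch;
            this.ok = ok;
        }
    }

    /**
     * Evaluates individuals 0..n-1 in batches of at most m.
     * Returns how many times each individual was successfully evaluated.
     */
    static int[] evaluate(int n, int slaveCount, int m, Set<Integer> crashing) {
        if (m < 1) throw new IllegalArgumentException("m must be at least 1");
        Deque<Integer> pending = new ArrayDeque<>();      // need-to-be-processed list
        for (int i = 0; i < n; i++) pending.add(i);
        Deque<Integer> available = new ArrayDeque<>();    // pool of idle slaves
        for (int s = 0; s < slaveCount; s++) available.add(s);
        BlockingQueue<Report> reports = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newCachedThreadPool();
        int[] evaluated = new int[n];
        int outstanding = 0;                              // jobs sent but not yet reported
        try {
            while (!pending.isEmpty() || outstanding > 0) {
                // Hand out work while there are both individuals and idle slaves.
                while (!pending.isEmpty() && !available.isEmpty()) {
                    final int slave = available.poll();
                    final List<Integer> batch = new ArrayList<>();
                    while (batch.size() < m && !pending.isEmpty()) batch.add(pending.poll());
                    final boolean ok = !crashing.contains(slave);
                    pool.execute(() -> reports.add(new Report(slave, batch, ok)));
                    outstanding++;
                }
                if (outstanding == 0) throw new IllegalStateException("no slaves left");
                Report r = reports.take();                // wait for some slave to come back
                outstanding--;
                if (r.ok) {
                    for (int id : r.batch) evaluated[id]++;
                    available.add(r.slave);               // slave rejoins the pool
                } else {
                    pending.addAll(r.batch);              // lost work is queued again
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for slaves", e);
        } finally {
            pool.shutdownNow();
        }
        return evaluated;
    }
}
```

Calling `DispatchSketch.evaluate(1024, 4, 8, crashing)` with one slave marked as crashing should still leave every individual evaluated exactly once, since the failed batch goes back on the pending list.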
It's true ECJ will assign jobs to every one of your processors if
there are many more individuals than processors. But if you have a
fast processor on one machine and a slow one on another, the fast
processor will become available more often, which balances the load.
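To see why this balances the load, here is a small deterministic simulation in Java. It is only an illustration under stated assumptions: `LoadBalanceSketch`, `simulate`, and the per-individual costs are made up, every individual on a given slave is assumed to cost the same, and "pick an available slave" is modeled as handing the next batch to whichever slave becomes free first.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class LoadBalanceSketch {
    /**
     * Simulates handing out n individuals in batches of at most m.
     * costPerIndividual[s] is the (hypothetical) time slave s needs per individual.
     * Returns how many individuals each slave ended up evaluating.
     */
    static int[] simulate(int n, int m, long[] costPerIndividual) {
        if (m < 1) throw new IllegalArgumentException("m must be at least 1");
        if (costPerIndividual.length == 0) throw new IllegalArgumentException("no slaves");
        // Each entry is {time at which the slave is free again, slave index}.
        PriorityQueue<long[]> free = new PriorityQueue<>(
                Comparator.<long[]>comparingLong(e -> e[0]).thenComparingLong(e -> e[1]));
        for (int s = 0; s < costPerIndividual.length; s++) {
            free.add(new long[] {0L, s});
        }
        int[] handled = new int[costPerIndividual.length];
        int remaining = n;
        while (remaining > 0) {
            long[] next = free.poll();                 // the slave that comes back first
            int slave = (int) next[1];
            int batch = Math.min(m, remaining);
            remaining -= batch;
            handled[slave] += batch;
            // The slave is busy until it has worked through its batch.
            free.add(new long[] {next[0] + batch * costPerIndividual[slave], slave});
        }
        return handled;
    }
}
```

With two slaves where the second is four times slower per individual, for example `simulate(1000, 10, new long[] {1, 4})`, the faster slave ends up with most of the work without the master knowing anything about relative speeds.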
The advantage of this mechanism over your proposal (which is good
btw) is that slow processing isn't just a function of the CPU speed.
It's more often than not a function of the size and complexity of the
individual. And that's not something you can necessarily figure out
a priori.
Still, there might be some improvements, mostly in the identification
of which slave to pick. If a slave has a record of coming back
rapidly, maybe it ought to go to the front of the line.
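One way such a "front of the line" policy might look, sketched in Java. Nothing here exists in ECJ: `SlaveRanking`, `record`, `pick`, and the smoothing factor `alpha` are hypothetical, and using an exponential moving average of turnaround time is just one possible choice. Slaves with no history are treated as fastest so that they get measured at least once.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

final class SlaveRanking {
    private final Map<String, Double> avgMillis = new HashMap<>();
    private final double alpha;   // weight of the newest measurement, between 0 and 1

    SlaveRanking(double alpha) {
        this.alpha = alpha;
    }

    /** Records how long a slave took to return its last job. */
    void record(String slave, double millis) {
        avgMillis.merge(slave, millis,
                (old, fresh) -> (1 - alpha) * old + alpha * fresh);
    }

    /** Picks the available slave with the lowest average turnaround so far. */
    String pick(Collection<String> available) {
        return available.stream()
                .min(Comparator.comparingDouble(
                        (String s) -> avgMillis.getOrDefault(s, 0.0)))
                .orElseThrow(() -> new IllegalStateException("no slave available"));
    }
}
```

Because the average keeps updating, a slave that was fast on small individuals but then gets stuck on a large, complex one drops back in the ordering, which fits the point above that slowness is not only a matter of CPU speed.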