I've had an idea for a few months now, that I've been letting
percolate in the back half of my brain. If I understand the way the
current master-slave processing works, the master starts up, and
uniformly queues jobs to slaves. Each job consists of one or more
individuals (depending on whether we're running a SimpleProblemForm
or a GroupedProblemForm).
For each job on the queue, the master sends the appropriate
individuals to the slave, the slave evaluates them, and then sends
the results to the master.
So for a SimpleProblemForm with, say, 1024 individuals and 4 slaves,
there are 256 cycles of send-wait-result per slave.
My concerns are:
1. The algorithm doesn't address the relative processing power of
each slave in any way.
2. As the comments in the API state, master-slave only makes sense if
the communication time for a single job is a fraction of the
To address these issues, I propose to work on a solution that will at
least work for SimpleProblemForms, as follows:
INIT: For each slave, enter 1.0 in efficiency array.
1. Master divides individuals to be evaluated among slaves according
to their efficiency as follows:
#individuals_to_evaluate[slave] = e[slave] / sum(e)
where e[slave] is the efficiency entry in the array, and
sum(e) is the sum of all entries in the efficiency array.
2. Send ALL individuals to be evaluated to each slave. Master awaits
3. Slave evaluates all individuals sent to it.
4. Slave sends ALL results to master.
5. Once all results are in, based on the times and number of
individuals evaluated, master updates the slave efficiency array as
e[slave] = #individuals_evaluated[slave] / time_to_evaluate[slave]
6. Master computes next generation. Repeat to step 1.
I feel this addresses the two issues in this way:
1. Although not perfect, each slave, after the first generation, is
assigned a number of individuals to evaluate such that all slaves
should return at statistically the same time, give or take.
2. The communication time becomes small relative to the evaluation
time by having the slaves evaluate many individuals per communication
rather than just one individual.
One issue is what happens if a slave fails. In this case, it would
probably make sense to have a timeout per slave, such that if the
slave does not return within, say, twice the expected time, the slave
is considered nonresponsive, and removed from the list of available
The expected time would be calculated based on the time it takes for
the first slave to return. In generations > 1 this makes sense
because all slaves are expected to return in roughly the same amount
of time. For generation 1, this will kill off slaves who are
unreasonably slow relative to the fastest slave. This isn't too good
if the fastest slave happens to be an outlier, though.
Any thoughts? Comments? Criticisms?