Good morning Sean,

I was able to reproduce the problem using only ECJ and none of my custom
Problem/Statistics implementations.

Attached are the parameters I used for the master and the slave,
respectively.  Before running the master, I fire up 10 slaves on my local
machine like so:

for i in {1..10}; do java -cp dist/ECJ.jar ec.eval.Slave -file example_slave.params & done;


FWIW, I'm on OS X 10.8.5.

Siggy

On Fri, Jun 6, 2014 at 8:26 AM, Sean Luke <[log in to unmask]> wrote:

> The protocol for shutting down is that when the master wants to quit,
> either because it (1) was told by a slave that it found the optimal
> individual or (2) it ran out of time, it then sends a signal to each of the
> slaves.  When a slave receives this signal, it's supposed to quit.  The
> master waits for all the slaves to quit, then it dies.
>
> This means that the bug could be in one of two places:
>         - The master isn't sending a slave the signal
>         - The slave received the signal but isn't quitting
>
> Your analysis suggests it's the first, but that seems unlikely.  I need
> you to do some printing on the slave side -- or a dump to the slave
> statistics file -- which indicates that the slave did in fact receive the
> signal.
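
The shutdown handshake described above can be sketched in plain Java. This is not ECJ's implementation: the class name, the two opcodes, and the single-byte framing are assumptions made purely for illustration. The slave prints a line the moment the signal arrives, which is the kind of slave-side evidence requested here.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: names and opcodes are hypothetical, not ECJ's.
class ShutdownSketch {
    static final byte EVALUATE = 1; // hypothetical opcode: "here is work"
    static final byte SHUTDOWN = 0; // hypothetical opcode: "please quit"

    // Starts n slave threads, sends each one a job and then the shutdown
    // signal, waits for them to exit, and returns how many saw the signal.
    static int runDemo(int n) {
        AtomicInteger acknowledged = new AtomicInteger();
        try (ServerSocket master = new ServerSocket(0)) {
            int port = master.getLocalPort();
            List<Thread> slaves = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final int id = i;
                Thread t = new Thread(() -> slaveLoop(id, port, acknowledged));
                t.start();
                slaves.add(t);
            }
            List<Socket> connections = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                connections.add(master.accept());
            }
            for (Socket s : connections) {
                DataOutputStream out = new DataOutputStream(s.getOutputStream());
                out.writeByte(EVALUATE);
                out.writeByte(SHUTDOWN);
                out.flush();
            }
            for (Thread t : slaves) {
                t.join(); // the master waits for every slave to quit
            }
            for (Socket s : connections) {
                s.close();
            }
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
        return acknowledged.get();
    }

    static void slaveLoop(int id, int port, AtomicInteger acknowledged) {
        try (Socket s = new Socket("127.0.0.1", port);
             DataInputStream in = new DataInputStream(s.getInputStream())) {
            while (true) {
                byte op = in.readByte();
                if (op == SHUTDOWN) {
                    // Slave-side trace proving the signal was received.
                    System.err.println("slave " + id + ": received shutdown signal");
                    acknowledged.incrementAndGet();
                    return;
                }
                // Any other opcode stands in for evaluation work.
            }
        } catch (IOException e) {
            System.err.println("slave " + id + ": connection lost: " + e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo(3) + " of 3 slaves saw the shutdown signal");
    }
}
```

If a real slave never prints its equivalent of the "received shutdown signal" line, the signal was not delivered. If it prints the line and keeps running, the fault is on the slave side.
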
>
> As to the FATAL ERROR: this error happens when the master breaks its
> connection with the slave, which only happens when the master is quitting.
> The slave then quits as a result of the exceptional condition.  Now
> this is a deeper mystery because you said that the master *isn't* quitting;
> and furthermore the only other possibility is that the master has closed
> the connection because it knows the slave is quitting... but that's not
> happening.
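
A small sketch of that exceptional condition, assuming nothing about ECJ's internals (the class and method names are hypothetical): a reader blocked on a socket whose peer closes the connection without sending a shutdown message first. In this particular arrangement the read ends with an EOFException. Other timings, or a write instead of a read, can surface as a SocketException such as "Broken pipe" or "Connection reset".

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative only: not ECJ code. Shows what a blocking read sees when
// the peer closes the connection without sending anything first.
class DroppedConnectionSketch {
    // Returns a short description of how the slave's blocking read ended.
    static String readAfterMasterCloses() {
        try (ServerSocket master = new ServerSocket(0);
             Socket slaveSide = new Socket("127.0.0.1", master.getLocalPort());
             Socket masterSide = master.accept()) {
            // The master drops the link; no shutdown opcode is ever sent.
            masterSide.close();
            DataInputStream in = new DataInputStream(slaveSide.getInputStream());
            try {
                in.readByte();
                return "read a byte";
            } catch (EOFException e) {
                // End of stream: the peer closed its side cleanly.
                return "EOFException";
            }
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("slave read ended with: " + readAfterMasterCloses());
    }
}
```
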
>
> I need a code example.  Can you provide me one personally?
>
> Sean
>
> On Jun 6, 2014, at 12:47 AM, Eric 'Siggy' Scott <[log in to unmask]> wrote:
>
> > PS: I've confirmed that this also occurs with SimpleEvolutionState, not
> just SteadyStateEvolutionState, so it has nothing to do with asynchronous
> evolution proper.
> >
> >
> > On Fri, Jun 6, 2014 at 12:35 AM, Eric 'Siggy' Scott <[log in to unmask]>
> wrote:
> > I'm running asynchronous evolution with the latest SVN revision.  This
> is a toy experiment, so all my slaves are running on the same machine as
> the master.
> >
> > All the slaves are supposed to shut themselves down at the end of the
> run upon a command from the master (in non-daemon mode), or keep purring in
> the background (in daemon mode).
> >
> > What I see, however, is that in both cases a handful of slaves shut
> down, and a handful do not, and then the master hangs.
> >
> > For instance, the following is a case where I have 10 slaves running in
> daemon mode, and the master executing multiple jobs in sequence.  Job 0 ran
> fine -- I got 10 "connected successfully" messages at the beginning and 10
> "shut down" messages at the end.  But here is the output of job 1:
> >
> > Threads:  breed/1 eval/1
> > Seed: 1869276809
> > Job: 1
> > Setting up
> > WARNING:
> > You've chosen to use Steady-State Evolution, but your statistics does
> not implement the SteadyStateStatisticsForm.
> > PARAMETER: stat.child.0
> > Initializing Generation 0
> > Slave /127.0.0.1/1402028546585 connected successfully.
> > Slave /127.0.0.1/1402028546798 connected successfully.
> > Slave /127.0.0.1/1402028546699 connected successfully.
> > Slave /127.0.0.1/1402028546658 connected successfully.
> > Slave /127.0.0.1/1402028546690 connected successfully.
> > Slave /127.0.0.1/1402028546773 connected successfully.
> > Slave /127.0.0.1/1402028546763 connected successfully.
> > Slave /127.0.0.1/1402028546753 connected successfully.
> > Slave /127.0.0.1/1402028546794 connected successfully.
> > Slave /127.0.0.1/1402028546755 connected successfully.
> > Generation 1  Evaluations 50
> > Subpop 0 best fitness of generation Fitness: 1.0
> > Generation 2  Evaluations 100
> > Subpop 0 best fitness of generation Fitness: 1.0
> > Generation 3  Evaluations 150
> > Subpop 0 best fitness of generation Fitness: 1.0
> > Generation 4  Evaluations 200
> > Subpop 0 best fitness of generation Fitness: 1.0
> > Generation 5  Evaluations 250
> > Subpop 0 best fitness of generation Fitness: 1.0
> > Generation 6  Evaluations 300
> > Subpop 0 best fitness of generation Fitness: 1.0
> > Generation 7  Evaluations 350
> > Subpop 0 best fitness of generation Fitness: 1.0
> > Generation 8  Evaluations 400
> > Subpop 0 best fitness of generation Fitness: 1.0
> > Generation 9  Evaluations 450
> > Subpop 0 best fitness of generation Fitness: 1.0
> > Generation 10 Evaluations 500
> > Subpop 0 best fitness of generation Fitness: 1.0
> > Subpop 0 best fitness of run: Fitness: 1.0
> > Slave /127.0.0.1/1402028546585 shut down.
> > Slave /127.0.0.1/1402028546798 shut down.
> > Slave /127.0.0.1/1402028546699 shut down.
> > Slave /127.0.0.1/1402028546658 shut down.
> > Slave /127.0.0.1/1402028546690 shut down.
> >
> > After shutting down a fraction of the slaves, it hangs.  I have to
> control-C to exit.
> >
> > The failure appears to be random -- sometimes it occurs at the end of
> the 0th job.  It still occurs when the slaves are launched in non-daemon
> mode.  Sometimes the failure does not show up until the 4th or 5th job.
>  The number of slaves that succeed or fail to shut down appears to be
> arbitrary.  In short, we have all the signs of a race condition.
> >
> > When there is a failure, some of the slaves (but not necessarily the
> same number of slaves that succeeded or failed to shut down) print the
> message:
> >
> > FATAL ERROR:
> > Unable to read individual from master.java.net.SocketException: Broken
> pipe
> >
> > I ran a debugger on a non-daemon slave that failed to shut down -- it
> seemed to be stuck happily waiting to receive a message from the master, as
> if it'd never been told to shut down at all.
> >
> > Besides that, I don't know what's going on.
> >
> > Siggy
> >
> > --
> >
> > Ph.D student in Computer Science
> > George Mason University
> > http://mason.gmu.edu/~escott8/
>



-- 

Ph.D student in Computer Science
George Mason University
http://mason.gmu.edu/~escott8/