This sounds a *lot* like your process is running out of ports; that is, it's probably not an ECJ error or even a Java problem but rather an OS problem.  But at just 129 slaves?  That seems really low to me. We regularly handle rather more than that.
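
One way to test that guess from inside the JVM is sketched below. It assumes "running out of ports" would show up as the master process hitting its open-file-descriptor limit (each socket uses one descriptor), and it relies on the JDK-specific `com.sun.management.UnixOperatingSystemMXBean`, which is only available on Unix-like systems. `DescriptorCheck` is a hypothetical helper, not part of ECJ.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical diagnostic helper; not part of ECJ.
public class DescriptorCheck {
    // Returns {open, max} file descriptor counts for this JVM process,
    // or null if the platform bean does not expose them (e.g. on Windows).
    public static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] counts = descriptorCounts();
        if (counts == null) {
            System.out.println("Descriptor counts are not available on this platform.");
        } else {
            System.out.println("open descriptors: " + counts[0] + " of max " + counts[1]);
        }
    }
}
```

If the open count climbs toward the maximum as slaves connect and get killed, sockets are probably not being released on the master side.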

What OS and Java version are you running?  And which ECJ version?

shutdown() only happens if either the readLoop or the writeLoop of the underlying slave connection has had its connection broken.  What's the exception being thrown?  In writeLoop and readLoop, change 

	shutdown(state);

to

	e.printStackTrace(); shutdown(state);
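
As a self-contained illustration of that change, here is a minimal sketch of a read loop that prints the exception before shutting down. `LoopDiagnostics` and its fields are hypothetical stand-ins, not ECJ's actual SlaveConnection code, and the in-memory stream only simulates a peer that goes away.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical stand-in for a slave connection; not ECJ's SlaveConnection.
public class LoopDiagnostics {
    private final DataInputStream dataIn;
    private boolean shuttingDown = false;
    private Exception lastFailure = null;

    public LoopDiagnostics(DataInputStream dataIn) {
        this.dataIn = dataIn;
    }

    // Reads bytes until the stream fails, then reports the cause and shuts down.
    // Returns the number of bytes read before the failure.
    public int readLoop() {
        int count = 0;
        try {
            while (true) {
                dataIn.readByte();  // throws EOFException once the data runs out
                count++;
            }
        } catch (Exception e) {
            lastFailure = e;
            e.printStackTrace();    // show why the loop broke before shutting down
            shutdown();
        }
        return count;
    }

    // Closes the stream once; repeated calls are ignored.
    private synchronized void shutdown() {
        if (shuttingDown) return;
        shuttingDown = true;
        try { dataIn.close(); } catch (IOException e) { }
    }

    public boolean isShuttingDown() { return shuttingDown; }

    public Exception getLastFailure() { return lastFailure; }

    public static void main(String[] args) {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        LoopDiagnostics conn = new LoopDiagnostics(in);
        System.out.println("bytes read before failure: " + conn.readLoop());
    }
}
```

The stack trace goes to standard error, so the exception type (for example an EOFException versus a SocketException) is visible right next to the "is shutting down" message.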

Sean

On Aug 26, 2013, at 9:23 AM, Ralf Buschermöhle wrote:

> On the server side it's a "normal" disconnect (I modified the output a little)
> 
> Slave is shutting down....
> 2013-08-26 13:14:16.758 Slave exits, connected slaves: 129
> 
> and everything should be closed. Snipped from SlaveConnection.java
> 
> protected void shutdown( final EvolutionState state )
>        {
>        // prevent me from hitting this multiple times
>        synchronized(shutDownLock) { if (shuttingDown) return; else shuttingDown = true; }
> 
>        // don't want to miss any of these so we'll wrap them individually
>        try { dataOut.writeByte(Slave.V_SHUTDOWN); } catch (Exception e) { }  // exception, not IOException, because JZLib throws some array exceptions
>        try { dataOut.flush(); } catch (Exception e) { }
>        try { dataOut.close(); } catch (Exception e) { }
>        try { dataIn.close(); } catch (Exception e) { }
>        try { evalSocket.close(); } catch (IOException e) { }
> 
>        state.output.systemMessage( SlaveConnection.this.toString() + " is shutting down...." );
> 
> Or?
> 
> On Aug 26, 2013, at 14:58, Ralf Buschermöhle wrote:
> 
>> BTW2: Nothing on the server ... it seems that the master does not know about the connection attempt.
>> 
>> Running out of ports?
>> 
>> On Aug 26, 2013, at 14:54, Ralf Buschermöhle wrote:
>> 
>>> Hi,
>>> 
>>> after the master has successfully connected a few thousand slaves, it can't connect any more, and each new slave receives
>>> 
>>> FATAL ERROR (EvolutionState not created yet): java.net.SocketException: Connection reset
>>> 
>>> Long evolution runs are not the problem, though. Only the number of connects (a few thousand) is.
>>> 
>>> Any suggestions where to start digging? 
>>> 
>>> Greetings,
>>> 
>>> 	Ralf
>>> 
>>> P.S. Hm ... btw: the clients disconnect (brutally) by being killed (they run on a cluster with a wall-time limit).
>>> 
>> 
>