Hi,
On Aug 26, 2013, at 15:47, Sean Luke wrote:
> This sounds a *lot* like your process is running out of ports; that is, it's probably not an ECJ error or even a Java problem but rather an OS problem. But at just 129 slaves?
Those are just the nodes currently running. Before that, the master had already handled a few thousand connections from a cluster successfully.
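
If the master really is running out of ports or socket descriptors, one would expect the number of open file descriptors in the master JVM to grow with every slave that connects and not to drop again after slaves are killed. A minimal sketch for logging that number, assuming a HotSpot/OpenJDK JVM on a Unix-like OS where `com.sun.management.UnixOperatingSystemMXBean` is available; `FdUsage` is a hypothetical helper and not part of ECJ:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of ECJ: reports how many file descriptors
// (sockets included) the current JVM holds, where the JVM exposes that.
final class FdUsage {
    /** Returns {open, max}, or {-1, -1} if the Unix-specific bean is unavailable. */
    static long[] sample() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] { unix.getOpenFileDescriptorCount(),
                                unix.getMaxFileDescriptorCount() };
        }
        return new long[] { -1L, -1L };
    }

    /** One-line summary suitable for a log message. */
    static String describe() {
        long[] s = sample();
        return s[0] < 0
            ? "file descriptor counts unavailable on this JVM"
            : "open file descriptors: " + s[0] + " of " + s[1];
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Printing `FdUsage.describe()` on the master whenever a slave connects or exits would show whether the count approaches the limit before the resets begin.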
> That seems really low to me. We regularly handle rather more than that.
>
> What OS
Master side: FreeBSD; clients on Debian / Scientific Linux (AFAIK all latest stable versions)
> and Java version are you running?
Master:
OpenJDK Runtime Environment (build 1.6.0_32-b27), 64-Bit Server VM (build 20.0-b12, mixed mode)
Client Setup 1:
java version "1.6.0_33"
Java(TM) SE Runtime Environment (build 1.6.0_33-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03, mixed mode)
Client Setup 2:
openjdk version "1.6.0_32"
OpenJDK Runtime Environment (build 1.6.0_32-b27)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
> And which ECJ version?
21
>
> shutdown() only happens if either the readLoop or the writeLoop of the underlying slave connection has had its connection broken. What's the exception being thrown? In writeLoop and readLoop, change
>
> shutdown(state);
>
> to
>
> e.printStackTrace(); shutdown(state);
Ok ... just changed the code so that the exception trace will be printed.
I will report back ... the current computation needs approx. 2 hours to reach the next checkpoint; after that I'll restart the computation nodes.
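
The suggested change boils down to recording the cause before tearing the connection down. A self-contained sketch of that pattern, using a hypothetical `LoggedShutdownDemo` class with an in-memory stream in place of a socket. It is not ECJ's `SlaveConnection`; only the run-once guard in `shutdown()` is modeled on the snippet quoted further down:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical stand-in for a connection's read loop; not ECJ code.
final class LoggedShutdownDemo {
    private final Object shutDownLock = new Object();
    private boolean shuttingDown = false;
    int shutdownCalls = 0;       // how often the teardown body actually ran
    Exception lastError = null;  // cause that ended the read loop

    void readLoop(DataInputStream in) {
        try {
            while (true) {
                in.readByte();   // throws EOFException once the stream ends
            }
        } catch (IOException e) {
            lastError = e;
            e.printStackTrace(); // record the cause first ...
            shutdown();          // ... then tear the connection down
        }
    }

    void shutdown() {
        // same idea as the guard in the quoted snippet: run the body only once
        synchronized (shutDownLock) {
            if (shuttingDown) return;
            shuttingDown = true;
        }
        shutdownCalls++;
    }

    public static void main(String[] args) {
        LoggedShutdownDemo demo = new LoggedShutdownDemo();
        demo.readLoop(new DataInputStream(new ByteArrayInputStream(new byte[] {1, 2})));
        demo.shutdown();         // second call is a no-op
        System.out.println("cause: " + demo.lastError + ", shutdowns: " + demo.shutdownCalls);
    }
}
```

With a real socket the logged exception type (end of stream, connection reset, timeout, ...) is what would tell the cases apart.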
Thanks,
Ralf
> On Aug 26, 2013, at 9:23 AM, Ralf Buschermöhle wrote:
>
>> On the server side it's a "normal" disconnect (I modified the output a little):
>>
>> Slave is shutting down....
>> 2013-08-26 13:14:16.758 Slave exits, connected slaves: 129
>>
>> and everything should be closed. Snippet from SlaveConnection.java:
>>
>> protected void shutdown( final EvolutionState state )
>> {
>>     // prevent me from hitting this multiple times
>>     synchronized(shutDownLock) { if (shuttingDown) return; else shuttingDown = true; }
>>
>>     // don't want to miss any of these so we'll wrap them individually
>>     try { dataOut.writeByte(Slave.V_SHUTDOWN); } catch (Exception e) { } // exception, not IOException, because JZLib throws some array exceptions
>>     try { dataOut.flush(); } catch (Exception e) { }
>>     try { dataOut.close(); } catch (Exception e) { }
>>     try { dataIn.close(); } catch (Exception e) { }
>>     try { evalSocket.close(); } catch (IOException e) { }
>>
>>     state.output.systemMessage( SlaveConnection.this.toString() + " is shutting down...." );
>>
>> Or?
>>
>> On Aug 26, 2013, at 14:58, Ralf Buschermöhle wrote:
>>
>>> BTW2: Nothing on the server ... it seems that the master does not know about the connection attempt.
>>>
>>> Running out of ports?
>>>
>>> On Aug 26, 2013, at 14:54, Ralf Buschermöhle wrote:
>>>
>>>> Hi,
>>>>
>>>> After the master has successfully connected a few thousand slaves, it can't accept any more connections, and each new slave receives
>>>>
>>>> FATAL ERROR (EvolutionState not created yet): java.net.SocketException: Connection reset
>>>>
>>>> Long evolution runs are not the problem, though; only the number of connects (a few thousand) is.
>>>>
>>>> Any suggestions where to start digging?
>>>>
>>>> Greetings,
>>>>
>>>> Ralf
>>>>
>>>> P.S. Hm ... btw: the clients disconnect brutally; they get killed (they run on a cluster with a wall-time limit).
>>>>
>>>
>>