I will analyse some memory dumps to identify possible causes. Maybe it's only related to threads ...
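
As a first check (independent of the memory dumps), I will also log the live thread count on the master while slaves connect and disconnect. Roughly something like the following, started as a daemon thread inside the master process - the class name is only illustrative:

	import java.lang.management.ManagementFactory;
	import java.lang.management.ThreadMXBean;

	// Rough sketch: periodically log the number of live threads in the master JVM.
	// If the SlaveConnection reader/writer threads are not released when a slave
	// disconnects, this count should grow with every connect/disconnect cycle
	// until "unable to create new native thread" is thrown.
	public class ThreadCountLogger implements Runnable {
	    public void run() {
	        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
	        while (!Thread.currentThread().isInterrupted()) {
	            System.err.println("live threads: " + threads.getThreadCount());
	            try { Thread.sleep(5000); } catch (InterruptedException e) { return; }
	        }
	    }
	}

	// started somewhere in the master, e.g.:
	//     Thread t = new Thread(new ThreadCountLogger(), "ThreadCountLogger");
	//     t.setDaemon(true);
	//     t.start();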

On Aug 26, 2013, at 19:02, Ralf Buschermöhle <[log in to unmask]> wrote:

> Hi,
> 
> sorry, it seems I missed the exception in the log. After around 2,000 connects & disconnects I receive:
> 
> Exception in thread "SlaveMonitor::    " java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:657)
> 	at ec.eval.SlaveConnection.buildThreads(SlaveConnection.java:161)
> 	at ec.eval.SlaveConnection.<init>(SlaveConnection.java:82)
> 	at ec.eval.SlaveMonitor.registerSlave(SlaveMonitor.java:220)
> 	at ec.eval.SlaveMonitor$1.run(SlaveMonitor.java:192)
> 	at java.lang.Thread.run(Thread.java:679)
> 
> In the test case I started and shut down the slaves almost immediately (after 2s), but I have received the same exception in the logs of the cluster nodes, where there are usually hours between starting and stopping the slaves.
> 
> Greetings,
> 
> 	Ralf
> 
> 
> P.S. Just for completeness 
> 
> The other exceptions (regarding the sockets) are:
> 
> java.net.SocketException: Connection reset
> 	at java.net.SocketInputStream.read(SocketInputStream.java:185)
> 	at java.net.SocketInputStream.read(SocketInputStream.java:199)
> 	at java.io.DataInputStream.readByte(DataInputStream.java:265)
> 	at ec.eval.SlaveConnection.readLoop(SlaveConnection.java:260)
> 	at ec.eval.SlaveConnection$1.run(SlaveConnection.java:150)
> 
> at "dataOut.writeByte(Slave.V_SHUTDOWN)"
> 
> java.net.SocketException: Broken pipe
> 	at java.net.SocketOutputStream.socketWrite0(Native Method)
> 	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
> 	at java.net.SocketOutputStream.write(SocketOutputStream.java:132)
> 	at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
> 	at ec.eval.SlaveConnection.shutdown(SlaveConnection.java:99)
> 	at ec.eval.SlaveConnection.readLoop(SlaveConnection.java:327)
> 	at ec.eval.SlaveConnection$1.run(SlaveConnection.java:150)
> 
> On Aug 26, 2013, at 17:20, Sean Luke <[log in to unmask]> wrote:
> 
>> On Aug 26, 2013, at 10:27 AM, Ralf Buschermöhle wrote:
>> 
>>> These are just the running nodes. Previously there have been a few thousand connections from a cluster (handled successfully).
>> 
>> Some more thoughts.  Try executing the following command on your BSD box to see how many sockets and files (combined) you can have open at one time:
>> 
>> 	sysctl kern.maxfilesperproc
>> 
>> On my Mac (a BSD box) I get around 10K.
>> 
>> I wonder if ECJ isn't properly closing the sockets, so that you're hitting a socket limit by repeatedly adding and removing clients.  The relevant code looks correct to me, though.
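>> 
>> One quick way to check would be to log the open file descriptor count of the master JVM while slaves connect and disconnect.  A rough sketch, not ECJ-specific (it relies on the com.sun.management extension available on Sun/Oracle-style JVMs on Unix; class and method names are just illustrative):
>> 
>> 	import java.lang.management.ManagementFactory;
>> 	import com.sun.management.UnixOperatingSystemMXBean;
>> 
>> 	// Rough sketch: call this periodically from inside the master process.
>> 	// If sockets are not being closed, the open-descriptor count will climb
>> 	// toward the kern.maxfilesperproc limit instead of returning to its
>> 	// baseline after each slave disconnects.
>> 	public class FdCount {
>> 	    public static void logOpenFds() {
>> 	        Object os = ManagementFactory.getOperatingSystemMXBean();
>> 	        if (os instanceof UnixOperatingSystemMXBean) {
>> 	            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
>> 	            System.err.println("open fds: " + unix.getOpenFileDescriptorCount()
>> 	                + " / max: " + unix.getMaxFileDescriptorCount());
>> 	        }
>> 	    }
>> 	}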
>> 
>> Sean
>