I will analyse some memory dumps to identify possible causes. Maybe it's only related to threads ...
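Before that, I will add a small thread-count logger to the master JVM to see whether the number of live threads really grows with every connect/disconnect. This is only a minimal sketch using the standard ThreadMXBean; the class name, the interval and the log format are placeholders, not ECJ code:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper (not part of ECJ): periodically logs the JVM's
// live/peak/started thread counts; a leak shows up as a growing "live" value.
public final class ThreadCountLogger {
    public static void start(final long intervalMillis) {
        final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Thread t = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    System.err.printf("threads: live=%d peak=%d started=%d%n",
                        threads.getThreadCount(),
                        threads.getPeakThreadCount(),
                        threads.getTotalStartedThreadCount());
                    try { Thread.sleep(intervalMillis); }
                    catch (InterruptedException e) { return; }
                }
            }
        });
        t.setDaemon(true);   // don't keep the JVM alive just for the logging thread
        t.start();
    }
}

Calling ThreadCountLogger.start(5000) once during master setup should be enough; if "live" keeps climbing towards the OS limit, the OutOfMemoryError below is just the symptom of a thread leak.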
On Aug 26, 2013, at 19:02, Ralf Buschermöhle <[log in to unmask]> wrote:
> Hi,
>
> sorry, it seems I missed the exception in the log. After around 2000 connects & disconnects I receive:
>
> Exception in thread "SlaveMonitor:: " java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:657)
> at ec.eval.SlaveConnection.buildThreads(SlaveConnection.java:161)
> at ec.eval.SlaveConnection.<init>(SlaveConnection.java:82)
> at ec.eval.SlaveMonitor.registerSlave(SlaveMonitor.java:220)
> at ec.eval.SlaveMonitor$1.run(SlaveMonitor.java:192)
> at java.lang.Thread.run(Thread.java:679)
>
> In the test case I started and shut down the slaves immediately (after 2 s), but I received the same exception in the logs of the cluster nodes, where there are usually hours between starting and stopping the slaves.
>
> Greetings,
>
> Ralf
>
>
> P.S. Just for completeness:
>
> The other exceptions (regarding the sockets) are:
>
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:185)
> at java.net.SocketInputStream.read(SocketInputStream.java:199)
> at java.io.DataInputStream.readByte(DataInputStream.java:265)
> at ec.eval.SlaveConnection.readLoop(SlaveConnection.java:260)
> at ec.eval.SlaveConnection$1.run(SlaveConnection.java:150)
>
> at "dataOut.writeByte(Slave.V_SHUTDOWN)"
>
> java.net.SocketException: Broken pipe
> at java.net.SocketOutputStream.socketWrite0(Native Method)
> at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
> at java.net.SocketOutputStream.write(SocketOutputStream.java:132)
> at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
> at ec.eval.SlaveConnection.shutdown(SlaveConnection.java:99)
> at ec.eval.SlaveConnection.readLoop(SlaveConnection.java:327)
> at ec.eval.SlaveConnection$1.run(SlaveConnection.java:150)
>
> On Aug 26, 2013, at 17:20, Sean Luke <[log in to unmask]> wrote:
>
>> On Aug 26, 2013, at 10:27 AM, Ralf Buschermöhle wrote:
>>
>>> These are just the currently running nodes. Before that, a few thousand connections from a cluster had already been handled successfully.
>>
>> One more thing to check. Try executing the following command on your BSD box to see how many files and sockets (combined) a single process can have open at one time:
>>
>> sysctl kern.maxfilesperproc
>>
>> On my Mac (a BSD box) I get around 10K.
>>
>> I wonder if ECJ isn't properly closing the sockets, so that you're hitting a socket limit by repeatedly adding and removing clients. The relevant code looks correct to me, though.
>>
>> Sean
>
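P.P.S. To test the socket idea from inside the master process (in addition to sysctl), I will also log the open file descriptor count of the JVM. Again only a minimal sketch; it assumes a HotSpot/OpenJDK JVM on a Unix-like system where the OS bean implements com.sun.management.UnixOperatingSystemMXBean:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical check (not part of ECJ): reads the process's open and maximum
// file descriptor counts; sockets that are never closed show up here.
public final class FdCountLogger {
    public static void log() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            System.err.printf("fds: open=%d max=%d%n",
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount());
        }
    }
}

If the open count grows with every registered slave and never drops after a disconnect, that would point at the sockets; if it stays flat while the thread count grows, it is the threads created in buildThreads().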