OK - I think I have the explanation. It's a timing issue ;)
I use a large number of slaves - up to 600 - which sometimes disconnect and reconnect in quick succession (200 disconnects ... and 200 new connects ... nearly at once).
Disconnecting takes quite some time, in particular unregistering the slaves and, even more so, rescheduling their jobs.
Currently the master node has only 8 GB of memory, and I use roughly 4 GB for the Java VM. Thus disconnecting and connecting a large number of clients at once runs into memory problems.
Is it possible to speed up the disconnects, apart from the obvious solution of adding hardware?
On Aug 26, 2013, at 23:15, Sean Luke <[log in to unmask]> wrote:
> Where to start looking:
> When a slave makes an incoming connection, the SlaveMonitor creates a SlaveConnection object to handle it. This object has two threads: a read thread and a write thread. When this object is destroyed because the connection has broken (via the shutdown() method), these threads are interrupted, which causes them to drop out of their loops. But they're not joined or set to null -- it's assumed that the threads will die on their own and that the SlaveConnection object will get GC'd, so null-setting is unnecessary.
> Perhaps one of these assumptions is wrong, so we're piling up threads. Try changing the SlaveConnection.shutdown() synchronization section to look like this:
> // notify my threads now that I've closed stuff in case they're still waiting
> // stuff added by Sean
> reader = writer = null; // let GC
> Notice I added in joining and null-setting. This will probably have no effect, but if things magically start working it tells us that a thread isn't ever finishing or isn't being set to null. From there you might try to figure out why.
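The interrupt, join, and null-setting pattern described above can be sketched in isolation. This is only an illustration and not ECJ's code: ConnectionSketch, startWorker, and the timeoutMillis parameter are hypothetical names, and the worker loop merely sleeps where the real threads would block on socket I/O. The bounded join is an assumption added here so that a thread which never finishes shows up as a false return value rather than a hang inside the synchronized section.

```java
// Hypothetical stand-in for a connection object owning a reader and a writer thread.
// This is NOT ECJ's SlaveConnection; all names here are assumptions for illustration.
final class ConnectionSketch {
    private Thread reader;
    private Thread writer;

    ConnectionSketch() {
        reader = startWorker("sketch-reader");
        writer = startWorker("sketch-writer");
    }

    // Each worker blocks until interrupted, standing in for a blocking read/write loop.
    private static Thread startWorker(String name) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(1000);
                }
            } catch (InterruptedException e) {
                // Interrupted: drop out of the loop so the thread can end.
            }
        }, name);
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Interrupts both threads, waits up to timeoutMillis (must be > 0) for each,
    // then drops the references. Returns true only if both threads have terminated.
    synchronized boolean shutdown(long timeoutMillis) {
        Thread r = reader;
        Thread w = writer;
        if (r == null || w == null) {
            return true; // already shut down
        }
        r.interrupt();
        w.interrupt();
        boolean interrupted = false;
        try {
            r.join(timeoutMillis);
            w.join(timeoutMillis);
        } catch (InterruptedException e) {
            interrupted = true;
        }
        boolean bothDead = !r.isAlive() && !w.isAlive();
        reader = writer = null; // let the Thread objects be collected
        if (interrupted) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
        return bothDead;
    }
}
```

If shutdown returns false in a test like this, one of the threads is ignoring the interrupt, which would match the "a thread isn't ever finishing" case mentioned above.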
> On Aug 26, 2013, at 3:17 PM, Ralf Buschermöhle wrote:
>> I will analyse some memory dumps to identify possible causes. Maybe it's only related to threads ...
>> On Aug 26, 2013, at 19:02, Ralf Buschermöhle <[log in to unmask]> wrote:
>>> Sorry, it seems I missed the exception in the log. After roughly 2000 connects and disconnects I receive:
>>> Exception in thread "SlaveMonitor:: " java.lang.OutOfMemoryError: unable to create new native thread
>>> at java.lang.Thread.start0(Native Method)
>>> at java.lang.Thread.start(Thread.java:657)
>>> at ec.eval.SlaveConnection.buildThreads(SlaveConnection.java:161)
>>> at ec.eval.SlaveConnection.<init>(SlaveConnection.java:82)
>>> at ec.eval.SlaveMonitor.registerSlave(SlaveMonitor.java:220)
>>> at ec.eval.SlaveMonitor$1.run(SlaveMonitor.java:192)
>>> at java.lang.Thread.run(Thread.java:679)
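An "unable to create new native thread" error usually means the process could not obtain another OS thread, rather than that the Java heap is full, so counting live threads during a burst of connects and disconnects is one way to check whether threads are piling up. The following is a minimal sketch under that assumption: ThreadCensus and countLive are hypothetical names, and the name prefix to pass in depends on how the threads in question are actually named, which the trace above does not show for the reader and writer threads.

```java
// Hypothetical diagnostic helper; the class and method names are assumptions.
final class ThreadCensus {
    // Counts live threads in this JVM whose name starts with the given prefix.
    static int countLive(String prefix) {
        int count = 0;
        // getAllStackTraces() returns a map keyed by every live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }
}
```

Logging this count periodically on the master, for example with an empty prefix to count all threads, should show whether the number keeps growing after slaves disconnect or returns to a baseline.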