On Aug 27, 2013, at 21:39, Sean Luke wrote:
> At present ECJ's model is that once a slave connects to a master, it stays there indefinitely: typically it is the master which disconnects from the slave. The only way for the slave to disconnect from the master is to just drop the connection (which is usually due to an error, like the slave bombing or something).
> I suppose you could hack up some signal that the slave could send the ECJ master to let it know that the slave is disconnecting; but before you start mucking about with the evaluation procedure, you should first ask yourself WHY a slave would disconnect. Is there some real, honest-to-goodness reason why you need to have slaves hook up and then drop all the time?
Actually, in my case the slave is interrupted when it hits the wall-time limit of the scheduler (of the Open Grid). In this case the client does not know what hit it ;)
The reason I have (many) small jobs is that they are often scheduled fast(er), since they fill the gaps.
> At any rate, if the slave cuts connection, it triggers an exception in the SlaveConnection which calls shutdown(...), and as you can see from shutdown(...), it's fairly fast.
Yes ... that's a strange thing ... sometimes it's nearly instant ... sometimes it takes a while (like 30 seconds for one node - hm ... I think that is nearly the time of one generation step). I did not dig deeper into the algorithms ... but maybe I can influence the "way" the slaves are processed without losing speed?
> Maybe you could do the interrupt and join before you reschedule the jobs; this might be faster but I'm not sure if the logic would be right.
I was thinking the same, but I am (also) not sure about the logic.
As a first fix I will add some hardware - but the cluster is large ... and I am curious to see its power ;)
Thanks as always,
> On Aug 27, 2013, at 9:14 AM, Ralf Buschermöhle wrote:
>> ok - I think I got the explanation. It's a temporal "thing" ;)
>> I use a large number of slaves - up to 600 - disconnecting and connecting sometimes fast (200 disconnects ... and 200 new connects ... nearly instantly).
>> Disconnecting takes quite some time (in particular unregistering slaves and, even more so, rescheduling jobs).
>> Currently the master node has only 8 GB of memory - and I use roughly 4 GB for the Java VM. Thus disconnecting and connecting a large number of clients runs into memory problems.
>> Is it possible to speed up the disconnects, apart from the obvious solution of adding some hardware?
>> On Aug 26, 2013, at 23:15, Sean Luke wrote:
>>> Where to start looking:
>>> When a slave makes an incoming connection, the SlaveMonitor creates a SlaveConnection object to handle it. This object has two threads: a read thread and a write thread. When this object is destroyed because the connection has broken (via the shutdown() method), these threads are interrupted, which causes them to drop out of their loops. But they're not joined or set to null -- it's assumed that the threads will die on their own and that the SlaveConnection object will get GC'd, so null-setting is unnecessary.
>>> Perhaps one of these assumptions is wrong, so we're piling up threads. Try changing the SlaveConnection.shutdown() synchronization section to look like this:
>>> // notify my threads now that I've closed stuff in case they're still waiting
>>> // stuff added by Sean
>>> reader = writer = null; // let GC
>>> Notice I added in joining and null-setting. This will probably have no effect, but if things magically start working it tells us that a thread isn't ever finishing or isn't being set to null. From there you might try to figure out why.
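The join-and-null-set idea described above can be tried in isolation. What follows is a minimal sketch, not ECJ's actual code: `ConnectionSketch`, its thread names, and the `joinTimeoutMillis` parameter are hypothetical stand-ins for a connection object that owns a reader and a writer thread.

```java
/** Hypothetical stand-in for a connection that owns a reader and a writer thread. */
final class ConnectionSketch {
    private final Object lock = new Object();
    private Thread reader;
    private Thread writer;

    ConnectionSketch() {
        reader = new Thread(this::waitUntilInterrupted, "sketch-reader");
        writer = new Thread(this::waitUntilInterrupted, "sketch-writer");
        reader.start();
        writer.start();
    }

    // Stand-in for a blocking read/write loop: sleeps until interrupted.
    private void waitUntilInterrupted() {
        try {
            while (true) {
                Thread.sleep(1000);
            }
        } catch (InterruptedException e) {
            // Interrupted: drop out of the loop and let the thread end.
        }
    }

    /**
     * Interrupts both threads, waits up to joinTimeoutMillis for each to end,
     * and drops the references. Returns true if both threads have terminated.
     */
    boolean shutdown(long joinTimeoutMillis) {
        Thread r;
        Thread w;
        synchronized (lock) {
            r = reader;
            w = writer;
            reader = writer = null; // let GC
        }
        if (r == null || w == null) {
            return true; // already shut down
        }
        r.interrupt();
        w.interrupt();
        try {
            // Note: a timeout of 0 would mean "wait forever" for Thread.join.
            r.join(joinTimeoutMillis);
            w.join(joinTimeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        }
        return !r.isAlive() && !w.isAlive();
    }
}
```

Joining with a timeout rather than indefinitely keeps one stuck thread from stalling the whole shutdown; a `false` return would then hint at a thread that never finishes. The joins happen outside the lock here on the assumption that a worker might need that lock in order to exit.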
>>> On Aug 26, 2013, at 3:17 PM, Ralf Buschermöhle wrote:
>>>> I will analyse some memory dumps to identify possible causes. Maybe it's only related to threads ...
>>>> On Aug 26, 2013, at 19:02, Ralf Buschermöhle wrote:
>>>>> Sorry, it seems I missed the exception in the log. After roughly 2000 connects and disconnects I receive:
>>>>> Exception in thread "SlaveMonitor:: " java.lang.OutOfMemoryError: unable to create new native thread
>>>>> at java.lang.Thread.start0(Native Method)
>>>>> at java.lang.Thread.start(Thread.java:657)
>>>>> at ec.eval.SlaveConnection.buildThreads(SlaveConnection.java:161)
>>>>> at ec.eval.SlaveConnection.<init>(SlaveConnection.java:82)
>>>>> at ec.eval.SlaveMonitor.registerSlave(SlaveMonitor.java:220)
>>>>> at ec.eval.SlaveMonitor$1.run(SlaveMonitor.java:192)
>>>>> at java.lang.Thread.run(Thread.java:679)
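An `OutOfMemoryError: unable to create new native thread` usually points at a limit on native threads (or on the memory for their stacks) rather than at the Java heap, so counting live threads over time is a cheap first check. Below is a minimal sketch, assuming the threads of interest share a name prefix; `ThreadCensus` and whatever prefix is passed to it are hypothetical and not part of ECJ.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical diagnostic helper: counts live threads in the current JVM. */
final class ThreadCensus {
    private ThreadCensus() {}

    /** Returns the number of live threads whose name starts with the given prefix. */
    static int countLive(String prefix) {
        int count = 0;
        // getAllStackTraces() returns one map entry per live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    /** Returns a thread-name -> count map, useful for spotting which kind of thread piles up. */
    static Map<String, Integer> countByName() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getName(), 1, Integer::sum);
        }
        return counts;
    }
}
```

Logging `ThreadCensus.countLive(...)` after each batch of disconnects would show whether the number keeps growing; if it does, some threads are not exiting after being interrupted. A thread dump taken with the JDK's `jstack` tool gives the same information from outside the process.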