LISTSERV - ECJ-INTEREST-L Archives

August 2013

ECJ-INTEREST-L@LISTSERV.GMU.EDU

Subject:	Re: master / slave
From:	Ralf Buschermöhle <[log in to unmask]>
Reply To:	ECJ Evolutionary Computation Toolkit <[log in to unmask]>
Date:	Mon, 26 Aug 2013 16:27:12 +0200
Content-Type:	multipart/signed
Parts/Attachments:	text/plain (3557 bytes) , signature.asc (494 bytes)

Hi,

On Aug 26, 2013, at 15:47, Sean Luke <[log in to unmask]> wrote:

> This sounds a *lot* like your process is running out of ports; that is, it's probably not an ECJ error or even a Java problem but rather an OS problem.  But at just 129 slaves?  

Those are just the nodes currently running. Before that, a few thousand connections from a cluster had been handled successfully.
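
One way to test the "running out of ports" idea from inside the master JVM is to watch its open file descriptor count, since every socket holds one descriptor. The following is only a sketch under assumptions: `DescriptorProbe` is a hypothetical helper (not part of ECJ), it relies on the JVM-specific `com.sun.management.UnixOperatingSystemMXBean` (available on HotSpot/OpenJDK on Unix-like systems), and it uses loopback connections merely to show the count rising while sockets stay open.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; not part of ECJ. */
final class DescriptorProbe {

    /** Open file descriptors of this JVM, or -1 if the JVM does not expose them. */
    static long openDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1;
    }

    /** Per-process descriptor limit, or -1 if the JVM does not expose it. */
    static long maxDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
        }
        return -1;
    }

    /**
     * Opens the given number of loopback connections, keeps both ends open,
     * and returns how many descriptors that added (or -1 if unknown).
     */
    static long descriptorsHeldBy(int connections) throws IOException {
        List<Socket> held = new ArrayList<Socket>();
        ServerSocket server = new ServerSocket(0);  // any free port
        try {
            long before = openDescriptors();
            for (int i = 0; i < connections; i++) {
                held.add(new Socket("127.0.0.1", server.getLocalPort()));  // client end
                held.add(server.accept());                                  // accepting end
            }
            long during = openDescriptors();
            return (before < 0 || during < 0) ? -1 : during - before;
        } finally {
            for (Socket s : held) {
                try { s.close(); } catch (IOException e) { /* best effort */ }
            }
            server.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("open: " + openDescriptors() + " / max: " + maxDescriptors());
        System.out.println("added by 10 held connections: " + descriptorsHeldBy(10));
    }
}
```

If the open count on the master keeps climbing toward the maximum as slaves come and go, connections are probably not being released; if it stays flat, the limit is more likely elsewhere (for example in the OS network stack).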

> That seems really low to me. We regularly handle rather more than that.
> 
> What OS

Master side: FreeBSD; clients on Debian / Scientific Linux (AFAIK all latest stable versions)

> and Java version are you running?  

Master:
OpenJDK Runtime Environment (build 1.6.0_32-b27), 64-Bit Server VM (build 20.0-b12, mixed mode)

Client Setup 1:
java version "1.6.0_33"
Java(TM) SE Runtime Environment (build 1.6.0_33-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03, mixed mode)

Client Setup 2:
openjdk version "1.6.0_32"
OpenJDK Runtime Environment (build 1.6.0_32-b27)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

> And which ECJ version?

21

> 
> shutdown() only happens if either the readLoop or the writeLoop of the underlying slave connection has had its connection broken.  What's the exception being thrown?  In writeLoop and readLoop, change 
> 
> 	shutdown(state);
> 
> to
> 
> 	e.printStackTrace(); shutdown(state);

OK ... I just changed the code so that the stack trace is printed.

I will report back ... the current computation needs approx. 2 hours before hitting the next checkpoint and then I'll restart the computation nodes.
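
For readers following along, the change amounts to reporting the exception before tearing the connection down. This is a minimal sketch only: `ConnectionSketch` is a hypothetical stand-in, not ECJ's actual `SlaveConnection`, and the failure is simulated rather than coming from a real socket.

```java
import java.io.IOException;

/** Hypothetical stand-in for a slave connection; not ECJ's SlaveConnection. */
final class ConnectionSketch {
    private final Object shutDownLock = new Object();
    private boolean shuttingDown = false;
    private Throwable lastFailure = null;

    /** Simulated read loop that fails the way a dropped socket might. */
    public void readLoop() {
        try {
            // Simulated broken connection; a real loop would read from the socket here.
            throw new IOException("Connection reset");
        } catch (Exception e) {
            lastFailure = e;
            e.printStackTrace();  // say why the loop ended ...
            shutdown();           // ... and only then shut the connection down
        }
    }

    /** Idempotent shutdown, guarded the same way as in the quoted snippet. */
    public void shutdown() {
        synchronized (shutDownLock) {
            if (shuttingDown) return;
            shuttingDown = true;
        }
        System.err.println("Slave is shutting down....");
    }

    public boolean isShuttingDown() {
        synchronized (shutDownLock) { return shuttingDown; }
    }

    public Throwable lastFailure() {
        return lastFailure;
    }

    public static void main(String[] args) {
        new ConnectionSketch().readLoop();
    }
}
```

With the trace printed first, the master's log shows the exception type and the call site that triggered each shutdown, rather than only the "shutting down" message.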

Thanks,

	Ralf

> On Aug 26, 2013, at 9:23 AM, Ralf Buschermöhle wrote:
> 
>> On the server side it's a "normal" disconnect (I modified the output a little):
>> 
>> Slave is shutting down....
>> 2013-08-26 13:14:16.758 Slave exits, connected slaves: 129
>> 
>> and everything should be closed. Snippet from SlaveConnection.java:
>> 
>> protected void shutdown( final EvolutionState state )
>>       {
>>       // prevent me from hitting this multiple times
>>       synchronized(shutDownLock) { if (shuttingDown) return; else shuttingDown = true; }
>> 
>>       // don't want to miss any of these so we'll wrap them individually
>>       try { dataOut.writeByte(Slave.V_SHUTDOWN); } catch (Exception e) { }  // exception, not IOException, because JZLib throws some array exceptions
>>       try { dataOut.flush(); } catch (Exception e) { }
>>       try { dataOut.close(); } catch (Exception e) { }
>>       try { dataIn.close(); } catch (Exception e) { }
>>       try { evalSocket.close(); } catch (IOException e) { }
>> 
>>       state.output.systemMessage( SlaveConnection.this.toString() + " is shutting down...." );
>> 
>> Or?
>> 
>> On Aug 26, 2013, at 14:58, Ralf Buschermöhle <[log in to unmask]> wrote:
>> 
>>> BTW2: Nothing on the server ... it seems that the master does not know about the connection attempt.
>>> 
>>> Running out of ports?
>>> 
>>> On Aug 26, 2013, at 14:54, Ralf Buschermöhle <[log in to unmask]> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> After the master has successfully connected a few thousand slaves, it can't accept any more connections, and each new slave receives
>>>> 
>>>> FATAL ERROR (EvolutionState not created yet): java.net.SocketException: Connection reset
>>>> 
>>>> Long evolution runs are not the problem, though; only the number of connects (a few thousand) is.
>>>> 
>>>> Any suggestions where to start digging? 
>>>> 
>>>> Greetings,
>>>> 
>>>> 	Ralf
>>>> 
>>>> P.S. Hm ... BTW: the clients are disconnected (brutally) by being killed (they run on a cluster with a wall-time limit).
>>>> 
>>> 
>>
