Good thoughts, Eric.

However:

1. I could specify multicore VMs on the cloud service, BUT that would waste paid-for CPU. For example: suppose I set Test N Times to 4 and used a 4-core VM. Because of the very substantial variance in simulation run times, cores would sit idle as the evaluations at the shorter end of the duration curve finished. Economic efficiency is one of the key design goals for the project: only pay for cores that you are actively using.

My plan is to use single-core VMs. Interestingly, the cloud services charge "per core", and a 4-core VM costs exactly 4 times as much as a 1-core VM. To minimize paying for unused cores, it sure seems that 1-core VMs make the most sense. I've also heard anecdotally that there is sometimes inadequate CPU<->memory bandwidth available in the multicore VMs, and that could be a choke point for my compute- and memory-access-bound simulator.

Another problem is that it is likely I'd want to gradually increase the Test N Times number as evolution progresses, making efficient core use even more difficult.
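To put a rough number on the idle-core argument above, here is a minimal Python sketch. It assumes run times follow a triangular distribution on 80 to 1000 seconds with a mode of 120 seconds, which gives a mean of 400 seconds; the shape of that distribution is purely my assumption, not measured data. It also assumes all tests start together and the VM is paid for until the slowest one finishes. The function name `idle_fraction` is made up for illustration.

```python
import random


def idle_fraction(n_cores=4, trials=10_000, seed=0):
    """Estimate the share of paid core-time left idle when n_cores tests
    start together on one VM and the VM is held until the slowest finishes."""
    rng = random.Random(seed)
    busy = 0.0  # core-seconds actually spent simulating
    paid = 0.0  # core-seconds billed (all cores, until the last test ends)
    for _ in range(trials):
        # ASSUMPTION: triangular run times on [80, 1000] s with mode 120 s,
        # chosen only because its mean is (80 + 1000 + 120) / 3 = 400 s.
        times = [rng.triangular(80, 1000, 120) for _ in range(n_cores)]
        busy += sum(times)
        paid += n_cores * max(times)
    return 1.0 - busy / paid


if __name__ == "__main__":
    for cores in (1, 2, 4, 8):
        print(cores, "cores: idle share of paid time =",
              round(idle_fraction(cores), 3))
```

With one core the idle share is zero by construction, and it grows as the core count (and so the Test N Times value packed onto one VM) goes up, which is the effect I'm worried about.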

2. I could try to use a large population instead of doing multiple tests, but I hate to restrict my search space for Evolutionary Algorithms that way.

I currently have my own home-brew Python evolver that is parallelized to the extent that it can run multiple evaluations in parallel on a single multi-core machine. It would not be too hard to extend that to multiple systems, BUT it would be nice to get all the other stuff that comes with ECJ.

Perhaps another approach would be to extend (i.e. hack) the single-machine steady-state system to make it asynchronous and write my own distributed evaluation runner.

I don't need any of the fancy distributed evolution that the ECJ slaves can do: just a simple run of a shell-level command, capture and parse of the console output, and return of the Win/Lose/Draw data to the Master. If I did it myself, rather than using sockets for communication (always a potential source of trouble!), I'd use one of the queue services that the Cloud Platform provides.
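A minimal sketch of the kind of worker I have in mind, in Python. Everything here is an assumption for illustration: the names `run_one_test` and `worker` are made up, the simulator is assumed to print WIN, LOSE or DRAW as the last non-empty line of its console output, and the standard library's in-process `queue.Queue` stands in for whatever queue service the cloud platform provides.

```python
import queue
import subprocess
import sys

VALID_OUTCOMES = {"WIN", "LOSE", "DRAW"}


def run_one_test(argv, timeout_s=1200):
    """Run one simulation as an external command and parse its result.

    ASSUMPTION: the last non-empty line of stdout is WIN, LOSE or DRAW.
    Anything else (including a timeout) is reported as ERROR.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "ERROR"
    lines = [ln.strip() for ln in proc.stdout.splitlines() if ln.strip()]
    outcome = lines[-1].upper() if lines else "ERROR"
    return outcome if outcome in VALID_OUTCOMES else "ERROR"


def worker(jobs, results):
    """Pull (individual_id, argv) jobs until a None sentinel arrives,
    and push (individual_id, outcome) back for the Master to collect."""
    while True:
        job = jobs.get()
        if job is None:
            break
        individual_id, argv = job
        results.put((individual_id, run_one_test(argv)))


if __name__ == "__main__":
    # Stand-in "simulator": a Python one-liner that prints an outcome.
    jobs, results = queue.Queue(), queue.Queue()
    jobs.put(("ind-1", [sys.executable, "-c", "print('WIN')"]))
    jobs.put(None)
    worker(jobs, results)
    print(results.get())
```

Since each test is its own job, the Master could queue N jobs per individual and they would spread across N single-core VMs, with the Master tallying outcomes per individual as they come back.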

Anybody have any opinions on how hard it would be to make the ECJ steady-state mechanism asynchronous?

On Tue, Oct 25, 2016 at 2:29 PM, Eric 'Siggy' Scott <[log in to unmask]> wrote:

Jim,

Tis an interesting problem you raise. It's true that the "hack" ECJ's generational algorithms use for re-evaluation doesn't make any sense in a steady-state model.

I don't know how people have handled multiple tests in parallel steady-state EAs before. I don't recall seeing any discussion of it in the literature.

It seems to me that there are two options, though, that could save you the trouble of implementing a distributed multiple-testing scheme:

1. If your evaluation function doesn't exhaust all of a node's resources, run multiple tests in parallel on the same node. This is easy to do inside your implementation of the Problem class.

Your heavy-duty simulations probably eat up all your nodes' processors, though, so this might not help your application.

2. Ramp up the population size. In some cases, given the same computational resources, using a large population can be just as effective at washing out the effects of noise as multiple testing.

You can see if your application falls into this category by using a fixed budget of fitness evaluations and seeing if it makes more progress with a big population, or with multiple testing. If the latter truly works much better, then that's a sign that it could be worth your effort to modify ECJ's steady-state master-slave model to support distributed multiple testing.

Just my two cents. Sean et al. will be more familiar with what it might take to implement the feature itself.

Siggy

On Tue, Oct 25, 2016 at 1:34 PM, Jim Rutt <[log in to unmask]> wrote:
I've been evaluating ECJ for possible use in a large-scale, cloud-computing-based evolutionary computation project for the optimization of AIs in highly complex wargames.

What makes this a hard problem is that:

1. The evaluations are expensive - a mean of 400 seconds per evaluation on one core of a 3.5 GHz processor.
2. The evaluations are noisy - a better AI can still lose to a worse AI, and often does.
3. The evaluation run times also have a large variance, from approximately 80 seconds up to 1000 seconds.

As for evolutionary approaches, I'm leaning toward steady-state EDA-type algorithms as a seemingly good fit for the problem domain.

All was looking good in the evaluation of ECJ until I hit what seems like a fatal problem in the last sentence of section 6.1.6, Noisy Distributed Problems, in the ECJ Owner's Manual:

"There’s no equivalent to this hack in Asynchronous Evolution: you’ll just have to ask a machine to test the individual 5 times."

Unfortunately, that would seem to significantly reduce the ability to fan out evaluations to cut elapsed clock time per evaluation. That in turn would significantly increase "time travel", i.e. where evaluated individuals re-enter a population as candidates for inclusion at a much later time than they were created for evaluation.

Is another hack possible to spread out evaluations where one needs to run multiple tests to get a good-enough estimator of an individual? I might even be willing to do the hacking.

--
Jim Rutt
JPR Ventures

--

Ph.D student in Computer Science, George Mason University
CFO and Web Director, Journal of Mason Graduate Research
http://mason.gmu.edu/~escott8/

===========================
Jim Rutt
JPR Ventures