I like that possible hack; I'll ponder it. Thinking out loud: perhaps we only need a single slave. Slaves seem to be able to hold a buffer of pending jobs. If we hacked the slave high up in its logic, pulled jobs out of the buffer, and dispatched them to distributed evaluators, we might well simplify the logic while maintaining maximum utilization of the cloud VMs. The evaluators could be completely dumb ("run this single job"), while the hacked ECJ slave did the collation of the N tests, run asynchronously by the many evaluators, before returning a result.

One possible problem with a single-slave approach would be if the ECJ Master <-> Slave connection assumes that the slave is FIFO. The high variance in evaluation times strongly implies that we would like evaluation sets to be returned as soon as the N tests per individual are done, irrespective of their position in the job queue, so as to minimize "time travel" and elapsed clock time.

On Sunday, October 30, 2016, Sean Luke wrote:

> Jim, ECJ's asynchronous code does indeed assume a single test. I've not
> modified it to distribute for multiple tests and would have to meditate
> deeply about how to do that correctly. The challenge is that the structure
> doesn't permit simple hacks like the generational evaluator would allow
> [for example, making multiple copies of the same individual, then
> re-merging them]. So this will take time.
>
> Here's a hack we might be able to rig up in the meantime: create an ECJ
> slave which turns around and ships the individual off to N remote sites for
> evaluation, then only returns the individual when it's had its N tests done
> and the fitness has been computed. Then you just run M of these slave
> processes on your local machine, and have them hook up to your main ECJ
> process. You can have lots of them because they're extremely lightweight.
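That fan-out slave could be sketched roughly like this. `RemoteEvaluator` and `runTest()` are made-up names standing in for whatever transport (sockets, a cloud queue service) actually ships the job to a VM; none of this is ECJ API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Rough sketch of the fan-out slave hack: one lightweight process that
// ships an individual to N remote evaluators in parallel and only
// reports a fitness once all N tests are back.
public class FanOutSlave {

    public interface RemoteEvaluator {
        double runTest(String genome) throws Exception; // one test on one cloud VM
    }

    // Dispatch one test per evaluator, wait for all N, return the mean fitness.
    public static double evaluate(String genome, List<RemoteEvaluator> evaluators)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(evaluators.size());
        try {
            List<Future<Double>> tests = new ArrayList<>();
            for (RemoteEvaluator e : evaluators)
                tests.add(pool.submit(() -> e.runTest(genome)));
            double sum = 0;
            for (Future<Double> t : tests)
                sum += t.get();              // tests may finish in any order
            return sum / evaluators.size();
        } finally {
            pool.shutdown();
        }
    }
}
```

The key property is the blocking `t.get()` loop: the slave only reports back once all N tests are in, so the master still sees exactly one fitness per individual.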
>
> Sean
>
> On Oct 26, 2016, at 2:02 PM, Jim Rutt wrote:
>
> > Good thoughts, Eric.
> >
> > However:
> >
> > 1. I could specify multicore VMs on the cloud service, BUT that would
> > be wasteful of paid-for CPU. For example: suppose I set the Test N Times
> > to 4 and used a 4-core VM. Because of the very substantial variance in
> > run times of the simulations, there would be unused cores as the
> > evaluations that ran toward the shorter end of the duration curve finished.
> > Economic efficiency is one of the key design goals for the project: only
> > pay for cores that you are actively running.
> >
> > My plan is to use single-core VMs. Interestingly, the cloud services
> > charge "per core", and a 4-core VM costs exactly 4 times the cost of a
> > 1-core VM. To minimize paying for unused cores, it sure seems that 1-core
> > VMs make the most sense. I've also heard anecdotally that there is
> > sometimes inadequate CPU<->memory bandwidth available in the multicore VMs,
> > and that could be a choke point for my compute- and memory-access-bound
> > simulator.
> >
> > Another problem is that it is likely I'd want to gradually increase the
> > Test N Times number as evolution progresses, making efficient core use even
> > more difficult.
> >
> > 2. I could try to use a large population instead of doing multiple
> > tests, but I hate to restrict my search space for evolutionary algorithms
> > that way.
> >
> > I currently have my own home-brew Python evolver that is parallelized
> > to the extent that it can run multiple evaluations in parallel on a single
> > multi-core machine; it wouldn't be too hard to extend that to multiple
> > systems, BUT it would be nice to get all the other stuff that comes with
> > ECJ.
> >
> > Perhaps another approach would be to extend (i.e., hack) the single-machine
> > steady-state system to make it asynchronous and write my own distributed
> > evaluation runner.
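The wasted-core worry above is easy to make concrete: if a 4-core VM runs a batch of 4 tests in parallel and is billed until the slowest one finishes, the paid-for utilization is the sum of the run times divided by 4 times the longest. A tiny arithmetic sketch (no ECJ; the numbers are only illustrative of the 80-1000 second range quoted earlier):

```java
public class CoreUtilization {
    // Fraction of paid core-seconds actually doing work when
    // `durations.length` tests run in parallel on that many cores and the
    // VM is billed until the slowest test finishes.
    public static double utilization(double[] durations) {
        double sum = 0, max = 0;
        for (double d : durations) {
            sum += d;
            max = Math.max(max, d);
        }
        return sum / (durations.length * max);
    }

    public static void main(String[] args) {
        // Run times spanning the 80s-1000s range described in the thread:
        double[] batch = {80, 200, 400, 1000};
        System.out.println("utilization = " + utilization(batch));
        // prints utilization = 0.42 -- well under half the paid cores
        // do useful work when variance is this high
    }
}
```

With low-variance run times the same formula approaches 1.0, which is why single-core VMs look attractive for this workload.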
> > I don't need any of the fancy distributed evolution that the ECJ slaves
> > can do: just a simple run of a shell-level command, the capture and
> > parsing of the console output, and the return of the Win/Lose/Draw data
> > back to the Master. If I did it myself, rather than using sockets for
> > communication (always a potential source of trouble!), I'd use one of the
> > queue services that the cloud platform provides.
> >
> > Anybody have any opinions on how hard it would be to make the ECJ
> > steady-state mechanism asynchronous?
> >
> > On Tue, Oct 25, 2016 at 2:29 PM, Eric 'Siggy' Scott wrote:
> >
> > Jim,
> >
> > 'Tis an interesting problem you raise. It's true that the "hack" ECJ's
> > generational algorithms use for re-evaluation doesn't make any sense in a
> > steady-state model.
> >
> > I don't know how people have handled multiple tests in parallel
> > steady-state EAs before. I don't recall seeing any discussion of it in the
> > literature.
> >
> > It seems to me that there are two options, though, that could save you
> > the trouble of implementing a distributed multiple-testing scheme:
> >
> > • If your evaluation function doesn't exhaust all of a node's
> > resources, run multiple tests in parallel on the same node. This is easy
> > to do inside your implementation of the Problem class.
> >
> > Your heavy-duty simulations probably eat up all your nodes' processors,
> > though, so this might not help your application.
> >
> > • Ramp up the population size. In some cases, given the same
> > computational resources, using a large population can be just as effective
> > at washing out the effects of noise as multiple testing.
> >
> > You can see if your application falls into this category by using a
> > fixed budget of fitness evaluations and seeing if it makes more progress
> > with a big population or with multiple testing.
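The fixed-budget comparison suggested above rests on a standard statistical fact worth keeping in mind: averaging N independent noisy tests shrinks the spread of the fitness estimate by a factor of √N. A self-contained simulation of that effect with coin-flip (Win/Lose) outcomes, no ECJ involved:

```java
import java.util.Random;

// Demonstrates why multiple testing helps with noisy fitness: the spread
// of the mean of N noisy evaluations shrinks like 1/sqrt(N).
public class NoiseAveraging {
    // Sample standard deviation of the mean of n Win/Lose outcomes,
    // estimated over many independent trials.
    public static double stdevOfMean(int n, int trials, long seed) {
        Random rng = new Random(seed);
        double sum = 0, sumSq = 0;
        for (int t = 0; t < trials; t++) {
            double mean = 0;
            for (int i = 0; i < n; i++)
                mean += rng.nextBoolean() ? 1 : 0;   // one noisy Win/Lose test
            mean /= n;
            sum += mean;
            sumSq += mean * mean;
        }
        double avg = sum / trials;
        return Math.sqrt(sumSq / trials - avg * avg);
    }

    public static void main(String[] args) {
        // Expect roughly 0.5 for N=1 and roughly 0.125 for N=16,
        // i.e. a factor of sqrt(16) = 4.
        System.out.println("N=1:  " + stdevOfMean(1, 20000, 42));
        System.out.println("N=16: " + stdevOfMean(16, 20000, 42));
    }
}
```

Whether a bigger population achieves the same noise tolerance more cheaply than this averaging is exactly what the fixed-budget experiment would measure.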
> > If the latter truly works much better, then that's a sign that it could
> > be worth your effort to modify ECJ's steady-state master-slave model to
> > support distributed multiple testing.
> >
> > Just my two cents. Sean et al. will be more familiar with what it might
> > take to implement the feature itself.
> >
> > Siggy
> >
> > On Tue, Oct 25, 2016 at 1:34 PM, Jim Rutt wrote:
> >
> > I've been evaluating ECJ for possible use in a large-scale, cloud-computing-based
> > evolutionary computation project for the optimization of AIs in highly
> > complex wargames.
> >
> > What makes this a hard problem is that:
> >
> > 1. The evaluations are expensive: a mean of 400 seconds per evaluation
> > on a one-core 3.5 GHz processor.
> > 2. The evaluations are noisy: a better AI can still lose to a worse AI,
> > and often does.
> > 3. The evaluation run times also have a large variance, from
> > approximately 80 seconds up to 1000 seconds.
> >
> > As for evolutionary approaches, I'm leaning toward steady-state EDA-type
> > algorithms as a seemingly good fit for the problem domain.
> >
> > All was looking good in the evaluation of ECJ until what seems like a
> > fatal problem in the last sentence of section 6.1.6, "Noisy Distributed
> > Problems", in the ECJ Owner's Manual:
> >
> > "There's no equivalent to this hack in Asynchronous Evolution: you'll
> > just have to ask a machine to test the individual 5 times."
> >
> > Unfortunately, that would seem to significantly reduce the ability to fan
> > out evaluations to reduce elapsed clock time per evaluation, which would
> > significantly increase "time travel", i.e., where evaluated individuals
> > re-enter a population as candidates for inclusion at a much later time than
> > they were created for evaluation.
> >
> > Is another hack possible to spread out evaluations where one needs to
> > run multiple tests to get a good-enough estimator of an individual? I
> > might even be willing to do the hacking.
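For what it's worth, the "dumb evaluator" described in the thread (run one shell-level command, capture the console output, parse out Win/Lose/Draw) can be very small. A sketch; the `RESULT: WIN|LOSE|DRAW` output line is an assumed format, not a real tool's, and shipping the outcome back to the master is left out:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Runs one simulation as a shell command, captures its console output,
// and parses the outcome from a (hypothetical) "RESULT: ..." line.
public class ShellEvaluator {
    public enum Outcome { WIN, LOSE, DRAW }

    public static Outcome runOnce(String... command) throws Exception {
        Process p = new ProcessBuilder(command)
                .redirectErrorStream(true)   // fold stderr into stdout
                .start();
        Outcome outcome = null;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null)
                if (line.startsWith("RESULT: "))
                    outcome = Outcome.valueOf(
                        line.substring("RESULT: ".length()).trim());
        }
        p.waitFor();
        if (outcome == null)
            throw new IllegalStateException("simulator printed no RESULT line");
        return outcome;
    }
}
```

A cloud-queue variant would pull the command from a work queue and push the `Outcome` to a results queue instead of returning it, which sidesteps the socket plumbing entirely.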
> >
> > --
> > Jim Rutt
> > JPR Ventures
> >
> > --
> > Ph.D student in Computer Science, George Mason University
> > CFO and Web Director, Journal of Mason Graduate Research
> > http://mason.gmu.edu/~escott8/

--
===========================
Jim Rutt
JPR Ventures