LISTSERV - ECJ-INTEREST-L Archives

ECJ-INTEREST-L Archives

January 2008

ECJ-INTEREST-L@LISTSERV.GMU.EDU

	LISTSERV Archives
	ECJ-INTEREST-L Home
	ECJ-INTEREST-L January 2008

	Log In
	Register

	Subscribe or Unsubscribe

	Search Archives
Options:	Use Monospaced Font Show Text Part by Default Show All Mail Headers
Message:	[<< First] [< Prev] [Next >] [Last >>]
Topic:	[<< First] [< Prev] [Next >] [Last >>]
Author:	[<< First] [< Prev] [Next >] [Last >>]
Subject:	Re: random seeding
From:	Michael Wilson <[log in to unmask]>
Reply To:	ECJ Evolutionary Computation Toolkit <[log in to unmask]>
Date:	Mon, 21 Jan 2008 11:05:43 +0000
Content-Type:	text/plain
Parts/Attachments:	text/plain (516 lines)
Sean Luke wrote:
>> If you create several instances in quick succession you get identical
>> RNGs, and System.currentTimeMillis() doesn't have much entropy anyway.
> 
> I think this notion betrays a bit of cargo-cult programming, and it's 
> worth explaining why.

Entropy and distinctiveness are different concepts. In some applications
(e.g. cryptography) the former is important. However that's not the
issue here. The issue is that MersenneTwister has a surprisingly long
warmup time, which requires a decent seed to overcome. Statistical
testing indicated that the ECJ implementation has some serious
shortcomings in this respect.

> - nanoTime is not available on earlier Java versions,

True. But if you're still using a VM from 2003, you have other
problems. The only major group who may not have gotten around to
upgrading yet are OS-X users. nanoTime() in particular is very useful
for performance measurements; you can use it reliably on much smaller
intervals than getCurrentTimeMillis().

 > *and* it's not guaranteed to work at all.

Also true, but not a problem in this case; the theoretical default
behaviour is to wrap another call to getCurrentTimeMillis(), and we're
just one source of entropy down. However I have yet to encounter a
Java platform on which one might reasonably want to do GP research
that does not implement nanoTime(). Our class in particular is fairly
specific anyway; it uses 64-bit algorithms for better performance on
64-bit platforms, which anyone interested in performance will be using
anyway.

> - hashCode is the wrong function.  It should be
 > System.identityHashCode(...).

That won't make any difference in either of these cases, because
neither FastRNG or the MersenneTwister classes override hashCode.
However it is semantically clearer so I will include it anyway, thanks.

> It's also not as random as you think: 
> it's word aligned and sequentially packed on some systems.

It isn't random at all. It's merely moderately unpredictable. Neither
are keypress timings, mouse movements or hard disk activity. But these
are all perfectly good sources of seed entropy (in the relevant
contexts).

>> This produces a reasonable amount of entropy
>> for experimental purposes (we have a couple of 'high quality' seeding
>> methods too but they take longer to run).
> 
> I think this may be mistaken.  It's true that if you used a crummy 
> generator with a small seed (such as c's rand()), providing a good 
> entropy seed would be important.

A good entropy seed is always important to minimise sequence
predictability. You are correct that it is not important for the
statistical quality of a sequence output by a 'warmed up' RNG, but as
I am about to demonstrate, the 'warmup' required to get the ECJ
implementation to output high-quality results can be on the order of
/one million samples/. This is unacceptable regardless of absolute
predictability issues (which /are/ relevant in one of our non-GP
applications that uses this class). Note that inexperienced users
can instantiate large numbers of RNGs (e.g. one per individual, or
one per generation/subpop) and take only a few values (or even just
one value!) from each. A high quality seed algorithm means that the
user will still get statistically decent results (though with much
lower performance of course) even in this worst case.

> But a crummy generator MT is not. Generally speaking, providing seed
 > entropy shouldn't have almost any effect at all.  It's waving a dead
 > chicken.

Demonstrably untrue. Here are some statistics for ten million calls to
nextByte() on new instances;

Samples         = 10000000 bytes
Entropy         = 7.99998333243039 bits per byte
Compressibility = 2.0834462012286892E-4 percent
Chi Square dist = 231.1654912
Chi Square prob = 0.75 (target = 0.5)
Arithmetic mean = 127.4779674 (expected 127.5)
Monte Carlo Pi  = 3.1419420567768226 (error 0.01112184886956054 percent)
Serial CC       = 4.725454754310679E-6 (uncorrelated = 0.0).

ec.util.MersenneTwister

Samples         = 10000000 bytes
Entropy         = 7.977377663332874 bits per byte
Compressibility = 0.282779208339079 percent
Chi Square dist = 314521.36238079995
Chi Square prob = 1.0E-4 (target = 0.5)
Arithmetic mean = 127.4029672 (expected 127.5)
Monte Carlo Pi  = 2.8205531282212513 (error 10.219005478055875 percent)
Serial CC       = 0.9990445683267967 (uncorrelated = 0.0).

java.util.Random

Samples         = 10000000 bytes
Entropy         = 7.999888861410743 bits per byte
Compressibility = 0.0013892323657116457 percent
Chi Square dist = 1537.6350719999996
Chi Square prob = 1.0E-4 (target = 0.5)
Arithmetic mean = 127.7022814 (expected 127.5)
Monte Carlo Pi  = 3.1534812613925047 (error 0.378426139656486 percent)
Serial CC       = 4.836986086728444E-5 (uncorrelated = 0.0).

As you can see, the performance of the ec.util.MersenneTwister class is
fairly abysmal, considerably worse than java.util.Random. Of course this
is an absolute worst case; there are huge long runs of
currentTimeMillis() returning exactly the same seed. However the other
two RNGs can deal with this just fine, ECJ's can't. Here are the
statistics for a somewhat more plausible use case, 100 samples from
e100,000 instances;

com.purplefrogtext.gpserver.FastRNG

Samples         = 10000000 bytes
Entropy         = 7.999983631419931 bits per byte
Compressibility = 2.046072508643526E-4 percent
Chi Square dist = 226.93775359999995
Chi Square prob = 0.75 (target = 0.5)
Arithmetic mean = 127.5121393 (expected 127.5)
Monte Carlo Pi  = 3.1410588564235424 (error 0.01699129152344847 percent)
Serial CC       = 1.919561334005916E-4 (uncorrelated = 0.0).

ec.util.MersenneTwister

Samples         = 10000000 bytes
Entropy         = 7.988352403809378 bits per byte
Compressibility = 0.1455949523827771 percent
Chi Square dist = 160946.93288959996
Chi Square prob = 1.0E-4 (target = 0.5)
Arithmetic mean = 127.9862915 (expected 127.5)
Monte Carlo Pi  = 3.136556454622582 (error 0.16030719200519186 percent)
Serial CC       = 0.008932329043214497 (uncorrelated = 0.0).

java.util.Random

Samples         = 10000000 bytes
Entropy         = 7.999980355066089 bits per byte
Compressibility = 2.455616738838984E-4 percent
Chi Square dist = 272.2516992
Chi Square prob = 0.25 (target = 0.5)
Arithmetic mean = 127.4523454 (expected 127.5)
Monte Carlo Pi  = 3.144030057612023 (error 0.07758497968999184 percent)
Serial CC       = 1.9457666812929878E-4 (uncorrelated = 0.0).

ec.util.MersenneTwister is still trailing java.util.Random on all
metrics. Note that 100 samples is still small enough to have less value
bins than there are in the sample space (100 samples vs 256 bins) and
instantiations are temporally close enough that there are still long
runs of MT instances with the exact same seed. However as I will now
demonstrate the problems with ECJ's implementation of MT persist even
once you have enough samples to make these issues irrelevant;

> Mersenne Twister has an internal state of 624 longs.  If you fill those 
> longs with a poor-entropy value, or successive poor-entropy values, 
> after just a few pumps (probably just one pump), you will *not* be able 
> to notice from the MT output.

Oh really? Let's try 2000 samples (enough for three 'pumps') each from
5,000 instances;

com.purplefrogtext.gpserver.FastRNG

Samples         = 10000000 bytes
Entropy         = 7.999981427842907 bits per byte
Compressibility = 2.3215196366388469E-4 percent
Chi Square dist = 257.4293504
Chi Square prob = 0.5 (target = 0.5)
Arithmetic mean = 127.5888985 (expected 127.5)
Monte Carlo Pi  = 3.138675655470262 (error 0.09285093394262264 percent)
Serial CC       = 6.943699324379059E-4 (uncorrelated = 0.0).

ec.util.MersenneTwister

Samples         = 10000000 bytes
Entropy         = 7.998478903440615 bits per byte
Compressibility = 0.019013706992310198 percent
Chi Square dist = 21117.694054399988
Chi Square prob = 1.0E-4 (target = 0.5)
Arithmetic mean = 127.7651855 (expected 127.5)
Monte Carlo Pi  = 3.1336044534417815 (error 0.25427230799269135 percent)
Serial CC       = 0.002702179075196308 (uncorrelated = 0.0).

java.util.Random

Samples         = 10000000 bytes
Entropy         = 7.999977990076586 bits per byte
Compressibility = 2.7512404267016066E-4 percent
Chi Square dist = 305.1845632000001
Chi Square prob = 0.025 (target = 0.5)
Arithmetic mean = 127.4706723 (expected 127.5)
Monte Carlo Pi  = 3.1422516569006627 (error 0.020976726887762566 percent)
Serial CC       = 2.753449795306269E-5 (uncorrelated = 0.0).

Nope, ec.util.MersenneTwister is still losing against java.util.Random
on all metrics. Note that I'm using relatively simple tests here because
I don't want to waste all morning on this; they can't detect the fact
that many of the MersenneTwister instances are returning exactly the
same sequence (because System.currentTimeMillis() has fairly sucky
resolution). If I was using the full test suite, MersenneTwister would
be performing much, much worse. Anyway, here's 100,000 samples each from
1000 instances;

com.purplefrogtext.gpserver.FastRNG

Samples         = 100000000 bytes
Entropy         = 7.999998244340672 bits per byte
Compressibility = 2.194574160174767E-5 percent
Chi Square dist = 243.35768064
Chi Square prob = 0.5 (target = 0.5)
Arithmetic mean = 127.49580098 (expected 127.5)
Monte Carlo Pi  = 3.1419982856799313 (error 0.01291167044443897 percent)
Serial CC       = 4.789864457056014E-5 (uncorrelated = 0.0).

ec.util.MersenneTwister

Samples         = 100000000 bytes
Entropy         = 7.999992526963659 bits per byte
Compressibility = 9.341295426068541E-5 percent
Chi Square dist = 1036.1470259199996
Chi Square prob = 1.0E-4 (target = 0.5)
Arithmetic mean = 127.50160563 (expected 127.5)
Monte Carlo Pi  = 3.1415667656626707 (error 8.24038313587706E-4 percent)
Serial CC       = 6.351734037428386E-5 (uncorrelated = 0.0).

java.util.Random

Samples         = 100000000 bytes
Entropy         = 7.999998375272339 bits per byte
Compressibility = 2.03090957606733E-5 percent
Chi Square dist = 225.23552767999993
Chi Square prob = 0.9 (target = 0.5)
Arithmetic mean = 127.50565567 (expected 127.5)
Monte Carlo Pi  = 3.140841725633669 (error 0.02390277922460475 percent)
Serial CC       = 5.0981060362350845E-5 (uncorrelated = 0.0).

Finally, ec.util.MersenneTwister has caught up with java.util.Random;
it has the same order of magnitude on serial correlation and entropy.
Only the (extremely sensitive) chi square test is still detecting
significant regularities in ec.util.MersenneTwister. FastRNG is still
considerably ahead, despite the fact that it uses a very similar RNG
algorithm (just with a different seed). At 1,000,000 samples from 100
instances, the quality of MersenneTwister's output finally catches up
with FastRNG's;

com.purplefrogtext.gpserver.FastRNG

Samples         = 100000000 bytes
Entropy         = 7.999998111348311 bits per byte
Compressibility = 2.3608146115794426E-5 percent
Chi Square dist = 261.8269440000002
Chi Square prob = 0.5 (target = 0.5)
Arithmetic mean = 127.5059442 (expected 127.5)
Monte Carlo Pi  = 3.1413310856532433 (error 0.008325966011251356 percent)
Serial CC       = 5.907318448570472E-5 (uncorrelated = 0.0).

ec.util.MersenneTwister

Samples         = 100000000 bytes
Entropy         = 7.999998170348357 bits per byte
Compressibility = 2.287064553296858E-5 percent
Chi Square dist = 253.65223935999998
Chi Square prob = 0.5 (target = 0.5)
Arithmetic mean = 127.50161809 (expected 127.5)
Monte Carlo Pi  = 3.1413238856529553 (error 0.008555149138470275 percent)
Serial CC       = 8.444797288330956E-5 (uncorrelated = 0.0).

java.util.Random

Samples         = 100000000 bytes
Entropy         = 7.999998593462497 bits per byte
Compressibility = 1.7581718791959133E-5 percent
Chi Square dist = 194.98636799999997
Chi Square prob = 0.995 (target = 0.5)
Arithmetic mean = 127.49867176 (expected 127.5)
Monte Carlo Pi  = 3.141109325644373 (error 0.015384806329613794 percent)
Serial CC       = 2.8187345499363348E-5 (uncorrelated = 0.0).

So, if you want to use ec.util.MersenneTwister in a serious application,
apparently you should call it 1,000,000 times to warm it up first (I
did some informal tests of this and it seems to work). Or you could
just drop in the replacement seeding algorithm and get high quality
output with no extra work.

 > It will be so random as to pass every known statistical test.

By all means, please independently replicate my tests. There may be some
obvious problem with them that I'm not seeing. I can't spare more time
this morning on this otherwise I'd use more samples, but I think these
numbers are adequate for statistical confidence.

> Even if you're still not willing to trust the math, we're not loading 
> the internal state with low-entropy values.  We're loading it with 
> relatively high-entropy values  generated as the stream output of a
> carefully-selected linear congruential generator whose seed in turn is 
> set to the initial 32-bit number you provide.

You don't appear to understand entropy. Entropy isn't the computational
cost of predicting the output given the input, it's the absolute
uncertainty in the probability distribution over possible output values
in any particular context. You can't magically create entropy from
nothing; it has to come from outside the system. You /can/ improve the
statistical quality of your output without requiring external entropy.
However it is clear that the default state vector fill algorithm in
the ECJ implementation has serious shortcomings. The easiest and
highest performing way to fix this is to use external entropy, as the
above results demonstrate (AFAIK - if anyone else has a better solution
feel free to jump in).

> Ordinarily linear congruential generators aren't fantastic to use as
 > random number generators for experiments, but they're just fine for
 > providing enough entropy

PRNGs /never/ produce entropy. You can't create uncertainty in a
deterministic system. PRNG's /transform/ their input into an output
sequence that hopefully meets various statistical criteria and in the
case of secure RNGs is computationally infeasible to deduce the PRNG's
internal state from.

> to MT so that those cargo-cult programmers who don't trust MT 
> theorists can feel like you've waved the necessary dead chickens.

It would appear that in this case, the chickens were very much alive.
Incidentally, when you reported a 'bug' in java.util.Random a few years
back, regarding 'dimensional stability' in reusing an RNG vector element
three times to generate three bytes, did you run statistical tests or
did you rely on some combination of intuition and poultry of uncertain
freshness? :)

> This means that you should be able to use seeds like this:
> 
>     2345, 2346, 2347, ...
> 
> and have pretty high-quality randomly different experimental results.  
> If you have statistical results that counter this, I (and I think a
 > fair degree of the RNG community) would really like to see them
 > published! (no really! I'm not trying to be sarcastic.  What do you 
have?)

See above.

> We chose to be 100% consistent with the MT standard rather than risk 
> wandering off the ranch in search of a seed twice as long (which only 
> benefits you if you need to do more than 2^32 experiments).

This would not be true even if the seed was a genuine random choice
between 2^32 values, which it isn't. The effective entropy in each
bit falls off steadily (logarithmically I think, but you'd have to
ask a mathematician) as you run the PRNG. Logically this is
equivlanet to the theoretical probability distribution over the internal
state, from the point of view of someone who can only see the output
(but has unbounded computing power to analyse it); you don't
suddenly go from 'completely unknown' to 'known'. However for most GP
purposes we don't really care about absolute entropy as long as it's
at least minimally decent. We /do/ however care about statistical
quality during the first million samples, and while more input entropy
may technically be overkill it's a cheap quick way to fix the problem.

This is why my implementation of defaultSeed() XORs everything together
into a 64-bit value despite having 320 bits of input; the number of
bits of actual /entropy/ in the inputs is much lower, so a 320-bit
output would be pointless.

> If you'd like something else, you should just go in and set all 624
 > initial values in the MT array: there's a function for that.

The problem there is getting the initial values. FastRNG actually has
an option to get them from Hotbits, an Internet server that uses a
radioactive decay source to generate real high-quality entropy. But
this is overkill for 99.999% of applications. Yes, you could wrap
all your MersenneTwister applications in a complex extra-high-quality
seeding algorithm. But defaultSeed() does the main job (reduces the
warmup time from ~1,000,000 to ~0) and is transparent to users.

Incidentally here's our (non-Hotbits) high quality seeder, mainly
intended for cryptographic use (this is still experimental, not in
production, and it's called immediately after the vector is initialised
with defaultSeed() & the default MT init algorithm);

     final long[] seed = new long[20];

     // Initialise low bits by hashing together system properties.
     try {

       final Properties sysprops = System.getProperties();
       final MessageDigest digest = MessageDigest.getInstance("SHA");
       for(final String pname : sysprops.stringPropertyNames()) {

         // Add a single property (including name) to the digest.
         for(int index = pname.length()-1; index>=0; index--)
           digest.update((byte)(pname.charAt(index)));
         final String pvalue = sysprops.getProperty(pname);
         if(pvalue!=null) {
           for(int index = pvalue.length()-1; index>=0; index--)
             digest.update((byte)(pvalue.charAt(index)));
         }

       }

       final byte[] output = digest.digest();
       assert output.length==20;
       for(int index = 0; index<20; index++)
         seed[index] = output[index]*SHA1_EXPAND;

     } catch(NoSuchAlgorithmException e) { }

     // Mix in some info that wasn't already used in defaultSeed().
     final Runtime rt = Runtime.getRuntime();
     seed[0] = seed[0]^rt.maxMemory();
     seed[1] = seed[1]^rt.totalMemory();
     rt.gc(); seed[2] = seed[2]^rt.freeMemory();
     seed[3] = seed[3]^rt.hashCode();

     // Start some 'scheduling noise' generators.
     int noisecount = rt.availableProcessors()*NOISE_GENERATORS;
     final FastRNG[] noises = new FastRNG[noisecount];
     final int mypri = Thread.currentThread().getPriority();
     long nseed = seed[2];
     do {
       noisecount--; nseed = nseed+seed[3];
       noises[noisecount] = new FastRNG(nseed, mypri);
     } while(noisecount>0);

     // Generate primary entropy from timing measurements.
     for(int shift = 0; shift<12; shift++) {
       for(int index = 0; index<20; index++) {

         target.bitsused = 0;
         final long end = System.nanoTime()+LOCK_COUNT_INTERVAL;
         do {
           synchronized(target) { target.bitsused++; };
         } while(System.nanoTime()<end||target.bitsused<MIN_LOCK_COUNT);
         seed[index] = seed[index]^(end<<shift)^
                       (((long)target.bitsused)<<(shift*4));

       }
     }

     // Shut down noise generators.
     for(final FastRNG noise : noises) noise.bitsused = 0;
     target.addEntropy(seed);

   }

Here's the helper method it uses;

   public void run() {

     // Do nothing if not called as part of generateEntropy.
     if(bitsused!=-1) return;

     // Loop until told to stop, sleeping pseudorandomly.
     while(true) {
       synchronized(this) {
         if(bitsused==0) return;
       }
       try { Thread.sleep(0, (int)(nextLong()&0x7FFFF)); }
       catch(InterruptedException e) { }
     }

   }

This works via hashing the system properties and extracing
timing noise from the OS's scheduler and runtime properties;
similar to Java's SecureRandom, but in our tests considerably
more effective. However it effectively locks up your system for
a couple of seconds while it initialises.

> Otherwise, there are *big* benefits to having a standard RNG.

I can't think of any other than not causing people's ears to prick up
when they read papers that describe ECJ. But even that isn't an issue
here. Our implementation still uses MT19937-64, a standard algorithm
very similar to the one ECJ uses, including the state vector
initilisation. It just has an extra input stage /before/ that which
fixes the observed warmup time issue. This is actually /more/ friendly
to independent replication than the current ECJ implementation, because
the resolution of System.currentTimeMillis() and hence the ECJ warmup
period varies significantly between /major/ platforms (i.e. old Windows,
new Windows, Linux). Our defaultSeed() isn't great (compared to a real
high quality seed algorithm), but it's good enough to reduce the warmup
time to zero or close to it on all platforms.

> 32 bits, rather than a longg is not a big deal either.  After all, 
> java.util.Random uses a only 48-bit seed rather than a long (and in fact 
> that's its entire internal state!  java.util.Random is very poor).

I don't follow your argument here. You're saying that java.util.Random
is very poor for using only 48-bits. Then you say that using 32 bits
is 'no big deal'. It's true that seed vs internal state is an apples vs
oranges comparison, but that just makes me less able to understand what
you're trying to say here.

> Now: how to initialize RNGs in sequence?  Assuming you let MT's default 
> seeder do its job, the answer then doesn't lie in seeding entropy, but 
> in simply guaranteeing that you have a different seed each time.  Just 
> grab System.currentTimeMillis(), use it for your first seed, then use 
> the same value + C for your next seed, then the next value + 2C for your 
> next seed, then +3C, etc., where C is some integer > 0.  Just maintain 
> that internal counter somewhere and you should be fine.

This is adequate to ensure that sequences are distinct. It does not fix
the warmup time issue, it does not ensure sequence unpredictability and
it's an additional burden on the user for no good reason. One would not
/expect/ temporally-adjacent calls to the default constructor to
produce identical RNG sequences (certainly java.util.Random doesn't),
so they should not.

Michael Wilson
Software Engineer
Purple Frog Text
ATOM RSS1 RSS2
LISTSERV.GMU.EDU