Sean Luke wrote:

>> If you create several instances in quick succession you get identical
>> RNGs, and System.currentTimeMillis() doesn't have much entropy anyway.
>
> I think this notion betrays a bit of cargo-cult programming, and it's
> worth explaining why.

Entropy and distinctiveness are different concepts. In some applications
(e.g. cryptography) the former is important. However, that's not the issue
here. The issue is that MersenneTwister has a surprisingly long warmup
time, which requires a decent seed to overcome. Statistical testing
indicated that the ECJ implementation has some serious shortcomings in
this respect.

> - nanoTime is not available on earlier Java versions,

True. But if you're still using a VM from 2003, you have other problems.
The only major group who may not have gotten around to upgrading yet are
OS X users. nanoTime() in particular is very useful for performance
measurements; you can use it reliably on much smaller intervals than
System.currentTimeMillis().

> *and* it's not guaranteed to work at all.

Also true, but not a problem in this case; the theoretical default
behaviour is to wrap another call to System.currentTimeMillis(), and
we're just one source of entropy down. However, I have yet to encounter a
Java platform on which one might reasonably want to do GP research that
does not implement nanoTime(). Our class in particular is fairly specific
anyway; it uses 64-bit algorithms for better performance on 64-bit
platforms, which anyone interested in performance will be using anyway.

> - hashCode is the wrong function. It should be
> System.identityHashCode(...).

That won't make any difference in either of these cases, because neither
FastRNG nor MersenneTwister overrides hashCode. However it is
semantically clearer, so I will include it anyway, thanks.

> It's also not as random as you think:
> it's word aligned and sequentially packed on some systems.

It isn't random at all. It's merely moderately unpredictable.
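To make the discussion concrete, here is a minimal sketch of the kind of seed mixing under discussion. This is my illustration, not the actual FastRNG or ECJ code; the class name and the splitmix64-style finalising mix are my own assumptions about one reasonable way to do it.

```java
// Illustrative only: fold several cheap, weakly-unpredictable sources
// into a single 64-bit seed. None of these sources is random on its own;
// the aim is distinct, well-spread seeds, not cryptographic strength.
public final class SeedMix {
    public static long defaultSeed(final Object instance) {
        long seed = System.nanoTime();                            // fine-grained timer
        seed ^= ((long)System.identityHashCode(instance)) << 32;  // per-object value
        seed ^= System.currentTimeMillis();                       // coarse wall clock
        // A splitmix64-style finaliser spreads the few unpredictable
        // bits across the whole word, so nearby inputs give distant seeds.
        seed ^= seed >>> 33; seed *= 0xff51afd7ed558ccdL;
        seed ^= seed >>> 33; seed *= 0xc4ceb9fe1a85ec53L;
        seed ^= seed >>> 33;
        return seed;
    }
}
```

Even if two constructions land in the same millisecond, the identity hash and nanoTime components make the inputs differ, and the finalising multiply spreads that difference across all 64 bits.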
Neither are keypress timings, mouse movements or hard disk activity. But
these are all perfectly good sources of seed entropy (in the relevant
contexts).

>> This produces a reasonable amount of entropy
>> for experimental purposes (we have a couple of 'high quality' seeding
>> methods too but they take longer to run).
>
> I think this may be mistaken. It's true that if you used a crummy
> generator with a small seed (such as c's rand()), providing a good
> entropy seed would be important.

A good entropy seed is always important to minimise sequence
predictability. You are correct that it is not important for the
statistical quality of a sequence output by a 'warmed up' RNG, but as I
am about to demonstrate, the 'warmup' required to get the ECJ
implementation to output high-quality results can be on the order of
/one million samples/. This is unacceptable regardless of absolute
predictability issues (which /are/ relevant in one of our non-GP
applications that uses this class).

Note that inexperienced users can instantiate large numbers of RNGs
(e.g. one per individual, or one per generation/subpop) and take only a
few values (or even just one value!) from each. A high quality seed
algorithm means that the user will still get statistically decent
results (though with much lower performance, of course) even in this
worst case.

> But a crummy generator MT is not. Generally speaking, providing seed
> entropy shouldn't have almost any effect at all. It's waving a dead
> chicken.

Demonstrably untrue. Here are some statistics for ten million calls to
nextByte() on new instances:

com.purplefrogtext.gpserver.FastRNG

Samples = 10000000 bytes
Entropy = 7.99998333243039 bits per byte
Compressibility = 2.0834462012286892E-4 percent
Chi Square dist = 231.1654912
Chi Square prob = 0.75 (target = 0.5)
Arithmetic mean = 127.4779674 (expected 127.5)
Monte Carlo Pi = 3.1419420567768226 (error 0.01112184886956054 percent)
Serial CC = 4.725454754310679E-6 (uncorrelated = 0.0).
ec.util.MersenneTwister

Samples = 10000000 bytes
Entropy = 7.977377663332874 bits per byte
Compressibility = 0.282779208339079 percent
Chi Square dist = 314521.36238079995
Chi Square prob = 1.0E-4 (target = 0.5)
Arithmetic mean = 127.4029672 (expected 127.5)
Monte Carlo Pi = 2.8205531282212513 (error 10.219005478055875 percent)
Serial CC = 0.9990445683267967 (uncorrelated = 0.0).

java.util.Random

Samples = 10000000 bytes
Entropy = 7.999888861410743 bits per byte
Compressibility = 0.0013892323657116457 percent
Chi Square dist = 1537.6350719999996
Chi Square prob = 1.0E-4 (target = 0.5)
Arithmetic mean = 127.7022814 (expected 127.5)
Monte Carlo Pi = 3.1534812613925047 (error 0.378426139656486 percent)
Serial CC = 4.836986086728444E-5 (uncorrelated = 0.0).

As you can see, the performance of the ec.util.MersenneTwister class is
fairly abysmal, considerably worse than java.util.Random. Of course this
is an absolute worst case; there are huge long runs of
currentTimeMillis() returning exactly the same seed. However, the other
two RNGs can deal with this just fine; ECJ's can't.

Here are the statistics for a somewhat more plausible use case, 100
samples each from 100,000 instances:

com.purplefrogtext.gpserver.FastRNG

Samples = 10000000 bytes
Entropy = 7.999983631419931 bits per byte
Compressibility = 2.046072508643526E-4 percent
Chi Square dist = 226.93775359999995
Chi Square prob = 0.75 (target = 0.5)
Arithmetic mean = 127.5121393 (expected 127.5)
Monte Carlo Pi = 3.1410588564235424 (error 0.01699129152344847 percent)
Serial CC = 1.919561334005916E-4 (uncorrelated = 0.0).

ec.util.MersenneTwister

Samples = 10000000 bytes
Entropy = 7.988352403809378 bits per byte
Compressibility = 0.1455949523827771 percent
Chi Square dist = 160946.93288959996
Chi Square prob = 1.0E-4 (target = 0.5)
Arithmetic mean = 127.9862915 (expected 127.5)
Monte Carlo Pi = 3.136556454622582 (error 0.16030719200519186 percent)
Serial CC = 0.008932329043214497 (uncorrelated = 0.0).
java.util.Random

Samples = 10000000 bytes
Entropy = 7.999980355066089 bits per byte
Compressibility = 2.455616738838984E-4 percent
Chi Square dist = 272.2516992
Chi Square prob = 0.25 (target = 0.5)
Arithmetic mean = 127.4523454 (expected 127.5)
Monte Carlo Pi = 3.144030057612023 (error 0.07758497968999184 percent)
Serial CC = 1.9457666812929878E-4 (uncorrelated = 0.0).

ec.util.MersenneTwister is still trailing java.util.Random on all
metrics. Note that 100 samples is still small enough that each instance
contributes fewer samples than there are value bins in the sample space
(100 samples vs 256 bins), and instantiations are temporally close
enough that there are still long runs of MT instances with the exact
same seed. However, as I will now demonstrate, the problems with ECJ's
implementation of MT persist even once you have enough samples to make
these issues irrelevant:

> Mersenne Twister has an internal state of 624 longs. If you fill those
> longs with a poor-entropy value, or successive poor-entropy values,
> after just a few pumps (probably just one pump), you will *not* be able
> to notice from the MT output.

Oh really? Let's try 2000 samples (enough for three 'pumps') each from
5,000 instances:

com.purplefrogtext.gpserver.FastRNG

Samples = 10000000 bytes
Entropy = 7.999981427842907 bits per byte
Compressibility = 2.3215196366388469E-4 percent
Chi Square dist = 257.4293504
Chi Square prob = 0.5 (target = 0.5)
Arithmetic mean = 127.5888985 (expected 127.5)
Monte Carlo Pi = 3.138675655470262 (error 0.09285093394262264 percent)
Serial CC = 6.943699324379059E-4 (uncorrelated = 0.0).

ec.util.MersenneTwister

Samples = 10000000 bytes
Entropy = 7.998478903440615 bits per byte
Compressibility = 0.019013706992310198 percent
Chi Square dist = 21117.694054399988
Chi Square prob = 1.0E-4 (target = 0.5)
Arithmetic mean = 127.7651855 (expected 127.5)
Monte Carlo Pi = 3.1336044534417815 (error 0.25427230799269135 percent)
Serial CC = 0.002702179075196308 (uncorrelated = 0.0).
java.util.Random

Samples = 10000000 bytes
Entropy = 7.999977990076586 bits per byte
Compressibility = 2.7512404267016066E-4 percent
Chi Square dist = 305.1845632000001
Chi Square prob = 0.025 (target = 0.5)
Arithmetic mean = 127.4706723 (expected 127.5)
Monte Carlo Pi = 3.1422516569006627 (error 0.020976726887762566 percent)
Serial CC = 2.753449795306269E-5 (uncorrelated = 0.0).

Nope, ec.util.MersenneTwister is still losing against java.util.Random
on all metrics. Note that I'm using relatively simple tests here because
I don't want to waste all morning on this; they can't detect the fact
that many of the MersenneTwister instances are returning exactly the
same sequence (because System.currentTimeMillis() has fairly sucky
resolution). If I were using the full test suite, MersenneTwister would
be performing much, much worse.

Anyway, here's 100,000 samples each from 1000 instances:

com.purplefrogtext.gpserver.FastRNG

Samples = 100000000 bytes
Entropy = 7.999998244340672 bits per byte
Compressibility = 2.194574160174767E-5 percent
Chi Square dist = 243.35768064
Chi Square prob = 0.5 (target = 0.5)
Arithmetic mean = 127.49580098 (expected 127.5)
Monte Carlo Pi = 3.1419982856799313 (error 0.01291167044443897 percent)
Serial CC = 4.789864457056014E-5 (uncorrelated = 0.0).

ec.util.MersenneTwister

Samples = 100000000 bytes
Entropy = 7.999992526963659 bits per byte
Compressibility = 9.341295426068541E-5 percent
Chi Square dist = 1036.1470259199996
Chi Square prob = 1.0E-4 (target = 0.5)
Arithmetic mean = 127.50160563 (expected 127.5)
Monte Carlo Pi = 3.1415667656626707 (error 8.24038313587706E-4 percent)
Serial CC = 6.351734037428386E-5 (uncorrelated = 0.0).
java.util.Random

Samples = 100000000 bytes
Entropy = 7.999998375272339 bits per byte
Compressibility = 2.03090957606733E-5 percent
Chi Square dist = 225.23552767999993
Chi Square prob = 0.9 (target = 0.5)
Arithmetic mean = 127.50565567 (expected 127.5)
Monte Carlo Pi = 3.140841725633669 (error 0.02390277922460475 percent)
Serial CC = 5.0981060362350845E-5 (uncorrelated = 0.0).

Finally, ec.util.MersenneTwister has caught up with java.util.Random; it
has the same order of magnitude on serial correlation and entropy. Only
the (extremely sensitive) chi square test is still detecting significant
regularities in ec.util.MersenneTwister. FastRNG is still considerably
ahead, despite the fact that it uses a very similar RNG algorithm (just
with a different seed).

At 1,000,000 samples from 100 instances, the quality of
MersenneTwister's output finally catches up with FastRNG's:

com.purplefrogtext.gpserver.FastRNG

Samples = 100000000 bytes
Entropy = 7.999998111348311 bits per byte
Compressibility = 2.3608146115794426E-5 percent
Chi Square dist = 261.8269440000002
Chi Square prob = 0.5 (target = 0.5)
Arithmetic mean = 127.5059442 (expected 127.5)
Monte Carlo Pi = 3.1413310856532433 (error 0.008325966011251356 percent)
Serial CC = 5.907318448570472E-5 (uncorrelated = 0.0).

ec.util.MersenneTwister

Samples = 100000000 bytes
Entropy = 7.999998170348357 bits per byte
Compressibility = 2.287064553296858E-5 percent
Chi Square dist = 253.65223935999998
Chi Square prob = 0.5 (target = 0.5)
Arithmetic mean = 127.50161809 (expected 127.5)
Monte Carlo Pi = 3.1413238856529553 (error 0.008555149138470275 percent)
Serial CC = 8.444797288330956E-5 (uncorrelated = 0.0).
java.util.Random

Samples = 100000000 bytes
Entropy = 7.999998593462497 bits per byte
Compressibility = 1.7581718791959133E-5 percent
Chi Square dist = 194.98636799999997
Chi Square prob = 0.995 (target = 0.5)
Arithmetic mean = 127.49867176 (expected 127.5)
Monte Carlo Pi = 3.141109325644373 (error 0.015384806329613794 percent)
Serial CC = 2.8187345499363348E-5 (uncorrelated = 0.0).

So, if you want to use ec.util.MersenneTwister in a serious application,
apparently you should call it 1,000,000 times to warm it up first (I did
some informal tests of this and it seems to work). Or you could just
drop in the replacement seeding algorithm and get high quality output
with no extra work.

> It will be so random as to pass every known statistical test.

By all means, please independently replicate my tests. There may be some
obvious problem with them that I'm not seeing. I can't spare more time
this morning on this, otherwise I'd use more samples, but I think these
numbers are adequate for statistical confidence.

> Even if you're still not willing to trust the math, we're not loading
> the internal state with low-entropy values. We're loading it with
> relatively high-entropy values generated as the stream output of a
> carefully-selected linear congruential generator whose seed in turn is
> set to the initial 32-bit number you provide.

You don't appear to understand entropy. Entropy isn't the computational
cost of predicting the output given the input; it's the absolute
uncertainty in the probability distribution over possible output values
in any particular context. You can't magically create entropy from
nothing; it has to come from outside the system. You /can/ improve the
statistical quality of your output without requiring external entropy.
However, it is clear that the default state vector fill algorithm in the
ECJ implementation has serious shortcomings.
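For anyone wanting to replicate: the 'bits per byte' figures in the tables above are standard Shannon-entropy estimates over byte frequencies. Here is a minimal sketch of that estimate; this is my illustration, not the actual test harness, and the class name and the worst-case sampling loop in main() are my own choices.

```java
import java.util.Random;

public final class ByteEntropy {
    // Shannon entropy estimate in bits per byte: H = -sum p(v) log2 p(v)
    // over the 256 possible byte values. A well-seeded generator should
    // approach 8.0 as the sample grows; a badly warmed-up one won't.
    public static double bitsPerByte(final byte[] sample) {
        final long[] counts = new long[256];
        for(final byte b : sample) counts[b & 0xFF]++;
        double h = 0.0;
        for(final long c : counts) {
            if(c == 0) continue;
            final double p = (double)c / sample.length;
            h -= p * (Math.log(p) / Math.log(2.0));
        }
        return h;
    }

    public static void main(final String[] args) {
        // Worst case from the discussion: one value per fresh instance.
        final byte[] sample = new byte[1000000];
        final byte[] one = new byte[1];
        for(int i = 0; i < sample.length; i++) {
            new Random().nextBytes(one); // fresh instance for every draw
            sample[i] = one[0];
        }
        System.out.println(bitsPerByte(sample) + " bits per byte");
    }
}
```

Swapping different generators into the sampling loop (and varying samples-per-instance) reproduces the shape of the comparisons above, though the full battery (chi square, serial correlation, Monte Carlo Pi) is of course more sensitive than entropy alone.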
The easiest and highest-performing way to fix this is to use external
entropy, as the above results demonstrate (AFAIK - if anyone else has a
better solution, feel free to jump in).

> Ordinarily linear congruential generators aren't fantastic to use as
> random number generators for experiments, but they're just fine for
> providing enough entropy

PRNGs /never/ produce entropy. You can't create uncertainty in a
deterministic system. PRNGs /transform/ their input into an output
sequence that hopefully meets various statistical criteria and, in the
case of secure RNGs, from which it is computationally infeasible to
deduce the PRNG's internal state.

> to MT so that those cargo-cult programmers who don't trust MT
> theorists can feel like you've waved the necessary dead chickens.

It would appear that in this case, the chickens were very much alive.
Incidentally, when you reported a 'bug' in java.util.Random a few years
back, regarding 'dimensional stability' in reusing an RNG vector element
three times to generate three bytes, did you run statistical tests or
did you rely on some combination of intuition and poultry of uncertain
freshness? :)

> This means that you should be able to use seeds like this:
>
> 2345, 2346, 2347, ...
>
> and have pretty high-quality randomly different experimental results.
> If you have statistical results that counter this, I (and I think a
> fair degree of the RNG community) would really like to see them
> published! (no really! I'm not trying to be sarcastic. What do you have?)

See above.

> We chose to be 100% consistent with the MT standard rather than risk
> wandering off the ranch in search of a seed twice as long (which only
> benefits you if you need to do more than 2^32 experiments).

This would not be true even if the seed were a genuine random choice
between 2^32 values, which it isn't. The effective entropy in each bit
falls off steadily (logarithmically, I think, but you'd have to ask a
mathematician) as you run the PRNG.
Logically this is equivalent to the theoretical probability distribution
over the internal state, from the point of view of someone who can only
see the output (but has unbounded computing power to analyse it); you
don't suddenly go from 'completely unknown' to 'known'.

However, for most GP purposes we don't really care about absolute
entropy as long as it's at least minimally decent. We /do/ however care
about statistical quality during the first million samples, and while
more input entropy may technically be overkill, it's a cheap, quick way
to fix the problem. This is why my implementation of defaultSeed() XORs
everything together into a 64-bit value despite having 320 bits of
input; the number of bits of actual /entropy/ in the inputs is much
lower, so a 320-bit output would be pointless.

> If you'd like something else, you should just go in and set all 624
> initial values in the MT array: there's a function for that.

The problem there is getting the initial values. FastRNG actually has an
option to get them from Hotbits, an Internet server that uses a
radioactive decay source to generate real high-quality entropy. But this
is overkill for 99.999% of applications. Yes, you could wrap all your
MersenneTwister applications in a complex extra-high-quality seeding
algorithm. But defaultSeed() does the main job (it reduces the warmup
time from ~1,000,000 samples to ~0) and is transparent to users.

Incidentally, here's our (non-Hotbits) high quality seeder, mainly
intended for cryptographic use (this is still experimental, not in
production, and it's called immediately after the vector is initialised
with defaultSeed() and the default MT init algorithm):

final long[] seed = new long[20];

// Initialise low bits by hashing together system properties.
try {
    final Properties sysprops = System.getProperties();
    final MessageDigest digest = MessageDigest.getInstance("SHA");
    for(final String pname : sysprops.stringPropertyNames()) {
        // Add a single property (including name) to the digest.
        for(int index = pname.length()-1; index>=0; index--)
            digest.update((byte)(pname.charAt(index)));
        final String pvalue = sysprops.getProperty(pname);
        if(pvalue!=null) {
            for(int index = pvalue.length()-1; index>=0; index--)
                digest.update((byte)(pvalue.charAt(index)));
        }
    }
    final byte[] output = digest.digest();
    assert output.length==20;
    for(int index = 0; index<20; index++)
        seed[index] = output[index]*SHA1_EXPAND;
} catch(NoSuchAlgorithmException e) { }

// Mix in some info that wasn't already used in defaultSeed().
final Runtime rt = Runtime.getRuntime();
seed[0] = seed[0]^rt.maxMemory();
seed[1] = seed[1]^rt.totalMemory();
rt.gc();
seed[2] = seed[2]^rt.freeMemory();
seed[3] = seed[3]^rt.hashCode();

// Start some 'scheduling noise' generators.
int noisecount = rt.availableProcessors()*NOISE_GENERATORS;
final FastRNG[] noises = new FastRNG[noisecount];
final int mypri = Thread.currentThread().getPriority();
long nseed = seed[2];
do {
    noisecount--;
    nseed = nseed+seed[3];
    noises[noisecount] = new FastRNG(nseed, mypri);
} while(noisecount>0);

// Generate primary entropy from timing measurements.
for(int shift = 0; shift<12; shift++) {
    for(int index = 0; index<20; index++) {
        target.bitsused = 0;
        final long end = System.nanoTime()+LOCK_COUNT_INTERVAL;
        do {
            synchronized(target) { target.bitsused++; }
        } while(System.nanoTime()<end||target.bitsused<MIN_LOCK_COUNT);
        seed[index] = seed[index]^(end<<shift)^
                      (((long)target.bitsused)<<(shift*4));
    }
}

// Shut down noise generators.
for(final FastRNG noise : noises) noise.bitsused = 0;
target.addEntropy(seed);
}

Here's the helper method it uses:

public void run() {
    // Do nothing if not called as part of generateEntropy.
    if(bitsused!=-1) return;

    // Loop until told to stop, sleeping pseudorandomly.
    while(true) {
        synchronized(this) {
            if(bitsused==0) return;
        }
        try {
            Thread.sleep(0, (int)(nextLong()&0x7FFFF));
        } catch(InterruptedException e) { }
    }
}

This works by hashing the system properties and extracting timing noise
from the OS's scheduler and runtime properties; it is similar to Java's
SecureRandom, but in our tests considerably more effective. However, it
effectively locks up your system for a couple of seconds while it
initialises.

> Otherwise, there are *big* benefits to having a standard RNG.

I can't think of any, other than not causing people's ears to prick up
when they read papers that describe ECJ. But even that isn't an issue
here. Our implementation still uses MT19937-64, a standard algorithm
very similar to the one ECJ uses, including the state vector
initialisation. It just has an extra input stage /before/ that which
fixes the observed warmup time issue. This is actually /more/ friendly
to independent replication than the current ECJ implementation, because
the resolution of System.currentTimeMillis(), and hence the ECJ warmup
period, varies significantly between /major/ platforms (i.e. old
Windows, new Windows, Linux). Our defaultSeed() isn't great (compared
to a real high quality seed algorithm), but it's good enough to reduce
the warmup time to zero or close to it on all platforms.

> 32 bits, rather than a long, is not a big deal either. After all,
> java.util.Random uses only a 48-bit seed rather than a long (and in
> fact that's its entire internal state! java.util.Random is very poor).

I don't follow your argument here. You're saying that java.util.Random
is very poor for using only 48 bits. Then you say that using 32 bits is
'no big deal'. It's true that seed vs internal state is an apples vs
oranges comparison, but that just makes me less able to understand what
you're trying to say here.

> Now: how to initialize RNGs in sequence?
> Assuming you let MT's default
> seeder do its job, the answer then doesn't lie in seeding entropy, but
> in simply guaranteeing that you have a different seed each time. Just
> grab System.currentTimeMillis(), use it for your first seed, then use
> the same value + C for your next seed, then the next value + 2C for your
> next seed, then +3C, etc., where C is some integer > 0. Just maintain
> that internal counter somewhere and you should be fine.

This is adequate to ensure that sequences are distinct. It does not fix
the warmup time issue, it does not ensure sequence unpredictability, and
it's an additional burden on the user for no good reason. One would not
/expect/ temporally-adjacent calls to the default constructor to produce
identical RNG sequences (certainly java.util.Random doesn't), so they
should not.

Michael Wilson
Software Engineer
Purple Frog Text