## Testing Randomness with Compression


Continuing this series of posts, here is a way to test the randomness of a sequence of bits from a PRNG that is appropriate to the first morning back after the August long weekend. It is a very fast, not very formal test, done by checking how well a sequence compresses. This is really easy in Maple 14, with the new commands StringTools:-Compress and StringTools:-Uncompress, which use zlib. Here is a simple procedure to take a list of ones and zeroes and return (approximately) how compressible it is:

```
Compressibility := proc(L::list, $)
local byteArray, compressedArray, compression;
    byteArray := Array(1..floor(nops(L)/8),
        i -> Bits:-Join([seq(L[j], j = 8*(i-1)+1 .. 8*(i-1)+8)]) - 128,
        datatype = integer[1]);
    compressedArray := StringTools:-Compress(byteArray);
    compression := (rtable_num_elems(compressedArray) - 9) / rtable_num_elems(byteArray);
    return evalf(compression);
end proc: # Compressibility
```

It simply packs the bits into one-byte signed integers (Maple type integer[1]) and then compresses. A fudge factor of 9 is subtracted from the length of the compressed array to account for the zlib header, and then the ratio of compressed to uncompressed length is computed. An answer greater than 1 means the input was incompressible. An answer less than one means the input is likely not random; the smaller the number, the less likely.
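For readers without Maple, the same test can be sketched in Python with the standard zlib module (the function name and unsigned packing are my own; the ratio and the fudge factor of 9 follow the procedure above):

```python
import zlib

def compressibility(bits):
    """Approximate compressibility of a 0/1 list: pack the bits into
    bytes, compress with zlib, and return the ratio of compressed to
    uncompressed length after subtracting 9 bytes of zlib overhead."""
    n = len(bits) // 8  # discard any trailing partial byte
    data = bytes(
        int("".join(str(b) for b in bits[8*i : 8*i + 8]), 2)
        for i in range(n)
    )
    # (Python bytes are unsigned, so no -128 shift is needed here.)
    return (len(zlib.compress(data)) - 9) / len(data)
```

As with the Maple version, a constant or simply alternating input scores near zero, while uniform random bits score near (or slightly above) 1.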

```
(**) Compressibility([0$10000]);
#                         0.007200000000
(**) Compressibility([seq(op([1,0]),i=1..5000)]);
#                         0.008000000000
(**) r1 := rand(0..1):
(**) L1 := [seq(r1(), i=1..10000)]:
(**) Compressibility(L1);
#                          1.001600000
```

Now for the bad PRNGs from the previous post:

```
(**) r2a := rand(0..2): r2 := ()-> r2a() mod 2:
(**) L2 := [seq(r2(), i=1..10000)]:
(**) Compressibility(L2);
#                          0.9800000000
(**) LCState := 1: r3 := proc() global LCState;
(**)    LCState := irem(89853*LCState, 50027);
(**)    irem(LCState, 2); end proc:
(**) L3 := [seq(r3(), i=1..10000)]:
(**) Compressibility(L3);
#                          0.3632000000
```
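The periodicity of r3 can be confirmed directly: it is a multiplicative congruential generator, so its state must eventually return to the seed. A quick Python re-implementation (using the same constants as above; the helper name is mine) counts the steps:

```python
def lcg_period(multiplier=89853, modulus=50027, seed=1):
    """Iterate state -> multiplier*state mod modulus until the state
    returns to the seed, and return the number of steps taken."""
    state = seed
    steps = 0
    while True:
        state = (multiplier * state) % modulus
        steps += 1
        if state == seed:
            return steps
```

The period comes out well below the modulus, so the 10000 output bits wrap around, and that repetition is exactly what zlib exploits.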

L2 compresses only a little since it is random, just not uniform (0 appears twice as often as 1). L3 compresses quite a lot: even though it has greater entropy than L2 (7.485 vs. 7.197 as byte sequences), it is actually periodic. So zlib compressibility ends up capturing not just entropy / byte frequency; it can also detect longer periodic behavior. Entropy is often described intuitively as the compressibility of a sequence or string, but since it is computed from character-by-character frequencies, it really only describes compressibility by Huffman-style encoding. The DEFLATE algorithm of zlib includes both Lempel-Ziv and Huffman encoding, so it can achieve better compression than the entropy might suggest if there are large repeated subsequences. If you take the sequence that is just two copies of L2, it will have exactly the same entropy, but it will compress twice as well:

```
(**) Compressibility([op(L2),op(L2)]);
#                          0.4948000000
```

Finally, here is a worksheet that computes compressibility as well as entropy and byte frequency counts: Compressibility.mw