datasketches-java

Reservoir sampling improvements

Open jmalkin opened this issue 11 months ago • 4 comments

We can make a number of improvements to reservoir sampling:

  • We can go from 2 random draws per accepted sample to 1 by picking a random long from 0 to n and accepting if it's less than k -- the same draw also provides the slot to use (see the sketch after this list). This would get around the current max size bit limit (unlikely to ever be an issue, but still) by removing the use of a double for a value in [0, 1). But it would also require a new serialization version, since n is currently stored with fewer than 64 bits.
  • After discussing with a stats person about a year ago, we believe we could go further and draw the gap to the next accepted item directly from its distribution, dropping the random draws to O(accepted items) rather than O(n). As long as merges happen independently of that random draw (so no adversary can inspect it), we should be able to merge and then re-draw for the merged sample.
  • Merge can be improved to give uniform second-order probabilities. This is the exciting one for me. But there's a caveat.
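
As a quick sketch of the first bullet (illustrative only, not the library's code), the single-draw update looks roughly like this, where n counts the items seen so far including the current one:

import random

def update(reservoir, k, n, item, rng=random):
  if n <= k:
    reservoir.append(item)    # still filling: item i always lands at index i
  else:
    j = rng.randrange(n)      # one draw over [0, n)
    if j < k:                 # accept with probability k / n
      reservoir[j] = item     # the same draw supplies the slot to overwrite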

Proposed merge, with python pseudocode:

import random

# Merge two random samples. Samples are lists s1 and s2. The numbers n1 and n2
# are the sizes of the sets from which s1 and s2 were sampled. Value s is the
# specified size for the merged sample. Requires s <= len(s1) and s <= len(s2).
def merge(s1, s2, n1, n2, s):
  # Find the number of items to draw from s1. (The others will come from s2.)
  t = 0  # Number of observations to draw from s1.

  for i in range(s):
    j = random.randint(1, n1 + n2)  # Uniform over the remaining population.

    if j <= n1:
      t += 1
      n1 -= 1
    else:
      n2 -= 1

  # Draw the samples.
  a = random.sample(s1, t)      # Sample without replacement.
  b = random.sample(s2, s - t)  # Sample without replacement.

  return a + b  # Make one list with the elements of a and b.
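
For example (illustrative values only), merging a size-8 sample drawn over a stream of 1000 items with a size-8 sample drawn over a stream of 250 items:

s1 = random.sample(range(1000), 8)        # stand-in for a reservoir over 1000 items
s2 = random.sample(range(1000, 1250), 8)  # stand-in for a reservoir over 250 items
merged = merge(s1, s2, 1000, 250, 8)      # intended: a uniform sample of 8 over all 1250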

The sample without replacement is important here, and it's the source of the caveat: you could almost just take the first t and s - t items from s1 and s2, respectively, except that there's a position bias from the initial reservoir fill. Items 0 through k-1 can only ever appear at their own index; we don't start placing items at random positions until after the reservoir has filled.

The options I can think of for this:

  1. Randomly permute items when the reservoir fills. If it hasn't filled at merge, you're just inserting as new items of weight 1 anyway. Doesn't help with already-serialized samples.
  2. Randomly permute the input sample at merge (a modified Fisher-Yates shuffle since you can stop after the first t or s-t items). But this is an undesirable side-effect for merging, even if it doesn't impact the probabilities.
  3. Copy the array and permute that. Uses more space, and potentially a lot more for a large reservoir, but avoids side-effects. Probably permute an array of integer indices to avoid creating an array of arbitrary objects.

Maybe we can mix 1 and 3? For new samples, randomly permute upon fill and track that with a flag (which would need to be serialized going forward, of course). If a sample needs to be merged and doesn't have the flag, then copy and permute the array.

This should probably have been 3 separate issues. But wanted to document the ideas before I forget yet again.

jmalkin • Feb 03 '25

I think option 3 (with the integer index array approach) ends up winning here. No side effects, and no worrying about whether a sample has or hasn't already been shuffled. And since a merge needs only k samples, I can get away with a modified Fisher-Yates shuffle with early stopping, permuting only the number of items I need each time. That means only k random draws to permute per merge rather than 2k.
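
Something like this (a sketch with hypothetical names, not the final code):

import random

# Permute only the first t entries of an index array, then read those items
# out of the reservoir: a without-replacement sample with no side effects.
def sample_without_replacement(reservoir, t, rng=random):
  idx = list(range(len(reservoir)))   # permute indices, not the items themselves
  for i in range(t):
    j = rng.randrange(i, len(idx))    # uniform over the not-yet-fixed tail
    idx[i], idx[j] = idx[j], idx[i]   # early-stopping Fisher-Yates swap
  return [reservoir[idx[i]] for i in range(t)]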

jmalkin • Mar 10 '25

I tried a basic simulation of the cheaper sampling approach, in this case using Octave to simply track the number of times an item was accepted into the reservoir, averaged across many trials (results at 100k trials didn't look much different from 1-2M, happily).

There seems to be a small but consistent bias with the faster approach of drawing the number of items to skip. I used small reservoirs of 5 and 7 and only went out to n of a few hundred, but the bias appears consistent -- and it grows slowly with n. Since I can prove the existing algorithm's probabilities are correct, I'm treating any deviation from it as suspect and likely a problem.
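
For reference, the kind of skip-based update I mean, sketched along the lines of Li's Algorithm L (illustrative only; not the exact Octave code I simulated):

import math
import random

def reservoir_sample_skip(stream, k, rng=random):
  it = iter(stream)
  reservoir = [item for _, item in zip(range(k), it)]      # initial fill
  w = math.exp(math.log(rng.random()) / k)
  skip = int(math.log(rng.random()) / math.log(1.0 - w))   # items to pass over
  for item in it:
    if skip > 0:
      skip -= 1
      continue
    reservoir[rng.randrange(k)] = item                     # accept into a random slot
    w *= math.exp(math.log(rng.random()) / k)
    skip = int(math.log(rng.random()) / math.log(1.0 - w))
  return reservoir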

jmalkin • Mar 24 '25

Specifically, if I run long enough I end up with the new sampling algo accepting more points. Not sure when I'll get a chance to debug this more.

jmalkin • Mar 24 '25

I used the code samples from the comparison page in #298 to test the various algorithms, thanks to @richardstartin. Rather than drawing from a distribution and plotting, I went with a direct comparison of randomness: since we expect a uniform distribution over the input, I used sequential integers as the input, tracked how often each one was selected across many trials, and then computed the entropy of the resulting counts.
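
The entropy check itself is just this (a sketch with hypothetical names; counts[i] is how many times input item i was selected across all trials, and a perfectly uniform selection gives log2(n)):

import math

def entropy_bits(counts):
  total = sum(counts)
  probs = (c / total for c in counts if c > 0)   # screen out zero counts
  return -sum(p * math.log2(p) for p in probs)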

k = 5, n = 2048, trials = 10M
Entropy:
        Reference: 11.0
        Algorithm R: 10.999969901903189
        Algorithm L: 10.999894443713615
        Algorithm X: 10.999311172690062
        Algorithm Z: 10.995795765496663
k = 5, n = 1024, trials = 10M
Entropy:
        Reference: 10.0
        Algorithm R: 9.999984679575581
        Algorithm L: 9.999836811505581
        Algorithm X: 9.998684771374391
        Algorithm Z: 9.991644691418209
k = 10, n = 128, trials = 10M:
Entropy:
        Reference: 7.0
        Algorithm R: 6.999999117283841
        Algorithm L: 6.999457302895078
        Algorithm X: 6.982288311480539
k = 5, n = 32, trials = 10M:
Entropy:
        Reference: 5.0
        Algorithm R: 4.999999735882753
        Algorithm L: 4.995728053697453
        Algorithm X: 4.964960870537897

Algorithm Z had issues at smaller sizes, where some items were never selected. Once I remembered to screen out zero counts so the entropy wasn't NaN, it did at least run, but the entropy was worse. I could re-run those cases, I suppose, but ... eh.

So ultimately the other algorithms are faster but they're not quite as uniformly random. That discrepancy does seem to diminish with size but doesn't entirely go away.

jmalkin • Apr 04 '25