
Very slow and memory-inefficient algorithm used to sieve primes...

GordonBGood opened this issue 4 years ago · 4 comments

This repo's implementation of sieveUpTo called by primesUpTo* has got to be one of the slowest and most memory-wasteful implementations of the Sieve of Eratosthenes I have seen...

The problems are as follows:

  1. In using IntSet as its base storage, it uses multiple bytes per stored composite entry; memory use is so high that it can't even sieve up to a billion, as it consumes gigabytes of RAM storing an entry for every composite number over that range.
  2. It doesn't even implement the very simple optimization of sieving "odds-only", so there are about two and a half times more culling operations than necessary (and twice as much memory is required for all the unneeded evens).
  3. Composite culling execution overheads are atrocious (although constant time on average), as every culling step must perform a hashing operation.
  4. Cache associativity is also atrocious, as culling skips across this huge hashed/linked memory storage.

It has the single advantage of being short. However, it can't sieve to a billion using a reasonable amount of RAM, and even if it could it would take about a minute or so.
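
To make the criticism concrete, a minimal sketch of this kind of IntSet-per-composite sieve (my own illustration of the idea, not the repo's exact sieveUpTo) looks something like the following:

import intsets

# A naive Sieve of Eratosthenes that records every composite in a sparse
# hashed IntSet; note that every culling step is a hashed insertion and
# that all the even composites get entries too.
proc naiveSieveUpTo(n: int): seq[int] =
  var composites = initIntSet()
  for p in 2 .. n:
    if p notin composites:
      result.add p          # never marked, so p is prime
      var m = p * p         # culling can start at p squared
      while m <= n:
        composites.incl m   # one hashed operation per composite hit
        m += p

Every culling operation above goes through the hashed structure, which is where the per-entry overhead and the poor cache behaviour come from.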

A much better short implementation which greatly reduces the memory consumption such that very large ranges can be sieved (although still somewhat slowly due to the hashing) is as follows:

# compile with "-d:release -d:danger"

import tables
 
type PrimeType = uint64
proc primesHashTable(): iterator(): PrimeType {.closure.} =
  iterator output(): PrimeType {.closure.} =
    # some initial values to avoid race and reduce initializations...
    yield 2.PrimeType; yield 3.PrimeType; yield 5.PrimeType; yield 7.PrimeType
    var h = initTable[PrimeType,PrimeType]()
    var n = 9.PrimeType
    let bps = primesHashTable()
    var bp = bps() # advance past 2
    bp = bps(); var q = bp * bp # to initialize with 3
    while true:
      if n >= q:
        let inc = bp + bp
        h.add(n + inc, inc)
        bp = bps(); q = bp * bp
      elif h.hasKey(n):
        var inc: PrimeType
        discard h.take(n, inc)
        var nxt = n + inc
        while h.hasKey(nxt): nxt += inc # ensure no duplicates
        h.add(nxt, inc)
      else: yield n
      n += 2.PrimeType
  output

# tested as...
from times import epochTime

let limit = 100_000_000.PrimeType
let start = epochTime()
let iter = primesHashTable()
var count = 0
for p in iter():
  if p > limit: break else: count.inc
let elpsd = (epochTime() - start) * 1000.0
echo "Found ", count, " primes up to ", limit, " in ", elpsd, " milliseconds."

This is an "incremental" Odds-Only Sieve of Eratosthenes and only stores a hash entry for each of the base primes up to the square root of the current iteration value (determined automatically by the recursive secondary base primes feed). It still takes about a minute to count the number of primes to a billion, but at least it doesn't take an excessive amount of memory to do so. Its output also doesn't consume memory as it is just an iterator over the "incrementally" found primes.

If being so short isn't a consideration, the Page Segmented Bit-Packed Odds-Only Unbounded Sieve of Eratosthenes from RosettaCode is fast (only about a second to sieve to a billion), consumes even less memory, and can fairly easily have multi-threading added, but it is over a hundred lines of code.
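
To give the flavour of the approach without reproducing all of it, a much-condensed, counting-only sketch of the page-segmentation idea (odds-only with plain boolean flags, so neither bit-packed nor unbounded like the RosettaCode version) might look like this:

from math import sqrt

# Counting-only sketch of page segmentation: sieve the odd numbers in
# fixed-size segments small enough to stay cache-resident, re-using the
# base primes up to sqrt(limit) for every segment.
proc countPrimesSegmented(limit: uint64): int =
  if limit < 2: return 0
  result = 1                                    # count the prime 2 up front
  const segSize = 1 shl 17                      # odd candidates per segment
  let sqrtLimit = uint64(sqrt(float64(limit)))  # fine for a sketch; guard the float sqrt in production
  # small odds-only sieve to collect the base primes up to sqrt(limit);
  # index i represents the odd number 2*i + 1
  var small = newSeq[bool](int((sqrtLimit + 1) div 2))
  var basePrimes: seq[uint64]
  for i in 1 ..< small.len:
    if not small[i]:
      let p = uint64(2 * i + 1)
      basePrimes.add p
      var j = (p * p - 1) div 2
      while j < uint64(small.len):
        small[int(j)] = true
        j += p
  # sieve the rest in segments; `low` is the index of the first odd candidate
  var seg = newSeq[bool](segSize)
  var low = 1'u64                               # index 1 is the odd number 3
  while 2 * low + 1 <= limit:
    for f in seg.mitems: f = false              # reset the segment flags
    for p in basePrimes:
      var j = (p * p - 1) div 2                 # index of p * p
      if j < low:                               # advance to the first multiple inside this segment
        j = low + (p - (low - j) mod p) mod p
      while j < low + uint64(segSize):
        seg[int(j - low)] = true
        j += p
    for k in 0 ..< segSize:
      let n = 2 * (low + uint64(k)) + 1
      if n > limit: break
      if not seg[k]: result.inc
    low += uint64(segSize)

The real gain over a "monolithic" array is that each segment's flags stay in L1/L2 cache while all the base primes are applied to it, which is exactly the cache associativity problem the monolithic approach can't avoid.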

GordonBGood · Sep 28 '19

Hi @GordonBGood, I agree that the implementation is not really good (this is why the test for primality for bigger ints uses deterministic Miller-Rabin, and probabilistic Miller-Rabin for even bigger numbers).

But I am confused about your point 1 (and consequently 3 and 4 as well). The documentation for intsets says literally

An efficient set of int implemented as a sparse bit set.

Now, I admit I never looked at the implementation, but if I understand correctly this means that a set of integers is represented by some pattern of bits like 00010110101001...001001, where a bit in position k means that k is in the set. For a billion that would be 10^9 bits =~ 125MB, not multiple gigabytes. It certainly doesn't use multiple bytes per entry. I agree this is quite a lot - primes are sparse, so most of those bits will be 0. Since the ratio of populated entries is only about 1 / log n, for bigger n there are certainly more efficient data structures, but for smaller n one bit per entry is reasonable. Also, I don't understand why in 3 you speak about hashes: if I understand correctly there should be no hashing going on in this implementation.

Unless, of course, I have totally misunderstood what intsets are. Can you explain a little better why you speak of multiple bytes per entry?

BTW, I completely agree that one can skip even numbers, or for that matter various residue classes modulo k, for as big a k as you care to implement.

andreaferretti · Sep 30 '19

@andreaferretti:

not really good (this is why the test for primality for bigger ints uses...

Yes, I appreciate that, but being so slow, you have to switch to the more complex math sooner than you should have to.

An efficient set of int implemented as a sparse bit set.

I did have a quick look at the implementation, and while it's true that it uses one bit per number within packed 32-bit words, those words still sit inside a nested (so that it can be sparse) hashed tree structure used to quickly find the actual populated bits. That is as far as I looked, as it was enough to see why it takes an average of much more than one bit per represented composite number, and also why it is slow in having to do hashing. As promoted, it is faster per operation than using a full hash table (which is why my table implementation for odds-only, with fewer operations, is no faster than your implementation), but it is probably an inappropriate use of IntSet when a simple bit-packed array would have done the job, and done it much faster other than still having the cache associativity problem for large arrays; even at that, the cache delays are still smaller than the hashing time costs plus the associativity cost.
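
To illustrate the general shape being described (this is just a hypothetical sketch of a "sparse hashed bit set", not Nim's actual intsets source), consider:

import tables

const ChunkBits = 1024

# A hash table from chunk number to a fixed-size bit block: membership tests
# and insertions each pay for a hash lookup on top of the bit twiddling, and
# memory is paid per populated chunk plus hash-table overhead, not a flat one
# bit per candidate number.
type SparseBitSet = Table[int, array[ChunkBits div 32, uint32]]

proc incl(s: var SparseBitSet, x: int) =
  let chunk = x div ChunkBits
  let bit = x mod ChunkBits
  var blk = s.getOrDefault(chunk)       # hashed lookup (all zeros if absent)
  blk[bit shr 5] = blk[bit shr 5] or (1u32 shl (bit and 31))
  s[chunk] = blk                        # hashed insert/update

proc contains(s: SparseBitSet, x: int): bool =
  let chunk = x div ChunkBits
  let bit = x mod ChunkBits
  if not s.hasKey(chunk): return false  # hashed lookup
  result = (s[chunk][bit shr 5] and (1u32 shl (bit and 31))) != 0

Something with roughly this cost profile is what the sieve ends up paying on every single composite culling operation.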

So no, it isn't a simple bit-packed array using the amount of memory you calculated, although it does use some bit-packed words within the hash tables to save space. To save much more space and time, it is quite trivial to implement a "monolithic" bit-packed array version, which would then use only 62.5 Megabytes for a range of a billion for "odds-only" and would reduce the culling time by about a third (not more, because of the remaining cache associativity problem). This type of "naive" SoE (other than moving up one step to use bit-packing and odds-only) can be found anywhere, including RosettaCode.

I highly recommend the page segmented version as it avoids the cache associativity problem at the cost of 150 LOC or so.

or for what is worth various classes modulo k for as big k as you care to implement.

I've done quite a lot of work with maximal wheel factorization as you describe, and while one can implement it efficiently with wheels eliminating multiples of all the primes up to 7 or even 11, anything above that doesn't offer much gain for the cost. The efficient ways of doing this cost another few hundred LOC. One of these days, I'll be posting a way of doing this on the forum that uses Nim to be faster than Kim Walisch's "primesieve", with the main hold-up ATM being the instability of how Nim does threading.

GordonBGood · Sep 30 '19

Thank you, I did not know the internals of intsets. I can easily replace that with custom intsets, and make it odd only. Or switch to your version, I am still not sure. But in any case, I expect that testing a single prime will be much faster with deterministic Miller-Rabin.

andreaferretti · Sep 30 '19

@andreaferretti:

I can easily replace that with custom intsets, and make it odd only. Or switch to your version, I am still not sure.

I think you'll find that (my RosettaCode) page segmented version is immensely faster than anything you can do with intsets, if you don't mind the number of LOC, and even then it is mostly limited by the time it takes to iterate over the found primes. Further refinements using multi-threading can likely reduce the time by another factor of three or so, down to a fraction of a second to iterate over the primes to a billion on a reasonable CPU, in spite of the time spent iterating.

If you don't need to iterate over the primes, then it can be made even faster down to "primesieve" type speeds, but I assume that you do.

If you want to stay simple, an odds-only bit-packed "monolithic" array will do it, take only one bit per odd number up to the sieving range, and be about six times faster than your intset version in spite of the huge array for a range of a billion:

from math import sqrt # needed for the culling limit

iterator primesUpTo(top: uint): uint = # assumes top >= 3
  let topndx = int((top - 3) div 2)    # index of the last odd candidate
  let sqrtndx = (int(sqrt float64(top)) - 3) div 2
  var cmpsts = newSeq[uint32](topndx div 32 + 1) # bit-packed composite flags for odds
  for i in 0 .. sqrtndx: # cull composites for primes up to sqrt(top)
    if (cmpsts[i shr 5] and (1u32 shl (i and 31))) == 0:
      let p = i + i + 3
      for j in countup((p * p - 3) div 2, topndx, p):
        cmpsts[j shr 5] = cmpsts[j shr 5] or (1u32 shl (j and 31))
  yield 2 # separate culling above and iteration here
  for i in 0 .. topndx:
    if (cmpsts[i shr 5] and (1u32 shl (i and 31))) == 0:
      yield uint(i + i + 3)
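
It can be exercised much the same way as the hash-table version above; for example, something along these lines (timing a count to 100 million, mirroring the earlier test):

from times import epochTime

let start = epochTime()
var count = 0
for p in primesUpTo(100_000_000'u):
  count.inc
echo "Found ", count, " primes in ", (epochTime() - start) * 1000.0, " milliseconds."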

GordonBGood · Sep 30 '19