Primes.jl icon indicating copy to clipboard operation
Primes.jl copied to clipboard

Segmented prime sieve for fun and profit

Open haampie opened this issue 4 years ago • 6 comments

Closes https://github.com/JuliaMath/Primes.jl/issues/45.

I have this code lying around for a bit now, but never turned it into a PR. Now I'm turning it into a PR to force myself to finish it :sweat_smile:. It could use some tests and maybe some code style improvements / cleanup.

What it brings is an iterator:

using Primes.SegmentedSieve: SegmentIterator, vec_count_ones

function count_primes(a, b)
    cnt = 0
    for segment in SegmentIterator(a:b, 32 * 1024)
        cnt += vec_count_ones(segment.segment)
    end
    return cnt
end

which makes some operations very fast, as they operate on O(1) memory instead of O(n). For reference:

julia> @time count(Primes.primesmask(2^31, 2^32))
  7.362490 seconds (3.86 k allocations: 802.346 MiB, 1.18% gc time)
98182656

julia> @time count_primes(2^31, 2^32)
  0.452503 seconds (2.19 k allocations: 375.266 KiB)
98182656

this is 16x faster.

There's some neat optimizations: you can pick a buffer size to make things fit in L1 cache (above it's 32kb); the inner loop is unrolled by hand such that 8 multiples are removed per iteration; memory is compressed s.t. 1 byte equals an interval of 30 numbers (through a {2, 3, 5}-wheel); multiples of {7, 11, 13, 17} are "pre-sieved" in the sense that a buffer of size 7 * 11 * 13 * 17 is sieved and repeatedly copied over, instead of actually sieving 7, 11, 13 and 17 each over the full range.

You can also print the segment if you like:

julia> for segment in SegmentIterator(1:180, 2)
           println(segment)
       end
    01  07  11  13  17  19  23  29  
00   .   x   x   x   x   x   x   x  
30   x   x   x   x   x   .   x   x  

    01  07  11  13  17  19  23  29  
60   x   x   x   x   .   x   x   x  
90   .   x   x   x   x   x   x   .  

     01  07  11  13  17  19  23  29  
120   .   x   x   .   x   x   .   x  
150   x   x   .   x   x   .   x   x 

So, here it has a buffer of 2 bytes, resembling an interval of 60 numbers, meaning it needs 3 iterations. Every x marks a prime number.

haampie avatar Sep 07 '20 14:09 haampie

Is there a documentation or a paper this algorithm is based on? I would like to help review it.

galanakis avatar Dec 11 '21 23:12 galanakis

I think primesieve.org is a good reference. The wheel is a generalization of the idea of skipping multiples of 2 (which is easy to code), see https://en.wikipedia.org/wiki/Wheel_factorization for a good overview.

If you take a {2, 3, 5}-wheel, then consider any number z = 2*3*5*x + y = 30*x + y, where y = 1, ..., 30. Then z is composite if y is a multiple of 2, 3 or 5. So you only have to verify y = {1, 7, 11, 13, 17, 19, 23, 29} are prime for all x. Those are 8 numbers, and them being prime or not is 1 bit of info, so this neatly compressed to 1 byte of data per interval of 30 numbers on the number line.

The next optimization after compressing the number line by a constant factor is to work on constant size segments of the number line. So instead of 1:n, you work in segments 1:b, b+1:2b, 2b+1:3b, ..., m*b+1:n, and you allocate the boolean array once and for all. Effectively you do:

for (a, b) in segments # sieve a:b
  fill!(segment, 1)
  for p in primes_so_far
    # find the smallest number q st p * q >= a.
    # then remove p * q, p * (q + 1), p * (q + 2), ... until p * (q + m) > b from segment.
  end
  # find all primes in `segment` and push to `primes_so_far`
end

the inner loop is finicky, cause (a) you don't want to always compute the smallest q s.t. p * q >= a, instead you already know what it is thanks to the last iteration in the previous segment. and (b) you aren't going to remove multiples q, q+1, q+2, ... as you're skipping all p * (q+i)'s that aren't in the {1, 7, 11, 13, 17, 19, 23, 29} spokes -- and this is where the loop is unrolled by hand, and where all the pain and suffering comes from. There's going to be 8 loops for all possible p mod 30's, and 8 steps inside that loop for all q mod 30 values, leading to 64 entrypoints into the loops, and the main optimization is going to be here that instead of doing "cross off q * p, increment q, check if p * q is inside the segment" you're adding a hot loop where you're checking if you can cross off e.g. p * [q, q + 6, q + 10, q + 12, q + 16, q + 18, q + 22, q + 28] at once.

haampie avatar Dec 11 '21 23:12 haampie

Thank you. I am already familiar with wheel factorization and the the Segmented Sieve (Bays-Hudson). I have to think a bit about combining the two. I appreciate the long response and I will think about it. I really like the idea of combining the 8 bits into one byte.

galanakis avatar Dec 12 '21 01:12 galanakis

I have not thought about it very much, but this paper has some other little optimizations, under Algorithms B and C.

https://link.springer.com/content/pdf/10.1007/BF01932283.pdf

Is your segmented sieve similar to this? Or are you using a completely different approach which is incompatible. Maybe it is incompatible, because this paper basically uses the usual 2-wheel, but I really have to think about it.

galanakis avatar Dec 12 '21 01:12 galanakis

Want to merge update and this? It's a lot of extra code, but I've become convinced that something like this is necessary for optimal performance.

oscardssmith avatar May 25 '22 20:05 oscardssmith

Want to merge update and this? It's a lot of extra code, but I've become convinced that something like this is necessary for optimal performance.

oscardssmith avatar May 25 '22 20:05 oscardssmith