Primes.jl
Primes.jl copied to clipboard
Segmented prime sieve for fun and profit
Closes https://github.com/JuliaMath/Primes.jl/issues/45.
I have this code lying around for a bit now, but never turned it into a PR. Now I'm turning it into a PR to force myself to finish it :sweat_smile:. It could use some tests and maybe some code style improvements / cleanup.
What it brings is an iterator:
using Primes.SegmentedSieve: SegmentIterator, vec_count_ones
function count_primes(a, b)
cnt = 0
for segment in SegmentIterator(a:b, 32 * 1024)
cnt += vec_count_ones(segment.segment)
end
return cnt
end
which makes some operations very fast, as they operate on O(1) memory instead of O(n). For reference:
julia> @time count(Primes.primesmask(2^31, 2^32))
7.362490 seconds (3.86 k allocations: 802.346 MiB, 1.18% gc time)
98182656
julia> @time count_primes(2^31, 2^32)
0.452503 seconds (2.19 k allocations: 375.266 KiB)
98182656
this is 16x faster.
There's some neat optimizations: you can pick a buffer size to make things fit in L1 cache (above it's 32kb); the inner loop is unrolled by hand such that 8 multiples are removed per iteration; memory is compressed s.t. 1 byte equals an interval of 30 numbers (through a {2, 3, 5}-wheel); multiples of {7, 11, 13, 17} are "pre-sieved" in the sense that a buffer of size 7 * 11 * 13 * 17 is sieved and repeatedly copied over, instead of actually sieving 7, 11, 13 and 17 each over the full range.
You can also print the segment if you like:
julia> for segment in SegmentIterator(1:180, 2)
println(segment)
end
01 07 11 13 17 19 23 29
00 . x x x x x x x
30 x x x x x . x x
01 07 11 13 17 19 23 29
60 x x x x . x x x
90 . x x x x x x .
01 07 11 13 17 19 23 29
120 . x x . x x . x
150 x x . x x . x x
So, here it has a buffer of 2 bytes, resembling an interval of 60 numbers, meaning it needs 3 iterations. Every x
marks a prime number.
Is there a documentation or a paper this algorithm is based on? I would like to help review it.
I think primesieve.org is a good reference. The wheel is a generalization of the idea of skipping multiples of 2 (which is easy to code), see https://en.wikipedia.org/wiki/Wheel_factorization for a good overview.
If you take a {2, 3, 5}-wheel, then consider any number z = 2*3*5*x + y = 30*x + y
, where y = 1, ..., 30. Then z is composite if y is a multiple of 2, 3 or 5. So you only have to verify y = {1, 7, 11, 13, 17, 19, 23, 29} are prime for all x. Those are 8 numbers, and them being prime or not is 1 bit of info, so this neatly compressed to 1 byte of data per interval of 30 numbers on the number line.
The next optimization after compressing the number line by a constant factor is to work on constant size segments of the number line. So instead of 1:n
, you work in segments 1:b
, b+1:2b
, 2b+1:3b
, ..., m*b+1:n
, and you allocate the boolean array once and for all. Effectively you do:
for (a, b) in segments # sieve a:b
fill!(segment, 1)
for p in primes_so_far
# find the smallest number q st p * q >= a.
# then remove p * q, p * (q + 1), p * (q + 2), ... until p * (q + m) > b from segment.
end
# find all primes in `segment` and push to `primes_so_far`
end
the inner loop is finicky, cause (a) you don't want to always compute the smallest q s.t. p * q >= a, instead you already know what it is thanks to the last iteration in the previous segment. and (b) you aren't going to remove multiples q, q+1, q+2, ... as you're skipping all p * (q+i)'s that aren't in the {1, 7, 11, 13, 17, 19, 23, 29} spokes -- and this is where the loop is unrolled by hand, and where all the pain and suffering comes from. There's going to be 8 loops for all possible p mod 30's, and 8 steps inside that loop for all q mod 30 values, leading to 64 entrypoints into the loops, and the main optimization is going to be here that instead of doing "cross off q * p, increment q, check if p * q is inside the segment" you're adding a hot loop where you're checking if you can cross off e.g. p * [q, q + 6, q + 10, q + 12, q + 16, q + 18, q + 22, q + 28] at once.
Thank you. I am already familiar with wheel factorization and the the Segmented Sieve (Bays-Hudson). I have to think a bit about combining the two. I appreciate the long response and I will think about it. I really like the idea of combining the 8 bits into one byte.
I have not thought about it very much, but this paper has some other little optimizations, under Algorithms B and C.
https://link.springer.com/content/pdf/10.1007/BF01932283.pdf
Is your segmented sieve similar to this? Or are you using a completely different approach which is incompatible. Maybe it is incompatible, because this paper basically uses the usual 2-wheel, but I really have to think about it.
Want to merge update and this? It's a lot of extra code, but I've become convinced that something like this is necessary for optimal performance.
Want to merge update and this? It's a lot of extra code, but I've become convinced that something like this is necessary for optimal performance.