ContainsRange?
I was trying to use roaring.Bitmap in place of an interval tree when I don't want to recover the interval, I just want to see if there's overlap. This works OK, but suffers because I have to either run Contains for each base in my query interval, or create a new Bitmap for each query interval and then use AddRange followed by Intersects. A ContainsRange method, with a few optimizations behind it, would extend the use-cases of the roaring bitmap.
See: https://gist.github.com/brentp/fecdc6d1c16cde4992d96d7c7a205166
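For concreteness, here is a minimal sketch of the two workarounds described above (the function names and the half-open [start, end) bounds are just for illustration; the actual benchmark code is in the gist):

package main

import "github.com/RoaringBitmap/roaring"

// Workaround 1: test every base of the query interval individually.
func overlapsByContains(tree *roaring.Bitmap, start, end uint32) bool {
	for p := start; p < end; p++ {
		if tree.Contains(p) {
			return true
		}
	}
	return false
}

// Workaround 2: build a throw-away bitmap for the query interval,
// then intersect it with the tree bitmap.
func overlapsByIntersects(tree *roaring.Bitmap, start, end uint64) bool {
	q := roaring.New()
	q.AddRange(start, end) // half-open interval [start, end)
	return q.Intersects(tree)
}

The second form avoids per-base probing but allocates a new bitmap for every query, which is exactly what a ContainsRange would make unnecessary.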
My first instinct, when facing performance problems, is not to increase the API size, but to try to see where the bottleneck is.
Here is what I get out of your benchmark on my laptop, copying and pasting your code...
$ go version
go version go1.7.1 darwin/amd64
$ go test -bench Benchmark
testing: warning: no tests to run
BenchmarkBuildSTree-4 30 45734277 ns/op
BenchmarkBuildRBTree-4 100 18264742 ns/op
BenchmarkBuildRoaring-4 50 27331927 ns/op
BenchmarkQuerySRTree-4 20 60219301 ns/op
BenchmarkQueryRBTree-4 30 60122859 ns/op
BenchmarkQueryRoaring-4 30 58329105 ns/op
PASS
ok _/Users/lemire/tmp/tree 10.717s
So it looks like BenchmarkBuildRBTree is quite a bit better than BenchmarkBuildRoaring.
Why might this be? Let us profile it out...
$ go test -bench BenchmarkBuildRBTree -cpuprofile=cpu.out
$ go tool pprof tree.test cpu.out
Entering interactive mode (type "help" for commands)
(pprof) top20
2.03s of 2.15s total (94.42%)
Dropped 22 nodes (cum <= 0.01s)
Showing top 20 nodes out of 73 (cum >= 0.12s)
flat flat% sum% cum cum%
0.85s 39.53% 39.53% 1.42s 66.05% github.com/biogo/store/interval.(*IntNode).insert
0.19s 8.84% 48.37% 0.19s 8.84% runtime.stkbucket
0.14s 6.51% 54.88% 0.14s 6.51% runtime.mach_semaphore_wait
0.14s 6.51% 61.40% 0.50s 23.26% runtime.mallocgc
0.13s 6.05% 67.44% 0.13s 6.05% runtime.mach_semaphore_signal
0.08s 3.72% 71.16% 0.08s 3.72% runtime.heapBitsSetType
0.07s 3.26% 74.42% 0.07s 3.26% runtime.mach_semaphore_timedwait
0.07s 3.26% 77.67% 0.11s 5.12% runtime.scanobject
0.06s 2.79% 80.47% 1.49s 69.30% github.com/biogo/store/interval.(*IntTree).Insert
0.06s 2.79% 83.26% 0.06s 2.79% runtime.memmove
0.04s 1.86% 85.12% 0.05s 2.33% github.com/biogo/store/interval.(*IntNode).rotateLeft
0.04s 1.86% 86.98% 0.04s 1.86% runtime.usleep
0.03s 1.40% 88.37% 0.03s 1.40% runtime.heapBitsForObject
0.03s 1.40% 89.77% 0.06s 2.79% runtime.shade
0.02s 0.93% 90.70% 0.02s 0.93% runtime.memclr
0.02s 0.93% 91.63% 0.52s 24.19% runtime.newobject
0.02s 0.93% 92.56% 0.02s 0.93% runtime/internal/atomic.Cas64
0.02s 0.93% 93.49% 0.02s 0.93% runtime/internal/atomic.Or8
0.01s 0.47% 93.95% 0.02s 0.93% runtime.(*mheap).allocSpanLocked
0.01s 0.47% 94.42% 0.12s 5.58% runtime.convT2I
$ go test -bench BenchmarkBuildRoaring -cpuprofile=cpu.out
$ go tool pprof tree.test cpu.out
1.28s of 1.28s total ( 100%)
Showing top 20 nodes out of 66 (cum >= 0.14s)
flat flat% sum% cum cum%
0.58s 45.31% 45.31% 0.58s 45.31% runtime.memmove
0.14s 10.94% 56.25% 0.14s 10.94% github.com/RoaringBitmap/roaring.(*roaringArray).binarySearch
0.11s 8.59% 64.84% 0.27s 21.09% github.com/RoaringBitmap/roaring.(*runContainer16).union
0.11s 8.59% 73.44% 0.23s 17.97% runtime.mallocgc
0.08s 6.25% 79.69% 0.08s 6.25% runtime.mach_semaphore_signal
0.05s 3.91% 83.59% 1.21s 94.53% github.com/RoaringBitmap/roaring.(*Bitmap).AddRange
0.05s 3.91% 87.50% 0.19s 14.84% runtime.growslice
0.04s 3.12% 90.62% 0.04s 3.12% runtime.memclr
0.02s 1.56% 92.19% 0.02s 1.56% runtime.heapBitsSetType
0.02s 1.56% 93.75% 0.02s 1.56% runtime.mach_semaphore_wait
0.01s 0.78% 94.53% 0.33s 25.78% github.com/RoaringBitmap/roaring.(*runContainer16).iaddRange
0.01s 0.78% 95.31% 0.03s 2.34% github.com/RoaringBitmap/roaring.rangeOfOnes
0.01s 0.78% 96.09% 0.02s 1.56% runtime.heapBitsBulkBarrier
0.01s 0.78% 96.88% 0.01s 0.78% runtime.heapBitsForObject
0.01s 0.78% 97.66% 0.01s 0.78% runtime.newBucket
0.01s 0.78% 98.44% 0.13s 10.16% runtime.newobject
0.01s 0.78% 99.22% 0.01s 0.78% runtime.rewindmorestack
0.01s 0.78% 100% 0.02s 1.56% runtime.stkbucket
0 0% 100% 1.21s 94.53% _/Users/lemire/tmp/tree_test.BenchmarkBuildRoaring
0 0% 100% 0.14s 10.94% github.com/RoaringBitmap/roaring.(*roaringArray).getIndex
So... we are spending 45% of our time on runtime.memmove. But we can see why fairly quickly... The issue derives from this code...
https://github.com/RoaringBitmap/roaring/blob/master/roaringarray.go#L312-L325
Part of the problem is that we repeatedly insert into an array. We could, however, improve the performance. Yet I am not sure it is something we need to worry about. If you are spending that much time building bitmaps, there ought to be a good reason... and whatever that reason is, it must involve a lot more work.
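To illustrate the pattern (a simplified sketch, not the library's actual code): inserting a new container key in the middle of a sorted slice forces every later element to shift by one slot, and the runtime performs that shift with memmove.

// insertAt inserts k into the sorted slice keys at position i.
// The copy call shifts the tail of the slice one slot to the right,
// which is where runtime.memmove shows up in the profile.
func insertAt(keys []uint16, i int, k uint16) []uint16 {
	keys = append(keys, 0)     // grow by one (may also reallocate and copy)
	copy(keys[i+1:], keys[i:]) // shift everything after position i
	keys[i] = k
	return keys
}

Appending at the end is cheap; inserting in the middle is where the cost accumulates.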
If I increase missed to 30, then I get something else...
$ go test -bench Benchmark
testing: warning: no tests to run
BenchmarkBuildSTree-4 30 47642946 ns/op
BenchmarkBuildRBTree-4 100 21720282 ns/op
BenchmarkBuildRoaring-4 50 29983512 ns/op
BenchmarkQuerySRTree-4 5 289810191 ns/op
BenchmarkQueryRBTree-4 3 349252774 ns/op
BenchmarkQueryRoaring-4 2 535153380 ns/op
PASS
Ah. Then it is BenchmarkQueryRoaring that suffers. Let us try to find why...
$ go test -bench BenchmarkQueryRoaring -cpuprofile=cpu.out
$ go tool pprof tree.test cpu.out
(pprof) top10
1260ms of 1630ms total (77.30%)
Showing top 10 nodes out of 85 (cum >= 490ms)
flat flat% sum% cum cum%
290ms 17.79% 17.79% 580ms 35.58% runtime.mallocgc
220ms 13.50% 31.29% 500ms 30.67% runtime.growslice
210ms 12.88% 44.17% 210ms 12.88% runtime.mach_semaphore_signal
160ms 9.82% 53.99% 900ms 55.21% github.com/RoaringBitmap/roaring.(*Bitmap).AddRange
110ms 6.75% 60.74% 110ms 6.75% github.com/RoaringBitmap/roaring.(*roaringArray).advanceUntil
70ms 4.29% 65.03% 70ms 4.29% runtime.memmove
60ms 3.68% 68.71% 60ms 3.68% runtime.mach_semaphore_wait
50ms 3.07% 71.78% 50ms 3.07% github.com/RoaringBitmap/roaring.(*runContainer16).search
50ms 3.07% 74.85% 50ms 3.07% runtime.heapBitsSetType
40ms 2.45% 77.30% 490ms 30.06% github.com/RoaringBitmap/roaring.(*Bitmap).Intersects
So it seems that memory allocation is a big problem in this case. Ok. Let us look at the code...
for k := 0; k < missed; k++ {
	o := roaring.New()
	o.AddRange(uint64(iv.End+3), uint64(iv.End+3+i_length))
	o.Intersects(tree)
}
That does not seem right. Let us rewrite this code...
o := roaring.New()
o.AddRange(uint64(iv.End+3), uint64(iv.End+3+i_length))
for k := 0; k < missed; k++ {
	o.Intersects(tree)
}
Then we get...
$ go test -bench BenchmarkQuery
testing: warning: no tests to run
BenchmarkQuerySRTree-4 5 279626812 ns/op
BenchmarkQueryRBTree-4 5 308945598 ns/op
BenchmarkQueryRoaring-4 10 189430939 ns/op
PASS
ok _/Users/lemire/tmp/tree 7.653s
So, right... repeatedly creating temporary objects is quite bad. I submit to you that this is not specific to roaring but true of Go in general. In Go, you are not supposed to be allocating lots of tiny temporary objects. The GC is not optimized for this kind of code.
Ok. So, yes, we have a target. We could have a contains method that returns true if there is something in the range. As long as this does away with the memory allocation, it should boost the speed considerably.
I think it is a reasonable implementation target and can be done smartly so that it is easy to support.
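For what it's worth, the check can already be expressed without allocation on top of the existing API, for instance via Rank. This is only a sketch of the idea (it assumes Rank(x) counts the values less than or equal to x), not a proposal for how ContainsRange should be implemented; a dedicated method could do better by stopping at the first container that intersects the range.

// rangeNonEmpty reports whether rb contains any value in the
// half-open interval [start, end), without allocating.
func rangeNonEmpty(rb *roaring.Bitmap, start, end uint32) bool {
	if start >= end {
		return false
	}
	var before uint64
	if start > 0 {
		before = rb.Rank(start - 1) // values strictly below start
	}
	return rb.Rank(end-1) > before // any value left in [start, end)?
}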
Thanks for the analysis. I was more concerned with the Query time than the creation.
Re this:
That does not seem right. Let us rewrite this code...
o := roaring.New()
o.AddRange(uint64(iv.End+3), uint64(iv.End+3+i_length))
for k := 0; k < missed; k++ {
	o.Intersects(tree)
}
Then we get...
that side-steps what the benchmark is trying to do. It's a lazy way to simulate missed random intervals that are less likely to have an overlap. Since it's random inside another random loop, I just made it do the same thing missed times.
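For what it's worth, the intent could probably be kept while still avoiding the per-iteration allocation by reusing one query bitmap, along the lines of the sketch below (it assumes Clear resets the bitmap to empty, and randomInterval is a placeholder for however the benchmark picks its missed intervals):

q := roaring.New()
for k := 0; k < missed; k++ {
	q.Clear()                  // reuse the same bitmap instead of allocating a new one
	lo, hi := randomInterval() // placeholder for the benchmark's random interval choice
	q.AddRange(lo, hi)
	q.Intersects(tree)
}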
Regardless, yes, I agree this could be a nice optimization.
@brentp
that side-steps what the benchmark is trying to do
To be clear, I understood that part. The point of my change was to demonstrate that the performance problem was tied to the creation of many temporary objects.