
ContainsRange ?

Open brentp opened this issue 8 years ago • 4 comments

I was trying to use roaring.Bitmap in place of an interval tree for cases where I don't need to recover the interval itself; I just want to know whether there's an overlap.

This works OK, but it suffers because I have to either run Contains for each base in my query interval, or create a new Bitmap for each query interval and then use AddRange followed by Intersects.

A ContainsRange method could apply several optimizations here and would extend the use-cases of the roaring bitmap.
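
For concreteness, here is a minimal sketch of the two workarounds (the helper names and the [start, end) parameters are made up for this sketch; Contains, AddRange and Intersects are the existing roaring methods mentioned above):

    // Workaround 1: test every base of the query interval individually.
    func overlapsByContains(tree *roaring.Bitmap, start, end uint32) bool {
        for p := start; p < end; p++ {
            if tree.Contains(p) {
                return true
            }
        }
        return false
    }

    // Workaround 2: materialize the query interval as a temporary bitmap.
    func overlapsByIntersects(tree *roaring.Bitmap, start, end uint64) bool {
        o := roaring.New()
        o.AddRange(start, end)
        return o.Intersects(tree)
    }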

brentp avatar Jan 06 '17 23:01 brentp

See: https://gist.github.com/brentp/fecdc6d1c16cde4992d96d7c7a205166

brentp avatar Jan 06 '17 23:01 brentp

My first instinct, when facing performance problems, is not to increase the API size, but to try to see where the bottleneck is.

Here is what I get out of your benchmark on my laptop, copying and pasting your code...

$ go version
go version go1.7.1 darwin/amd64
$ go test -bench Benchmark
testing: warning: no tests to run
BenchmarkBuildSTree-4     	      30	  45734277 ns/op
BenchmarkBuildRBTree-4    	     100	  18264742 ns/op
BenchmarkBuildRoaring-4   	      50	  27331927 ns/op
BenchmarkQuerySRTree-4    	      20	  60219301 ns/op
BenchmarkQueryRBTree-4    	      30	  60122859 ns/op
BenchmarkQueryRoaring-4   	      30	  58329105 ns/op
PASS
ok  	_/Users/lemire/tmp/tree	10.717s

So it looks like BenchmarkBuildRBTree is quite a bit better than BenchmarkBuildRoaring.

Why might this be? Let us profile it out...

$ go test -bench BenchmarkBuildRBTree -cpuprofile=cpu.out
$ go tool pprof tree.test cpu.out
Entering interactive mode (type "help" for commands)
(pprof) top20
2.03s of 2.15s total (94.42%)
Dropped 22 nodes (cum <= 0.01s)
Showing top 20 nodes out of 73 (cum >= 0.12s)
      flat  flat%   sum%        cum   cum%
     0.85s 39.53% 39.53%      1.42s 66.05%  github.com/biogo/store/interval.(*IntNode).insert
     0.19s  8.84% 48.37%      0.19s  8.84%  runtime.stkbucket
     0.14s  6.51% 54.88%      0.14s  6.51%  runtime.mach_semaphore_wait
     0.14s  6.51% 61.40%      0.50s 23.26%  runtime.mallocgc
     0.13s  6.05% 67.44%      0.13s  6.05%  runtime.mach_semaphore_signal
     0.08s  3.72% 71.16%      0.08s  3.72%  runtime.heapBitsSetType
     0.07s  3.26% 74.42%      0.07s  3.26%  runtime.mach_semaphore_timedwait
     0.07s  3.26% 77.67%      0.11s  5.12%  runtime.scanobject
     0.06s  2.79% 80.47%      1.49s 69.30%  github.com/biogo/store/interval.(*IntTree).Insert
     0.06s  2.79% 83.26%      0.06s  2.79%  runtime.memmove
     0.04s  1.86% 85.12%      0.05s  2.33%  github.com/biogo/store/interval.(*IntNode).rotateLeft
     0.04s  1.86% 86.98%      0.04s  1.86%  runtime.usleep
     0.03s  1.40% 88.37%      0.03s  1.40%  runtime.heapBitsForObject
     0.03s  1.40% 89.77%      0.06s  2.79%  runtime.shade
     0.02s  0.93% 90.70%      0.02s  0.93%  runtime.memclr
     0.02s  0.93% 91.63%      0.52s 24.19%  runtime.newobject
     0.02s  0.93% 92.56%      0.02s  0.93%  runtime/internal/atomic.Cas64
     0.02s  0.93% 93.49%      0.02s  0.93%  runtime/internal/atomic.Or8
     0.01s  0.47% 93.95%      0.02s  0.93%  runtime.(*mheap).allocSpanLocked
     0.01s  0.47% 94.42%      0.12s  5.58%  runtime.convT2I
$ go test -bench BenchmarkBuildRoaring -cpuprofile=cpu.out
$  go tool pprof tree.test cpu.out
1.28s of 1.28s total (  100%)
Showing top 20 nodes out of 66 (cum >= 0.14s)
      flat  flat%   sum%        cum   cum%
     0.58s 45.31% 45.31%      0.58s 45.31%  runtime.memmove
     0.14s 10.94% 56.25%      0.14s 10.94%  github.com/RoaringBitmap/roaring.(*roaringArray).binarySearch
     0.11s  8.59% 64.84%      0.27s 21.09%  github.com/RoaringBitmap/roaring.(*runContainer16).union
     0.11s  8.59% 73.44%      0.23s 17.97%  runtime.mallocgc
     0.08s  6.25% 79.69%      0.08s  6.25%  runtime.mach_semaphore_signal
     0.05s  3.91% 83.59%      1.21s 94.53%  github.com/RoaringBitmap/roaring.(*Bitmap).AddRange
     0.05s  3.91% 87.50%      0.19s 14.84%  runtime.growslice
     0.04s  3.12% 90.62%      0.04s  3.12%  runtime.memclr
     0.02s  1.56% 92.19%      0.02s  1.56%  runtime.heapBitsSetType
     0.02s  1.56% 93.75%      0.02s  1.56%  runtime.mach_semaphore_wait
     0.01s  0.78% 94.53%      0.33s 25.78%  github.com/RoaringBitmap/roaring.(*runContainer16).iaddRange
     0.01s  0.78% 95.31%      0.03s  2.34%  github.com/RoaringBitmap/roaring.rangeOfOnes
     0.01s  0.78% 96.09%      0.02s  1.56%  runtime.heapBitsBulkBarrier
     0.01s  0.78% 96.88%      0.01s  0.78%  runtime.heapBitsForObject
     0.01s  0.78% 97.66%      0.01s  0.78%  runtime.newBucket
     0.01s  0.78% 98.44%      0.13s 10.16%  runtime.newobject
     0.01s  0.78% 99.22%      0.01s  0.78%  runtime.rewindmorestack
     0.01s  0.78%   100%      0.02s  1.56%  runtime.stkbucket
         0     0%   100%      1.21s 94.53%  _/Users/lemire/tmp/tree_test.BenchmarkBuildRoaring
         0     0%   100%      0.14s 10.94%  github.com/RoaringBitmap/roaring.(*roaringArray).getIndex

So... we are spending 45% of our time on runtime.memmove. But we can see why fairly quickly... The issue derives from this code...

https://github.com/RoaringBitmap/roaring/blob/master/roaringarray.go#L312-L325

Part of the problem is that we repeatedly insert into an array. We could, however, improve the performance. Yet I am not sure it is something we need to worry about: if you are spending that much time building bitmaps, there ought to be a good reason... and whatever that reason is, it must involve a lot more work.
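
To see where the memmove comes from, here is a simplified sketch of the pattern (not the actual roaringArray code): inserting a new container key into the middle of a sorted slice has to shift everything after the insertion point.

    // Simplified sketch of insertion into a sorted slice; each call shifts
    // the tail of the slice, which is the memmove seen in the profile.
    func insertAt(keys []uint16, pos int, key uint16) []uint16 {
        keys = append(keys, 0)         // grow by one (may reallocate)
        copy(keys[pos+1:], keys[pos:]) // shift the tail one slot to the right
        keys[pos] = key
        return keys
    }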

If I increase missed to 30, then I get something else...

$ go test -bench Benchmark
testing: warning: no tests to run
BenchmarkBuildSTree-4     	      30	  47642946 ns/op
BenchmarkBuildRBTree-4    	     100	  21720282 ns/op
BenchmarkBuildRoaring-4   	      50	  29983512 ns/op
BenchmarkQuerySRTree-4    	       5	 289810191 ns/op
BenchmarkQueryRBTree-4    	       3	 349252774 ns/op
BenchmarkQueryRoaring-4   	       2	 535153380 ns/op
PASS

Ah. Then it is BenchmarkQueryRoaring that suffers. Let us try to find why...

$ go test -bench BenchmarkQueryRoaring -cpuprofile=cpu.out
$ go tool pprof tree.test cpu.out
(pprof) top10
1260ms of 1630ms total (77.30%)
Showing top 10 nodes out of 85 (cum >= 490ms)
      flat  flat%   sum%        cum   cum%
     290ms 17.79% 17.79%      580ms 35.58%  runtime.mallocgc
     220ms 13.50% 31.29%      500ms 30.67%  runtime.growslice
     210ms 12.88% 44.17%      210ms 12.88%  runtime.mach_semaphore_signal
     160ms  9.82% 53.99%      900ms 55.21%  github.com/RoaringBitmap/roaring.(*Bitmap).AddRange
     110ms  6.75% 60.74%      110ms  6.75%  github.com/RoaringBitmap/roaring.(*roaringArray).advanceUntil
      70ms  4.29% 65.03%       70ms  4.29%  runtime.memmove
      60ms  3.68% 68.71%       60ms  3.68%  runtime.mach_semaphore_wait
      50ms  3.07% 71.78%       50ms  3.07%  github.com/RoaringBitmap/roaring.(*runContainer16).search
      50ms  3.07% 74.85%       50ms  3.07%  runtime.heapBitsSetType
      40ms  2.45% 77.30%      490ms 30.06%  github.com/RoaringBitmap/roaring.(*Bitmap).Intersects

So it seems that memory allocation is a big problem in this case. Ok. Let us look at the code...

    for k := 0; k < missed; k++ {
        o := roaring.New()
        o.AddRange(uint64(iv.End+3), uint64(iv.End+3+i_length))
        o.Intersects(tree)
    }

That does not seem right. Let us rewrite this code...

    o := roaring.New()
    o.AddRange(uint64(iv.End+3), uint64(iv.End+3+i_length))
    for k := 0; k < missed; k++ {
        o.Intersects(tree)
    }

Then we get...

$ go test -bench BenchmarkQuery
testing: warning: no tests to run
BenchmarkQuerySRTree-4    	       5	 279626812 ns/op
BenchmarkQueryRBTree-4    	       5	 308945598 ns/op
BenchmarkQueryRoaring-4   	      10	 189430939 ns/op
PASS
ok  	_/Users/lemire/tmp/tree	7.653s

So, right... repeatedly creating temporary objects is quite bad. I submit to you that this is not specific to roaring but applies to Go in general: in Go, you are not supposed to allocate lots of tiny temporary objects, because the GC is not optimized for that kind of code.

Ok. So, yes, we have a target. We could have a contains method that returns true if there is something in the range. As long as this does away with memory allocation, it should boost the speed considerably.

I think it is a reasonable implementation target and can be done smartly so that it is easy to support.
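
As a rough illustration of that target (not a proposal for the eventual API), an allocation-free check of whether any value in a range is present can be built on Rank, assuming that method is available in the version you use; a dedicated method inside the library could work per container and be faster still:

    // intersectsRange reports whether any value in [lo, hi) is set in bm,
    // without allocating a temporary bitmap. Rank(x) counts set values <= x.
    func intersectsRange(bm *roaring.Bitmap, lo, hi uint32) bool {
        if hi <= lo {
            return false
        }
        var below uint64
        if lo > 0 {
            below = bm.Rank(lo - 1)
        }
        return bm.Rank(hi-1) > below
    }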

lemire avatar Jan 07 '17 02:01 lemire

Thanks for the analysis. I was more concerned with the Query time than with the creation time.

Re this:

That does not seem right. Let us rewrite this code...

    o := roaring.New()
    o.AddRange(uint64(iv.End+3), uint64(iv.End+3+i_length))
    for k := 0; k < missed; k++ {
        o.Intersects(tree)
    }

Then we get...

that side-steps what the benchmark is trying to do. It's a lazy way to simulate missed random intervals that are less likely to have an overlap; since it's a random interval inside another random loop, I just made it do the same thing missed times.
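
(One way to keep a different interval on each iteration while still avoiding the per-iteration allocation would be to reuse a single bitmap and clear it between iterations; a sketch, assuming Clear is available, with the varying offset standing in for the randomness:)

    o := roaring.New()
    for k := 0; k < missed; k++ {
        o.Clear()                       // empty the reused bitmap instead of allocating a new one
        start := uint64(iv.End + 3 + k) // stand-in for a random offset
        o.AddRange(start, start+uint64(i_length))
        o.Intersects(tree)
    }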

Regardless, yes, I agree this could be a nice optimization.

brentp avatar Jan 07 '17 02:01 brentp

@brentp

that side-steps what the benchmark is trying to do

To be clear, I understood that part. The point of my change was to demonstrate that the performance problem was tied to the creation of many temporary objects.

lemire avatar Jan 07 '17 05:01 lemire