Improve summary performance by replacing the slice with container/list (doubly-linked list)
While benchmarking, we found that the underlying summary implementation, i.e. bquant, wastes a lot of CPU in runtime.memmove.
The existing benchmark, https://github.com/beorn7/perks/blob/master/quantile/bench_test.go#L7-L23, does not reflect real-world performance because the recorded values are monotonically increasing, so the code paths that call copy are never reached.
After switching to container/list, we get better performance when inserting a series of random float64 values, which is much closer to the real scenario.
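For reference, a benchmark along those lines could look like the sketch below (the benchmark name and epsilon targets are assumptions for illustration, not the exact code we ran):

```go
// hypothetical quantile/bench_random_test.go
package quantile

import (
	"math/rand"
	"testing"
)

// BenchmarkInsertTargetedRandom inserts random values so that new samples land
// in the middle of the sorted buffer and the copy/shift path is exercised,
// unlike the monotonically increasing inserts in the existing benchmark.
func BenchmarkInsertTargetedRandom(b *testing.B) {
	s := NewTargeted(map[float64]float64{0.50: 0.05, 0.90: 0.01, 0.99: 0.001})
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		s.Insert(rand.Float64())
	}
}
```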
Although the doubly-linked list introduces object allocations, it achieves roughly double the QPS in the micro-benchmark:
Linked List:

```
goos: darwin
goarch: arm64
pkg: github.com/beorn7/perks/quantile
BenchmarkInsertTargetedSmallEpsilon
BenchmarkInsertTargetedSmallEpsilon-20    6934218    169.8 ns/op    96 B/op    3 allocs/op
PASS
```

Slice:

```
goos: darwin
goarch: arm64
pkg: github.com/beorn7/perks/quantile
BenchmarkInsertTargetedSmallEpsilon
BenchmarkInsertTargetedSmallEpsilon-20    3421015    343.3 ns/op    0 B/op    0 allocs/op
PASS
```
The implementation can be found here: https://github.com/lujiajing1126/perks/tree/use-linked-list
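For illustration, the core idea looks roughly like the following (a simplified sketch, not the actual code in that branch; the sample struct here is trimmed down):

```go
package main

import (
	"container/list"
	"fmt"
)

// sample is a trimmed-down stand-in for the stream's sample type.
type sample struct {
	Value float64
	Width float64
}

// insertSorted splices a new sample in front of the first element with a
// larger value. With a slice, the same insert would shift (copy) every element
// after the insertion point; with container/list it is only a pointer update.
func insertSorted(l *list.List, s sample) *list.Element {
	for e := l.Front(); e != nil; e = e.Next() {
		if e.Value.(sample).Value > s.Value {
			return l.InsertBefore(s, e)
		}
	}
	return l.PushBack(s)
}

func main() {
	l := list.New()
	for _, v := range []float64{0.7, 0.1, 0.4} {
		insertSorted(l, sample{Value: v, Width: 1})
	}
	for e := l.Front(); e != nil; e = e.Next() {
		fmt.Println(e.Value.(sample).Value) // prints 0.1, 0.4, 0.7
	}
}
```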
The question is how much CPU is then burned by the additional GC this causes. In the past, we were certainly very paranoid about any kind of allocation. With better GC, this might be unfounded by now. But I guess we need a real-world end-to-end benchmark rather than a micro-benchmark…
Thanks for the reminder. We will set up a real-world E2E test to see whether it helps and, if so, by how much.
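For what it's worth, one rough way to see the GC share of CPU under such a workload (just a sketch assuming the beorn7/perks/quantile API, not the actual E2E setup) is to read runtime.MemStats.GCCPUFraction after a sustained run of inserts:

```go
package main

import (
	"fmt"
	"math/rand"
	"runtime"

	"github.com/beorn7/perks/quantile"
)

func main() {
	s := quantile.NewTargeted(map[float64]float64{0.50: 0.05, 0.90: 0.01, 0.99: 0.001})

	// Sustained insert workload with random values.
	for i := 0; i < 10_000_000; i++ {
		s.Insert(rand.Float64())
	}

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// GCCPUFraction is the fraction of this program's available CPU time
	// used by the GC since the program started.
	fmt.Printf("GC CPU fraction: %.4f\n", m.GCCPUFraction)
}
```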
Closing for now as promised, let us know if you need this to be reopened! 🤗