fpc
fpc copied to clipboard
Enable reader/writer re-use, reduce allocations, fix benchmarks
Overview
This is a very cool project! While integrating the library into a project, I found a few places where performance could be improved relatively easily. My specific use case involved writing many sequences separately, requiring many new writers. This caused a lot of overhead, which I’ve now mostly removed.
While working with the benchmark suite I found the encoding benchmarks were misleading. If the number of iterations of the benchmark exceeded the number of values in the slice, the encoder would essentially be a ‘no-op’ since there was no more data. This dramatically increased the apparent throughput. In addition, byte buffers were re-used for each iteration, this isn’t possible. The ‘Reset’ method clears the buffer, again resulting in a no-op read, leading to inflated numbers. I’ve fixed these benchmarks and added more for my specific use case.
Benchmarks
A summary of notable benchmark changes is below with a 13900K, after the benchmark fixes:
Benchmark | Original MB/sec | New MB/sec |
---|---|---|
BenchmarkBlockEncode | 650.27MB/s | 1,000.78MB/s |
BenchmarkLeadingZeroBytes | 10,610.15MB/s | 83,906.17MB/s |
BenchmarkWriter | 6,490.96MB/s | 7,141.55MB/s |
BenchmarkReadFloats | 266.74MB/s | 320.11MB/s |
BenchmarkReadFloatsRepeated (new) | 264.93MB/s | 320.78MB/s |
BenchmarkReadFloatsRepeatedReuse (new) | 482.36MB/s | |
BenchmarkLargeEncode (new) | 650.72MB/s | 744.95MB/s |
BenchmarkLargeEncodeReuse (new) | 815.66MB/s |
The BenchmarkWriter performance is so high I’m suspicious… Not listed are reduced allocations and RAM usage which have both dropped significantly.
Summary of changes
- clz replaced with native Go method that uses native CPU instruction if available
- All major structs updated to re-use buffers where possible
- Almost zero allocation once 'warmed up'
- Clear / Reset methods added where possible to allow re-use of structs
- readFull change by @kamstrup integrated
- Updated Go version so we have access to clear()
I’ve not run a profiler or gone to extreme lengths to optimise so I’m sure there’s more on the table.