fpc icon indicating copy to clipboard operation
fpc copied to clipboard

Enable reader/writer re-use, reduce allocations, fix benchmarks

Open danclarke opened this issue 1 year ago • 0 comments

Overview

This is a very cool project! While integrating the library into a project, I found a few places where performance could be improved relatively easily. My specific use case involved writing many sequences separately, requiring many new writers. This caused a lot of overhead, which I’ve now mostly removed.

While working with the benchmark suite I found the encoding benchmarks were misleading. If the number of iterations of the benchmark exceeded the number of values in the slice, the encoder would essentially be a ‘no-op’ since there was no more data. This dramatically increased the apparent throughput. In addition, byte buffers were re-used for each iteration, this isn’t possible. The ‘Reset’ method clears the buffer, again resulting in a no-op read, leading to inflated numbers. I’ve fixed these benchmarks and added more for my specific use case.

Benchmarks

A summary of notable benchmark changes is below with a 13900K, after the benchmark fixes:

Benchmark Original MB/sec New MB/sec
BenchmarkBlockEncode 650.27MB/s 1,000.78MB/s
BenchmarkLeadingZeroBytes 10,610.15MB/s 83,906.17MB/s
BenchmarkWriter 6,490.96MB/s 7,141.55MB/s
BenchmarkReadFloats 266.74MB/s 320.11MB/s
BenchmarkReadFloatsRepeated (new) 264.93MB/s 320.78MB/s
BenchmarkReadFloatsRepeatedReuse (new) 482.36MB/s
BenchmarkLargeEncode (new) 650.72MB/s 744.95MB/s
BenchmarkLargeEncodeReuse (new) 815.66MB/s

The BenchmarkWriter performance is so high I’m suspicious… Not listed are reduced allocations and RAM usage which have both dropped significantly.

Summary of changes

  • clz replaced with native Go method that uses native CPU instruction if available
  • All major structs updated to re-use buffers where possible
  • Almost zero allocation once 'warmed up'
  • Clear / Reset methods added where possible to allow re-use of structs
  • readFull change by @kamstrup integrated
  • Updated Go version so we have access to clear()

I’ve not run a profiler or gone to extreme lengths to optimise so I’m sure there’s more on the table.

danclarke avatar Oct 02 '23 21:10 danclarke