
Optimize by reducing heap allocations

Open · jgaskins opened this pull request 2 years ago • 0 comments

This PR optimizes the number of heap allocations performed in each request. For example, Array(UInt8).new(0) allocates an array on the heap whereas Bytes.empty involves no heap allocations — the Slice is allocated on the stack with no buffer on the heap because it isn't necessary to allocate an empty buffer.
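A minimal sketch of the difference (the variable names here are illustrative, not taken from the PR diff):

```crystal
# Array(UInt8).new(0) allocates the Array object itself on the heap,
# giving the GC another object to track even when it holds nothing.
empty_array = Array(UInt8).new(0)

# Bytes is an alias for Slice(UInt8), a value type (pointer + size) that
# lives on the stack; an empty Bytes needs no heap buffer at all.
empty_bytes = Bytes.empty

empty_bytes.size # => 0
```

Because Slice is a struct, returning `Bytes.empty` from a hot path costs nothing per call, whereas each `Array(UInt8).new(0)` is a fresh GC-tracked allocation.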

I wrote up a benchmark to show how much of a difference this made. It's difficult to benchmark because each query spends only a small fraction of its time actively using the CPU in comparison to the amount of time it spends waiting on I/O. At scale, though, with several requests all hitting the cache at once, this can matter a lot so I thought I'd try to optimize where I could.

Benchmark code
require "benchmark"

require "../src/memcached"

mc = Memcached::Client.new

puts
puts "Getting a million values"
keys = Array.new(1_000_000, &.to_s)
pp Benchmark.measure { mc.get_multi keys }
pp Benchmark.memory { mc.get_multi keys }

Benchmark.ips do |x|
  x.report "get hit" { mc.get "hit" }
  x.report "get miss" { mc.get "miss" }
  x.report "set" { mc.set "hit", "value" }
  x.report "increment" { mc.increment "counter", 1, 0 }
end
Raw output

Before

Getting a million values
#<Benchmark::BM::Tms:0x10456d200
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=0.183867041,
 @stime=0.015734,
 @utime=0.167988>
242849552
  get hit  54.47k ( 18.36µs) (± 2.84%)  352B/op   1.01× slower
 get miss  55.14k ( 18.14µs) (± 2.37%)  320B/op        fastest
      set  53.84k ( 18.57µs) (± 2.39%)  336B/op   1.02× slower
increment  53.94k ( 18.54µs) (± 4.50%)  352B/op   1.02× slower

After

Getting a million values
#<Benchmark::BM::Tms:0x102af1280
 @cstime=0.0,
 @cutime=0.0,
 @label="",
 @real=0.1175465,
 @stime=0.011751000000000001,
 @utime=0.076792>
98847824
  get hit  54.35k ( 18.40µs) (± 2.21%)  192B/op        fastest
 get miss  54.12k ( 18.48µs) (± 2.89%)  160B/op   1.00× slower
      set  54.23k ( 18.44µs) (± 2.37%)  160B/op   1.00× slower
increment  52.97k ( 18.88µs) (± 2.64%)  160B/op   1.03× slower

The summary is that we save 50-55% on heap allocations per query: 50% per get miss (320 B -> 160 B), 55% per increment (352 B -> 160 B), with set right between the two (336 B -> 160 B). When fetching 1M keys at once with get_multi, this saves 52% CPU time (183ms -> 88ms) and reduces the amount of memory allocated by 60% (232MB -> 94MB).

Another (micro-)optimization used here is to replace the hash of opcodes with a NamedTuple. Indexing into a NamedTuple with a literal key is resolved at compile time, so there is no runtime lookup cost at all. I considered using an enum, which offers the same benefit but doesn't require digging into a data structure at all (we would be able to pass :get instead of OPCODES[:get]), but maybe that could come in a follow-up PR.
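As a sketch of the idea (the opcode values below follow the memcached binary protocol; the constant names are illustrative, not necessarily the ones in the diff):

```crystal
# A Hash lives on the heap and every lookup is a runtime operation.
OPCODE_HASH = {:get => 0x00_u8, :set => 0x01_u8}

# A NamedTuple is a value type, and indexing it with a literal symbol is
# resolved at compile time, so OPCODES[:get] compiles down to the constant.
OPCODES = {get: 0x00_u8, set: 0x01_u8}

OPCODES[:get] # => 0
```

The trade-off is that NamedTuple keys must be known at compile time, which is exactly the case for a fixed protocol opcode table.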


Also, in the Crystal ecosystem, it seems to be a common convention to log routine queries and messages at the debug level and let app operators opt into seeing those logs by setting the LOG_LEVEL env var.
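With Crystal's standard Log module, that convention might look like this (a sketch, not the actual code in this PR; the key value is made up):

```crystal
require "log"

# Log.setup_from_env reads the LOG_LEVEL environment variable (defaulting
# to Info), so routine per-query messages stay hidden unless the operator
# opts in with e.g. LOG_LEVEL=debug.
Log.setup_from_env

key = "session:123"
# The block form means the message string is only built when Debug is enabled.
Log.debug { "memcached get key=#{key}" }
```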

In hindsight, that could've (and probably should've) been a separate PR, but I did it in order to run the benchmark, too. So it's not entirely unrelated.

jgaskins avatar Aug 31 '23 16:08 jgaskins