crystal-memcached
Optimize by reducing heap allocations
This PR reduces the number of heap allocations performed in each request. For example, Array(UInt8).new(0) allocates an array on the heap, whereas Bytes.empty involves no heap allocation: the Slice itself lives on the stack and points at no heap buffer, because there is no need to allocate a buffer for zero bytes.
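As a minimal illustration of the difference (this is not code from the PR diff, just the two constructs side by side):

```crystal
# Array is a class, so Array(UInt8).new(0) allocates an Array object
# on the heap even though it will hold zero bytes.
heap_array = Array(UInt8).new(0)

# Bytes is an alias for Slice(UInt8). A Slice is a struct (a pointer
# plus a size), so it lives on the stack, and Bytes.empty points at
# no heap buffer at all.
stack_slice = Bytes.empty

puts heap_array.size  # => 0
puts stack_slice.size # => 0
```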
I wrote up a benchmark to show how much of a difference this made. It's difficult to benchmark because each query spends only a small fraction of its time actively using the CPU in comparison to the amount of time it spends waiting on I/O. At scale, though, with several requests all hitting the cache at once, this can matter a lot so I thought I'd try to optimize where I could.
Benchmark code
require "benchmark"
require "../src/memcached"
mc = Memcached::Client.new
puts
puts "Getting a million values"
keys = Array.new(1_000_000, &.to_s)
pp Benchmark.measure { mc.get_multi keys }
pp Benchmark.memory { mc.get_multi keys }
Benchmark.ips do |x|
x.report("get hit") { mc.get "hit" }
x.report("get miss") { mc.get "miss" }
x.report("set") { mc.set "hit", "value" }
x.report("increment") { mc.increment "counter", 1, 0 }
end
Raw output
Before
Getting a million values
#<Benchmark::BM::Tms:0x10456d200
@cstime=0.0,
@cutime=0.0,
@label="",
@real=0.183867041,
@stime=0.015734,
@utime=0.167988>
242849552
get hit 54.47k ( 18.36µs) (± 2.84%) 352B/op 1.01× slower
get miss 55.14k ( 18.14µs) (± 2.37%) 320B/op fastest
set 53.84k ( 18.57µs) (± 2.39%) 336B/op 1.02× slower
increment 53.94k ( 18.54µs) (± 4.50%) 352B/op 1.02× slower
After
Getting a million values
#<Benchmark::BM::Tms:0x102af1280
@cstime=0.0,
@cutime=0.0,
@label="",
@real=0.1175465,
@stime=0.011751000000000001,
@utime=0.076792>
98847824
get hit 54.35k ( 18.40µs) (± 2.21%) 192B/op fastest
get miss 54.12k ( 18.48µs) (± 2.89%) 160B/op 1.00× slower
set 54.23k ( 18.44µs) (± 2.37%) 160B/op 1.00× slower
increment 52.97k ( 18.88µs) (± 2.64%) 160B/op 1.03× slower
The summary is that heap allocations per query drop by 50-55%: 50% per get miss (320 bytes -> 160), 55% per increment (352 bytes -> 160), with set squarely in between (336 bytes -> 160). When fetching 1M keys at once with get_multi, this saves 52% of CPU time (183ms -> 88ms) and reduces the amount of memory allocated by 60% (232MB -> 94MB).
Another (micro-)optimization used here is to replace the hash of opcodes with a NamedTuple. Indexing into a NamedTuple happens at compile time so there is no runtime performance cost at all. I considered using an enum, which offers the same benefit but doesn't require digging into a data structure at all (we would be able to pass :get instead of OPCODES[:get]), but maybe that could come in a followup PR.
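A sketch of the two approaches described above (the opcode names and values here are illustrative, not the shard's actual table):

```crystal
# NamedTuple literal: keys and values are fixed at compile time, and
# indexing with a symbol literal is resolved by the compiler, so
# OPCODES[:get] has no runtime lookup cost and no heap allocation.
OPCODES = {
  get:       0x00_u8,
  set:       0x01_u8,
  increment: 0x05_u8,
}

# The enum alternative: callers could pass Opcode::Get directly
# instead of indexing into a data structure.
enum Opcode : UInt8
  Get       = 0x00
  Set       = 0x01
  Increment = 0x05
end

puts OPCODES[:get]     # => 0
puts Opcode::Get.value # => 0
```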
Also, in the Crystal ecosystem it seems to be a common convention to log routine queries and messages at the debug level and let app operators opt into seeing those logs by setting the LOG_LEVEL env var, so this PR follows that convention. In hindsight, that could've (and probably should've) been a separate PR, but I made the change in order to run the benchmark, so it's not entirely unrelated.
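For reference, that convention looks roughly like this with Crystal's standard Log module (the logger source name below is an assumption, not necessarily what this shard uses):

```crystal
require "log"

# Reads LOG_LEVEL from the environment; when it's unset, the default
# level is Info, so the debug line below is suppressed.
Log.setup_from_env

log = Log.for("memcached")

log.debug { "get key=hit" }                 # shown only with LOG_LEVEL=DEBUG
log.info { "connected to localhost:11211" } # shown by default
```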