
Use `persistent_term` for the feature flag registry

Open · the-mikedavis opened this pull request · 5 comments

This is an experiment in using persistent_term to hold the feature flag registry. rabbit_ff_registry is currently implemented as a module that is dynamically regenerated with erl_syntax, compile:forms/2 and code reloading. The module approach has the advantage of maximizing read speed, but it has a few drawbacks:

  • We need workarounds to convince the dialyzer of the real typespecs of rabbit_ff_registry functions.
  • We need to be careful to not deadlock around the code server.
  • Regenerating the registry can be relatively slow: around 35–40 ms per regeneration for me locally. With this patch I see total savings of ~800 ms during a regular single-node boot (bazel run broker) and ~200 ms when running rabbitmqctl enable_feature_flag khepri_db, for example. (A minimal sketch of the persistent_term read path follows this list.)
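
For context, here is a minimal sketch of what a persistent_term-backed read path could look like. The module name, the persistent_term key and the map-of-booleans shape are illustrative only, not necessarily what this branch uses:

```erlang
%% Minimal sketch of a persistent_term-backed registry read path.
%% The key and the map-of-booleans shape are illustrative only.
-module(rabbit_ff_registry_sketch).
-export([store/1, is_enabled/1]).

-define(PT_KEY, {rabbit_feature_flags, registry}).

%% Writes are rare and relatively expensive: updating or erasing a
%% persistent term scans the heaps of all processes.
-spec store(#{atom() => boolean()}) -> ok.
store(Registry) when is_map(Registry) ->
    persistent_term:put(?PT_KEY, Registry).

%% Reads are cheap, avoid the code server entirely, and carry an
%% ordinary typespec that the dialyzer can check without workarounds.
-spec is_enabled(atom()) -> boolean().
is_enabled(FlagName) ->
    Registry = persistent_term:get(?PT_KEY, #{}),
    maps:get(FlagName, Registry, false).
```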

The question is whether persistent_term will be fast enough. Running the usual one_fast perf-test flag setup, I didn't see a difference between this branch and main, but we'll want to run this through the full performance test suite before considering it.

This branch also includes a fix for a concurrency issue in rabbit_feature_flags:inject_test_feature_flags/2 that can cause the feature_flags_SUITE:registry_concurrent_reloads/1 case to misbehave.

the-mikedavis avatar Apr 12 '24 15:04 the-mikedavis

very creative!

lin72h avatar Apr 12 '24 22:04 lin72h

I took some microbenchmarks of the two approaches for the rabbit_ff_registry:is_enabled/1 use case: calling into a generated module (albeit an Elixir module in the test) versus reading from a :persistent_term:

bench.exs.txt

(which can be run with mv bench.exs.txt bench.exs and elixir bench.exs)
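
A rough Erlang-only stand-in for what the script measures looks roughly like this. It is not the attached Benchee suite, the names are made up, and the fun indirection adds some overhead of its own, so absolute numbers won't match the Benchee output below:

```erlang
%% ff_read_bench.erl -- rough Erlang stand-in for bench.exs (names made up).
-module(ff_read_bench).
-export([run/0]).

-define(ITERATIONS, 10000000).

%% Stand-in for the code-generated is_enabled/1 clause.
static_is_enabled(quorum_queue) -> true;
static_is_enabled(_) -> false.

%% The persistent_term read path under test.
pterm_is_enabled(FlagName) ->
    Registry = persistent_term:get({?MODULE, registry}),
    maps:get(FlagName, Registry, false).

run() ->
    persistent_term:put({?MODULE, registry}, #{quorum_queue => true}),
    {UsStatic, ok} = timer:tc(fun() -> loop(fun static_is_enabled/1, ?ITERATIONS) end),
    {UsPTerm, ok} = timer:tc(fun() -> loop(fun pterm_is_enabled/1, ?ITERATIONS) end),
    io:format("generated-style clause: ~.1f ns/read~n", [UsStatic * 1000 / ?ITERATIONS]),
    io:format("persistent_term:        ~.1f ns/read~n", [UsPTerm * 1000 / ?ITERATIONS]).

loop(_Fun, 0) -> ok;
loop(Fun, N) ->
    _ = Fun(quorum_queue),
    loop(Fun, N - 1).
```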

Benchee results...
Operating System: Linux
CPU Information: AMD Ryzen Threadripper 3970X 32-Core Processor
Number of Available Cores: 64
Available memory: 125.64 GB
Elixir 1.16.2
Erlang 26.2.4
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 1 s
time: 10 s
memory time: 2 s
reduction time: 2 s
parallel: 1
inputs: none specified
Estimated total run time: 30 s

Benchmarking dynamic module ...
Benchmarking persistent_term ...
Calculating statistics...
Formatting results...

Name                      ips        average  deviation         median         99th %
dynamic module        30.31 M       32.99 ns  ±7714.91%          30 ns          41 ns
persistent_term       17.67 M       56.58 ns  ±4861.17%          51 ns          70 ns

Comparison: 
dynamic module        30.31 M
persistent_term       17.67 M - 1.71x slower +23.59 ns

Memory usage statistics:

Name               Memory usage
dynamic module              0 B
persistent_term             0 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

Reduction count statistics:

Name            Reduction count
dynamic module                2
persistent_term               2 - 1.00x reduction count +0

**All measurements for reduction count were the same**

Reading from persistent_term is ~1.7x slower than calling is_enabled/1 on a generated module built from the current set of feature flags. This is better than I expected: persistent_term is certainly slower, but not by a very wide margin, since both operations are so cheap.

the-mikedavis avatar Apr 16 '24 21:04 the-mikedavis

I also ran this branch against the performance tests vs. main: https://grafana.lionhead.rabbitmq.com/d/1OSIKB4Vk/rabbitmq-performance-tests?orgId=1&from=1713189778195&to=1713195192973&shareView=link

At a glance the two branches are hard to tell apart. The persistent_term approach may be slower, but we may not read feature flags often enough for it to be noticeable. @mkuratczyk, could I bug you to take a look and weigh in on these results when you get a chance?

the-mikedavis avatar Apr 16 '24 21:04 the-mikedavis

Yeah, I can't think of code paths where this could slow down normal operations. Small differences between environments are unfortunately always present in these tests (there are multiple reasons for this, including intentional randomization in a few places, both in perf-test and on the broker side). I'm fairly sure the observed differences are due to such entropy, not to this PR.

mkuratczyk avatar Apr 22 '24 17:04 mkuratczyk

On an M1 machine with 10 cores and a fast SSD, this shaves some 9% off node-reported startup time 👏 on top of the improvements in #10989. This should nicely benefit all future CI runs where clusters are formed and nodes are restarted many times.

michaelklishin avatar Apr 23 '24 19:04 michaelklishin

Force-push was just a rebase

the-mikedavis avatar Jun 05 '24 15:06 the-mikedavis