
Benchmark CI

Open · pitag-ha opened this issue 2 years ago • 1 comment

The latency of Merlin queries depends on many different factors, such as:

  • The buffer it's run on; in particular, its size and typing complexity.
  • The location inside the buffer it's run on.
  • The dependency graph of the buffer.
  • Whether a PPX is applied and, if so, which one.
  • Merlin's cache state at the moment the query is run.
  • Which Merlin query is run.
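
To make these dimensions concrete, here is a minimal OCaml sketch of what a single benchmark sample has to pin down. The type and constructor names are illustrative, not merl-an's actual API:

```ocaml
(* Illustrative only, not merl-an's actual types: the dimensions a single
   benchmark sample has to fix. The dependency graph and any applied PPX
   follow from the choice of file/project. *)
type query = Type_enclosing | Locate | Occurrences | Complete_prefix
type cache_state = Cold | Warm

type sample = {
  file : string;        (* the buffer the query runs on *)
  position : int * int; (* line/column inside that buffer *)
  query : query;        (* which Merlin query is run *)
  cache : cache_state;  (* Merlin's cache state when the query runs *)
}
```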

So, for meaningful benchmark results, we need to run Merlin on a wide variety of input samples. We've written merl-an to generate such a sample set in a random but deterministic (i.e. reproducible) way. It provides a merl-an benchmark command, which persists the telemetry part of the Merlin response in the format expected by current-bench.
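
If I read current-bench's input convention correctly, that format is a JSON document printed to stdout, with a results array of named tests, each carrying a list of metrics. A minimal, stdlib-only sketch; the test and metric names are made up:

```ocaml
(* A sketch of the JSON shape that (to my understanding of current-bench's
   input convention) a benchmark run prints to stdout. Names are
   illustrative only. *)
let () =
  print_string
    {|{
  "results": [
    {
      "name": "type-enclosing / warm-cache",
      "metrics": [
        { "name": "latency", "value": 4.2, "units": "ms" }
      ]
    }
  ]
}
|}
```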

The next steps to get a Merlin benchmark CI up and running are:

  • [x] Finish the PoC for a current-bench CI on Merlin using merl-an. We were blocked on this for a while by a current-bench issue. Done: see the PoC graphs.
  • [x] Improve the separation into different benchmarks (in merl-an): I think that, with the current merl-an output, current-bench creates one graph per benchmarked file. That doesn't scale. Instead, there should be one graph per cache workflow and per query, or similar.
  • [x] Improve the Docker set-up: the whole benchmark set-up, such as installing merl-an and fetching the code base on which we run Merlin, should happen inside the container.
  • [ ] Filter out spikes (in merl-an): non-reproducible latency spikes (i.e. timings that exceed the expected timing by more than a factor of 10) mess up the scale of the current-bench graphs. See the sketch after this list.
  • [ ] Add a cold-cache workflow to the benchmarks: the numbers look so good at the moment because both the cmi caches and the typer cache are fully warmed for all queries. It would also be interesting to have benchmarks for when the caches are cold.
  • [ ] Improve the output UX: when a sample calls for attention, we'll want to know which location and query it corresponds to.
  • [ ] Lock the versions of the dependencies of the project on which we run Merlin: currently, we use Irmin as a code base to run the benchmarks on, and we install Irmin's dependencies via opam without locking their versions. If a dependency splits or merges modules, or the size of a module grows, the cmi- and cmt-files vary, which adds Merlin-independent noise to the benchmarks. To avoid that, we could vendor a fixed version of each dependency.
  • [ ] Find a more representative project input base: for now, Irmin is the only code base we run the benchmarks on.
  • [ ] Decide when to run the benchmarks: our CI will be very resource-heavy. current-bench supports running benchmarks only "on demand" (i.e. when the PR is tagged with a certain label).
  • [ ] Possibly: it might also be interesting to track the number of latency spikes; the spike filter sketched below already counts them.
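
A possible shape for the spike filter mentioned above, assuming "expected timing" means the median of the sample set; the function and parameter names are mine, not merl-an's:

```ocaml
(* Drop every timing that exceeds the median by more than [factor], and
   return the number of dropped samples so the spike count itself can be
   reported as a metric. Assumes a non-empty list of timings. *)
let filter_spikes ?(factor = 10.) timings =
  let median =
    let sorted = List.sort compare timings in
    List.nth sorted (List.length sorted / 2)
  in
  let kept, spikes = List.partition (fun t -> t <= factor *. median) timings in
  (kept, List.length spikes)

(* Example: the 48.0 ms outlier exceeds 10x the median and is dropped. *)
let () =
  let kept, n_spikes = filter_spikes [ 3.1; 2.9; 48.0; 3.0 ] in
  Printf.printf "kept %d samples, dropped %d spikes\n" (List.length kept) n_spikes
```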

pitag-ha · Jun 26 '23 12:06

@3Rafal, is there anything you'd add?

pitag-ha · Jun 26 '23 12:06