# Benchmark CI
The latency of Merlin queries depends on many different factors, such as:
- The buffer it's run on as a whole; in particular, its size and typing complexity.
- The location inside the buffer at which it's run.
- The dependency graph of the buffer.
- Whether a PPX is applied, and if so, which one.
- Merlin's cache state at the moment the query is run.
- Which Merlin query is run.
So for meaningful benchmark results, we need to run Merlin on a wide variety of input samples. We've written `merl-an` to generate such a sample set in a random but deterministic way. It provides a `merl-an benchmark` command, which persists the telemetry part of the Merlin response in the format expected by `current-bench`.
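For orientation, the data `merl-an benchmark` emits for `current-bench` is JSON roughly of the shape below. This is only a sketch: the exact schema is owned by `current-bench`, and the benchmark and metric names used here (`merlin-benchmarks`, `warm-cache/case-analysis`, `clock`) are hypothetical placeholders.

```json
{
  "name": "merlin-benchmarks",
  "results": [
    {
      "name": "warm-cache/case-analysis",
      "metrics": [
        { "name": "clock", "value": 12.3, "units": "ms", "description": "query latency" }
      ]
    }
  ]
}
```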
The next steps to get a Merlin benchmark CI up and running are:
- [x] Finish the PoC for a `current-bench` CI on Merlin using `merl-an`. We were blocked on this by a `current-bench` issue for a while. Done: see the PoC graphs.
- [x] Improve the separation into different benchmarks (in `merl-an`): With the current `merl-an` output, `current-bench` would create a different graph for each file being benchmarked. That doesn't scale. Instead: one graph per cache workflow and per query, or similar.
- [x] Improve the Docker set-up: The whole benchmark set-up, such as installing `merl-an` and fetching the code base on which we run Merlin, should be done inside the container.
- [ ] Filter out spikes (in `merl-an`): Non-reproducible latency spikes (i.e. timings that exceed the expected timing by more than a factor of 10) mess up the scale of the `current-bench` graphs. A sketch of one possible filter follows this list.
- [ ] Add a cold-cache workflow to the benchmarks: The numbers look so good at the moment because both the cmi cache and the typer cache are fully warm for all queries. Additionally, it would be interesting to have benchmarks for when the caches are cold.
- [ ] Improve the output UX: When some samples call attention, we'll want to know which location and query they correspond to.
- [ ] Lock the versions of the dependencies of the project on which we run Merlin: Currently, we use Irmin as the code base to run the benchmarks on. We install Irmin's dependencies via `opam` without locking their versions. If a dependency splits or merges modules, or increases the size of a module, the cmi and cmt files will vary. That adds Merlin-independent noise to the benchmarks. To avoid that, we could vendor a fixed version of each dependency.
- [ ] Find a more significant project input base: For now, we only use Irmin as the code base to run the benchmarks on.
- [ ] Decide when to run the benchmarks: Our CI will be very resource-heavy. `current-bench` supports running the benchmarks only "on demand" (i.e. when the PR is tagged with a certain flag).
- [ ] Possibly: It might also be interesting to track the number of latency spikes.
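For the spike-filtering item above, here is a minimal OCaml sketch of the idea, not `merl-an`'s actual implementation: it assumes a spike is any timing exceeding the median of its sample group by more than a factor of 10, drops those timings, and counts them so that the spike count could also be reported as its own metric.

```ocaml
(* Illustrative sketch only; the 10x-over-median definition of a spike and
   the [partition_spikes] helper are assumptions, not merl-an code. *)

(* Median of a non-empty list of timings (in ms). *)
let median timings =
  let sorted = List.sort compare timings in
  List.nth sorted (List.length sorted / 2)

(* Split timings into regular measurements and spikes. The number of
   spikes could itself be tracked as an additional metric. *)
let partition_spikes ?(factor = 10.) timings =
  let m = median timings in
  List.partition (fun t -> t <= factor *. m) timings

let () =
  let timings = [ 12.3; 11.8; 250.0; 12.1; 11.9 ] in
  let regular, spikes = partition_spikes timings in
  Printf.printf "kept %d samples, dropped %d spikes\n"
    (List.length regular) (List.length spikes)
```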
@3Rafal, is there anything you'd add?