
Benchmark CI

Open · pitag-ha opened this issue 2 years ago • 1 comment

The latency of Merlin queries depends on many different factors, such as:

  • The buffer it's run on; in particular, its size and typing complexity.
  • The location inside the buffer it's run on.
  • The dependency graph of the buffer.
  • Whether a PPX is applied and, if so, which one.
  • Merlin's cache state at the moment the query is run.
  • Which Merlin query is run.
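
To make these dimensions concrete, here is a minimal OCaml sketch of what a single benchmark sample has to pin down. The type and constructor names are illustrative, not merl-an's actual API:

```ocaml
(* Illustrative only, not merl-an's actual types: the dimensions a single
   benchmark sample has to fix. The dependency graph and any applied PPX
   follow from the choice of file/project. *)
type query = Type_enclosing | Locate | Occurrences | Complete_prefix
type cache_state = Cold | Warm

type sample = {
  file : string;        (* the buffer the query runs on *)
  position : int * int; (* line/column inside that buffer *)
  query : query;        (* which Merlin query is run *)
  cache : cache_state;  (* Merlin's cache state when the query runs *)
}
```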

So, for meaningful benchmark results, we need to run Merlin on a wide variety of input samples. We've written merl-an to generate such a sample set in a random but deterministic (i.e. reproducible) way. It provides a merl-an benchmark command, which persists the telemetry part of the Merlin response in the format expected by current-bench.
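
If I read current-bench's input convention correctly, that format is a JSON document printed to stdout, with a results array of named tests, each carrying a list of metrics. A minimal, stdlib-only sketch; the test and metric names are made up:

```ocaml
(* A sketch of the JSON shape that (to my understanding of current-bench's
   input convention) a benchmark run prints to stdout. Names are
   illustrative only. *)
let () =
  print_string
    {|{
  "results": [
    {
      "name": "type-enclosing / warm-cache",
      "metrics": [
        { "name": "latency", "value": 4.2, "units": "ms" }
      ]
    }
  ]
}
|}
```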

The next steps to get a Merlin benchmark CI up and running are:

  • [x] Finish the PoC for a current-bench CI on Merlin using merl-an. We were blocked on this for a while by a current-bench issue. Done: see the PoC graphs.
  • [x] Improve the separation into different benchmarks (in merl-an): I think that, with the current merl-an output, current-bench creates one graph per benchmarked file. That doesn't scale. Instead, there should be one graph per cache workflow and per query, or similar.
  • [x] Improve the Docker set-up: the whole benchmark set-up, such as installing merl-an and fetching the code base on which we run Merlin, should happen inside the container.
  • [ ] Filter out spikes (in merl-an): non-reproducible latency spikes (i.e. timings that exceed the expected timing by more than a factor of 10) mess up the scale of the current-bench graphs. See the sketch after this list.
  • [ ] Add a cold-cache workflow to the benchmarks: the numbers look so good at the moment because both the cmi caches and the typer cache are fully warmed for all queries. It would also be interesting to have benchmarks for when the caches are cold.
  • [ ] Improve the output UX: when a sample calls for attention, we'll want to know which location and query it corresponds to.
  • [ ] Lock the versions of the dependencies of the project on which we run Merlin: currently, we use Irmin as a code base to run the benchmarks on, and we install Irmin's dependencies via opam without locking their versions. If a dependency splits or merges modules, or the size of a module grows, the cmi- and cmt-files vary, which adds Merlin-independent noise to the benchmarks. To avoid that, we could vendor a fixed version of each dependency.
  • [ ] Find a more representative project input base: for now, Irmin is the only code base we run the benchmarks on.
  • [ ] Decide when to run the benchmarks: our CI will be very resource-heavy. current-bench supports running benchmarks only "on demand" (i.e. when the PR is tagged with a certain label).
  • [ ] Possibly: it might also be interesting to track the number of latency spikes; the spike filter sketched below already counts them.
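
A possible shape for the spike filter mentioned above, assuming "expected timing" means the median of the sample set; the function and parameter names are mine, not merl-an's:

```ocaml
(* Drop every timing that exceeds the median by more than [factor], and
   return the number of dropped samples so the spike count itself can be
   reported as a metric. Assumes a non-empty list of timings. *)
let filter_spikes ?(factor = 10.) timings =
  let median =
    let sorted = List.sort compare timings in
    List.nth sorted (List.length sorted / 2)
  in
  let kept, spikes = List.partition (fun t -> t <= factor *. median) timings in
  (kept, List.length spikes)

(* Example: the 48.0 ms outlier exceeds 10x the median and is dropped. *)
let () =
  let kept, n_spikes = filter_spikes [ 3.1; 2.9; 48.0; 3.0 ] in
  Printf.printf "kept %d samples, dropped %d spikes\n" (List.length kept) n_spikes
```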

pitag-ha · Jun 26 '23 12:06

@3Rafal, is there anything you'd add?

pitag-ha · Jun 26 '23 12:06