MLServer
System tracing provider for tracing via native probes (BPF)
This PR introduces a new tracing provider, enabling dynamic attachment of native (BPF, Systemtap) probes at runtime. The goal of the provider is to allow correlation between MLServer-specific events and operating system behaviour (system load, performance, resource usage, etc).
Approach
The provider exposes a number of tracepoints (hook functions without any code attached), which can be triggered (fired) when particular application-level events happen (e.g. a model gets loaded/unloaded, an inference request joins a queue, etc).
At runtime, external probes (BPF programs, Systemtap scripts) can be attached to those tracepoints to perform tracing actions (measurements, in-kernel data aggregation, correlating MLServer context with OS context, etc).
The underlying implementation creates and dynamically links a native shared library where the tracepoint hooks exist as functions containing just a couple of nop instructions. When external probes are attached, the code of those tracepoints is modified at runtime to jump into the tracing code.
Features
- Completely optional feature, with the ability to enable it via the `tracepoints` extra
- The exact tracepoints exposed for external probing are configurable via settings
- Near-zero overhead when not in use (no external probes attached). Because tracepoints look like normal functions from the perspective of Python code, and receive arguments which might need to be computed, the provider offers a way of computing the tracepoint arguments only when an external probe is attached to that particular tracepoint.
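
The conditional argument computation described in the last point can be illustrated with a minimal sketch. The `Tracepoint` class, its `is_enabled` flag and the `fire_lazily` helper are hypothetical names used only for illustration; they are not taken from this PR, and a real provider would derive the flag from whether a native probe is attached.

```python
from typing import Any, Callable, List, Tuple


class Tracepoint:
    """Hypothetical stand-in for a native tracepoint (not MLServer's API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Assumption: a real provider would report whether an external probe
        # (e.g. a BPF program) is attached; here it is a plain flag.
        self.is_enabled = False
        # Record fired argument tuples so the behaviour can be inspected.
        self.fired: List[Tuple[Any, ...]] = []

    def fire(self, *args: Any) -> None:
        self.fired.append(args)


def fire_lazily(
    tracepoint: Tracepoint, make_args: Callable[[], Tuple[Any, ...]]
) -> bool:
    """Compute the arguments and fire only if a probe is attached."""
    if not tracepoint.is_enabled:
        # Nothing is listening: skip the (possibly expensive) computation.
        return False
    tracepoint.fire(*make_args())
    return True
```

Passing a callable instead of ready-made values keeps the disabled path down to a single attribute check, which is one way to get close to the near-zero overhead mentioned above.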
Introduced dependencies
- When the `tracepoints` extra is enabled, MLServer will require additional dependencies, as follows:
  - stapsdt
  - libstapsdt (a native library, in turn requiring `libelf`)
- MLServer will continue to work as normal even if the extra is enabled but the dependencies are not met
- The Dockerfile has been updated to allow for container images in which system tracing dependencies are installed
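
The graceful fallback when dependencies are missing could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `create_provider` and `fire_if_possible` are hypothetical helpers, and the only library call used is `stapsdt.Provider(name)`, which is assumed to exist as shown in the `stapsdt` package's README.

```python
from typing import Any, Optional

try:
    # Optional dependency: the import may fail if the Python package or the
    # native libstapsdt/libelf libraries are not installed.
    import stapsdt
except (ImportError, OSError):
    stapsdt = None


def create_provider(name: str) -> Optional[Any]:
    """Return a tracing provider, or None when tracing is unavailable."""
    if stapsdt is None:
        return None
    # Assumption: the binding exposes `Provider(name)`.
    return stapsdt.Provider(name)


def fire_if_possible(probe: Optional[Any], *args: Any) -> bool:
    """Fire `probe` if one was created; otherwise do nothing."""
    if probe is None:
        return False
    return bool(probe.fire(*args))
```

With this shape, call sites never need to know whether tracing is available: a missing dependency turns every tracepoint into a no-op.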
TODOs
- [ ] Add instrumentation to the inference request path (including for batched requests)
- [ ] Stabilise tracepoint arguments
- [ ] Write documentation example for simple usage via `bpftrace`
- [ ] Move to "off-by-default" settings in settings/Dockerfile (left enabled for now, for testing)