System tracing provider for tracing via native probes (BPF)

Open • lc525 opened this issue 2 years ago • 1 comment

This PR introduces a new tracing provider that enables dynamic attachment of native (BPF, SystemTap) probes at runtime. The goal of the provider is to allow correlation between MLServer-specific events and operating system behaviour (system load, performance, resource usage, etc.).

Approach

The provider exposes a number of tracepoints (hook functions without any code attached), which can be triggered (fired) when particular application-level events happen (e.g. a model gets loaded/unloaded, an inference request joins a queue, etc).
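As a rough illustration of the idea (not the PR's actual code), a tracepoint can be modelled in pure Python as a named hook that does nothing by itself and is fired by application code when an event occurs. The `Tracepoint` class, the `model_load` and `model_unload` names, and the `load_model` helper are all hypothetical, and a Python callable stands in for an external probe.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Tracepoint:
    """Hypothetical stand-in for a native tracepoint hook."""

    name: str
    # Probes attached at runtime; an empty list means firing does nothing.
    probes: List[Callable[..., None]] = field(default_factory=list)

    def fire(self, *args: Any) -> None:
        # With no probes attached, this loop body never runs.
        for probe in self.probes:
            probe(*args)


# Hypothetical tracepoints for two application-level events.
model_load = Tracepoint("model_load")
model_unload = Tracepoint("model_unload")


def load_model(name: str) -> str:
    """Hypothetical application code that fires a tracepoint on an event."""
    model_load.fire(name)
    return f"{name} loaded"


# "Attach" a probe at runtime: here simply a callable that records the event.
events: List[str] = []
model_load.probes.append(lambda name: events.append(f"load:{name}"))

load_model("iris")          # recorded, because a probe is attached
model_unload.fire("iris")   # ignored, because nothing is attached
```

In the real provider the hook lives in native code and the probe is attached from outside the process, so the Python list of callables here is only a model of that behaviour.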

At runtime, external probes (BPF programs, SystemTap scripts) can be attached to those tracepoints to perform tracing actions (measurements, in-kernel data aggregation, correlating MLServer context with OS context, etc.).
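To make the attachment step more concrete, the sketch below builds a `bpftrace` command line that counts how often one tracepoint fires. It assumes the tracepoints are reachable as USDT probes, which this description does not state explicitly; the library path, the `mlserver` provider name, and the `model_load` probe name are placeholders rather than the PR's real identifiers.

```python
from typing import List


def bpftrace_command(pid: int, library_path: str, provider: str, probe: str) -> List[str]:
    """Build a bpftrace invocation that counts hits on one USDT probe.

    All argument values are supplied by the caller; nothing here is taken
    from the PR itself.
    """
    # bpftrace USDT probe syntax: usdt:<binary or library path>:<provider>:<probe>
    program = (
        f"usdt:{library_path}:{provider}:{probe} "
        "{ @hits[comm] = count(); }"
    )
    # -p selects the target process, -e passes the program inline.
    return ["bpftrace", "-p", str(pid), "-e", program]


# Placeholder values, for illustration only.
cmd = bpftrace_command(1234, "/path/to/tracepoints.so", "mlserver", "model_load")
```

The resulting list could be passed to `subprocess.run` on a host where `bpftrace` is installed and the caller has sufficient privileges.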

The underlying implementation creates and dynamically links a native shared library where the tracepoint hooks exist as functions containing just a couple of nop instructions. When external probes are attached, the code of those tracepoints is modified at runtime to jump into the tracing code.

Features

  • Completely optional feature, with the ability to enable it via the tracepoints extra
  • The exact tracepoints exposed for external probing are configurable via settings
  • Near-zero overhead when not in use (no external probes attached). Because tracepoints look like normal functions from the perspective of Python code, and receive arguments that might need to be computed, the provider offers a way to compute the tracepoint arguments only when an external probe is attached to that particular tracepoint.
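The conditional argument computation described in the last bullet can be sketched as follows. This is a minimal model under stated assumptions, not the provider's API: `LazyTracepoint`, its `enabled` flag, and `expensive_args` are hypothetical names, and `enabled` stands in for whatever mechanism reports that an external probe is attached.

```python
from typing import Any, Callable, List, Tuple


class LazyTracepoint:
    """Hypothetical tracepoint that defers argument computation."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Would reflect whether an external probe is currently attached.
        self.enabled = False
        # Stands in for the native hook receiving the arguments.
        self.fired: List[Tuple[Any, ...]] = []

    def fire(self, make_args: Callable[[], Tuple[Any, ...]]) -> None:
        # Skip the (possibly costly) argument computation when nobody listens.
        if not self.enabled:
            return
        self.fired.append(make_args())


calls = 0


def expensive_args() -> Tuple[Any, ...]:
    """Hypothetical costly computation of tracepoint arguments."""
    global calls
    calls += 1
    return ("iris", 3)


tp = LazyTracepoint("inference_enqueue")
tp.fire(expensive_args)        # no probe attached: expensive_args is not called
calls_while_disabled = calls
tp.enabled = True              # simulate an external probe being attached
tp.fire(expensive_args)        # now the arguments are computed and delivered
```

Passing a callable instead of precomputed values keeps the disabled path down to a single flag check, which is the property the bullet describes.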

Introduced dependencies

  • When the tracepoints extra is enabled, MLServer will require additional dependencies, as follows:
  • MLServer will continue to work as normal even if the extra is enabled but the dependencies are not met
  • The Dockerfile has been updated to allow for container images in which system tracing dependencies are installed

TODOs

  • [ ] Add instrumentation to the inference request path (including for batched requests)
  • [ ] Stabilise tracepoint arguments
  • [ ] Write documentation example for simple usage via bpftrace
  • [ ] Move to "off-by-default" settings in settings/Dockerfile (currently left enabled for testing)

lc525, Jul 07 '23 10:07

CLA assistant check
All committers have signed the CLA.

CLAassistant, May 22 '24 17:05