Ask/Discuss: Adding telemetry to Rattler-build (and ideally Pixi, too)
Ahoy prefixes (prefices?). At NVIDIA, we have lots of builds over a wide matrix of configurations and lots of projects. We are working on getting observability set up for these builds, so that we can gauge where we need to spend our time optimizing them (switching to rattler-build is one thing on that list!).
We have been working with OpenTelemetry as the ecosystem for generating traces and metrics, because it feeds into our Ops team's existing observability tools. For Python, OpenTelemetry has a nice way of monkeypatching projects such that you don't need to add OpenTelemetry as a dependency of the project. The conda-build instrumentation that I've been working on is at https://github.com/open-telemetry/opentelemetry-python-contrib/compare/main...msarahan:opentelemetry-python-contrib:add-conda-build-instrumentation?expand=1#diff-ca9904b9ee2690d04b86fe194f52bd13d393c80b3d9dada18e3f5e16861404b3.
That gets us span data in Grafana that shows us the functions that have been monkeypatched.
I'm really interested in having this same kind of capability with rattler-build (and ideally with Pixi, too). I don't know if the monkeypatching approach is viable with Rust, though. Here's the ask: would you consider adding the opentelemetry crate as a dependency? Would you be OK with the extra noise of tracing code in your codebase? Would you want to put it behind an optional build flag so that it isn't available by default?
On the user side, I would expect this to be disabled by default, but it could be enabled by setting the standard OpenTelemetry environment variables. I would not add CLI flags for this. I might put something in rattler-build/pixi's configuration.
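To make the "off by default, on via environment variables" idea concrete, here is a minimal sketch using only the Rust standard library. The function names `otel_enabled` and `otel_enabled_from_env` are hypothetical and do not exist in rattler-build or pixi. The variable names come from the OpenTelemetry environment variable specification; the policy of switching on only when an OTLP endpoint is configured is an assumption made for illustration, not something the SDK mandates.

```rust
use std::env;

/// Decide whether trace export should be switched on.
/// The environment lookup is passed in as a closure so the logic can be
/// exercised without touching the real process environment.
pub fn otel_enabled<F>(lookup: F) -> bool
where
    F: Fn(&str) -> Option<String>,
{
    // OTEL_SDK_DISABLED=true (case-insensitive) turns the SDK off.
    let sdk_disabled = lookup("OTEL_SDK_DISABLED")
        .map(|v| v.trim().eq_ignore_ascii_case("true"))
        .unwrap_or(false);
    if sdk_disabled {
        return false;
    }

    // OTEL_TRACES_EXPORTER=none means no trace exporter is wanted.
    let exporter_none = lookup("OTEL_TRACES_EXPORTER")
        .map(|v| v.trim().eq_ignore_ascii_case("none"))
        .unwrap_or(false);
    if exporter_none {
        return false;
    }

    // Assumed policy: stay off unless the user configured an OTLP endpoint.
    ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", "OTEL_EXPORTER_OTLP_ENDPOINT"]
        .iter()
        .any(|&name| lookup(name).map_or(false, |v| !v.trim().is_empty()))
}

/// Convenience wrapper (hypothetical) that reads the real environment.
pub fn otel_enabled_from_env() -> bool {
    otel_enabled(|name| env::var(name).ok())
}
```

A tool could call `otel_enabled_from_env()` once at startup and only install an exporter when it returns `true`, so users who set nothing see no change in behavior.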
We are using the tracing crate throughout rattler, rattler-build and pixi, so enabling opentelemetry should be veeery straightforward using https://docs.rs/tracing-opentelemetry/latest/tracing_opentelemetry/. I'm not sure if we have the time to do it, but we would very much welcome contributions!
Yep, I also agree that this could be cool!
We will be sure to PR anything that we create. No idea on timeline, though. Hopefully soon.
As discussed on https://conda-forge.zulipchat.com/#narrow/channel/457337-general/topic/CI.20speedups
I love OTEL for providing a holistic, accurate view of distributed systems. From the above thread, conda-forge has anecdotally made big gains on the (somewhat) multi-provider continuous delivery chain, but being able to back things up with hard data in a standards-based format would be awesome. This usually means both the before and after state need instrumentation.
From a rattler* perspective: this would very much help in understanding where things are an issue at scale (like conda-forge). Small gains in some places might get multiplied by the number of outputs of a feedstock, and I've seen/created some real horror shows.
Having opentelemetry in a task runner like pixi would help provide the context that is missing when some CLI tools are traced, without having to be a Rust expert.
At the conda-forge level: it might be a bit harder to put that data someplace useful. Running a big agent somewhere to collect and analyze the traces seems... like just another thing to maintain. It may be possible to store all the traces in Azure/Travis/whatever and pull them back out with a conda smithy command, then visualize them locally with e.g. Jaeger (though I've had trouble keeping that building on conda-forge due to.... reasons).
@baszalmstra please reopen