Tracing support
This is mostly for the test suite, as the debugging experience is currently not great (especially when a problem chain involves dummy-central).
Rough idea:
- [x] ~~have a fixture generate a trace id (per test) if enabled~~ `pytest-opentelemetry` does that out of the box, although it also generates a trace for the entire test suite, which is kind of not great
- [ ] do magic so capsys & co output their stuff to tracing (?)
- [x] make `env` pass that to the `Environment`, update `Environment` to add the trace id as a header to xmlrpc (`ServerProxy(headers=...)` support was added in Python 3.8; see the xmlrpc sketch after this list)
- [ ] mangle logging to support tracing output (otel apparently has some integration with logging; see the logging sketch after this list)
- [ ] similarly with `make_repo` and the github session
- [x] update the mergebot/forwardbot to forward the trace id when it calls github (it can just get it from the request headers; see the propagation sketch after this list)
- [x] update DC to pick up that trace id and add it to its tracing context
- [ ] find a way to collate all the tracing data back into pytest's stream-capture machinery, so it's all displayed nice and tight in case of failure (see `CaptureManager`, possibly `terminal_reporter`; not entirely sure how to make pytest retrieve the tracing data, most likely all three processes need some config to sync on)
- [x] Hook conftest onto `--export-traces` instead of onto `pytest-opentelemetry` being enabled: `pytest-opentelemetry` doesn't actually do anything until you pass `--export-traces` (see the conftest sketch after this list).
- [ ] Get chrisguidry/pytest-opentelemetry#34 merged, because the trace-per-suite is really not great with hundreds of long-running tests, especially as both Venator and otel-desktop-viewer have really poor support for zooming onto spans (it should be improved in Venator 1.1).
- [ ] Maybe check with infra whether they'd be interested in a cleaned-up version of the odoo-otel module.
- [x] Overwrite the `http.` attributes in the trace propagation through git, as currently it copies what it receives, so on the dummy-central side we might get e.g. `http.target=/xmlrpc/2/object` and `http.user_agent=Python-xmlrpc`.
- [x] Create a span (and maybe set an attribute) with the git op name: on the git repo side it's just calling one of `info/refs`, `git-upload-pack`, `git-receive-pack`, or `objects/{s}/{ha}`, but we don't really know why it's calling any of those at the moment, that has to be inferred from the surrounding logs / spans (see the git sketch after this list).
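For the `env` / `Environment` item, here is a minimal sketch of attaching the current trace context to xmlrpc calls, assuming the default W3C propagator configured by the OpenTelemetry SDK; `make_proxy` is a made-up helper, not the test suite's actual API:

```python
# Serialize the active otel context into headers and hand them to
# ServerProxy, whose `headers` parameter was added in Python 3.8.
import xmlrpc.client

from opentelemetry.propagate import inject


def make_proxy(url: str) -> xmlrpc.client.ServerProxy:
    carrier: dict[str, str] = {}
    inject(carrier)  # fills in `traceparent` (and possibly `baggage`)
    # Note: this captures the context at construction time; per-call
    # propagation would need a custom Transport instead.
    return xmlrpc.client.ServerProxy(url, headers=list(carrier.items()))
```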
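For the logging item, the otel integration presumably meant is the `opentelemetry-instrumentation-logging` package, which injects the current trace/span ids into stdlib log records; a sketch, assuming that is indeed the route taken:

```python
# LoggingInstrumentor patches stdlib logging so records carry
# otelTraceID / otelSpanID attributes, and can optionally rewrite the
# root logger's format to print them.
import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

LoggingInstrumentor().instrument(set_logging_format=True)
logging.getLogger(__name__).info("this line can now be correlated with the active span")
```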
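The github-session, mergebot-forwarding, and DC items all boil down to the same extract-then-reinject propagation pattern; a rough sketch with made-up names (`forward_to_github`, `incoming_headers`), not the actual mergebot code:

```python
# Extract the caller's trace context from the incoming request headers,
# open a span under it, and re-inject the context on the outgoing call
# so the next hop (github / dummy-central) continues the same trace.
import requests

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)


def forward_to_github(incoming_headers: dict[str, str], session: requests.Session, url: str):
    ctx = extract(incoming_headers)  # reads `traceparent` et al.
    with tracer.start_as_current_span("github-call", context=ctx):
        outgoing: dict[str, str] = {}
        inject(outgoing)
        return session.get(url, headers=outgoing)
```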
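For the conftest item, a sketch of gating the extra plumbing on the `--export-traces` flag that `pytest-opentelemetry` registers (the setup helper is hypothetical):

```python
# conftest.py: only set up the project-specific tracing plumbing when
# traces are actually being exported, rather than whenever the plugin
# happens to be installed.
import pytest


def pytest_configure(config: pytest.Config) -> None:
    # getoption() accepts the literal flag name; default to False in
    # case pytest-opentelemetry isn't installed at all.
    if not config.getoption("--export-traces", default=False):
        return
    _setup_tracing(config)  # hypothetical project-specific setup


def _setup_tracing(config: pytest.Config) -> None:
    ...
```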
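And for the git-op-name item, a sketch of wrapping git invocations in a named span so the smart-HTTP requests DC sees (`info/refs`, `git-upload-pack`, ...) get an explanatory parent; the `git` helper is illustrative, not the suite's actual wrapper:

```python
# Wrap each git invocation in a span named after the high-level
# operation and record the full command as an attribute, so the raw
# smart-HTTP calls on the repo side can be tied back to a push/fetch.
import subprocess

from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def git(repo_path: str, *args: str) -> subprocess.CompletedProcess:
    op = args[0] if args else "git"
    with tracer.start_as_current_span(f"git {op}") as span:
        span.set_attribute("git.command", " ".join(args))
        return subprocess.run(
            ["git", "-C", repo_path, *args],
            capture_output=True, text=True, check=True,
        )
```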
Technically not quite done but I don't think the conftest stuff can be done (or is useful, possibly), and I don't think it's worth waiting for the more external actions.
Note: enabling tracing seems to have relatively little effect on runtime when running the test suite locally:
- with pytest-opentelemetry disabled, 4751.33s user 802.19s system 497% cpu 18:36.41 total
- with tracing disabled, 5699.38s user 927.39s system 540% cpu 20:25.90 total
- with tracing enabled, 6568.21s user 1141.04s system 599% cpu 21:25.13 total
Note that the various CPU times are not ultra representative as they don't include dummy-central and postgres, and may not include the Odoo child process either (not really sure). The point is mostly that the increase in wallclock is pretty reasonable: the ~20min increase in user+system from opentelemetry overhead is entirely absorbed by an increase in CPU load, resulting in just a 5% increase in wallclock. Now obviously that could translate to the system being less responsive / closer to the edge, which is not great, but given the increase in debugging comfort of being able to trace what the hell is happening, in the correct order, I think it's pretty cool.
It would be useful to track down why running with tracing disabled incurs more overhead (relative to disabling pytest-opentelemetry entirely) than enabling tracing adds on top, though.