
Tracing support

xmo-odoo opened this issue Jan 18 '24 · 2 comments

This is mostly for the test suite, as the debugging experience is currently not great (especially when a problem chain involves dummy-central).

Rough idea:

  • [x] ~~have a fixture generate a trace id (per test) if enabled~~ pytest-opentelemetry does that out of the box, although it also generates a trace for the entire test suite, which is kinda not great
  • [ ] do magic so capsys & co output their stuff to tracing (?)
  • [x] make env pass that to the Environment, and update Environment to add the trace id as a header to xmlrpc calls (ServerProxy gained headers support in Python 3.8; see the ServerProxy sketch after this list)
  • [ ] mangle logging to support tracing output (otel apparently has some integration with logging; see the logging sketch after this list)
  • [ ] similarly with make_repo and the github session (the propagation sketch after this list covers this and the two items below)
  • [x] update the mergebot/forwardbot to forward the trace id when it calls github (can just get it from the request headers)
  • [x] update DC to take up that trace id and add it to its tracing context
  • [ ] find a way to collate all the tracing stuff back into pytest's stream capture machinery, so it's all displayed nice and tight in case of failure (see CaptureManager, possibly terminal_reporter; not entirely sure how to make pytest retrieve the tracing data, most likely all three processes need some config to sync on)
  • [x] Hook conftest onto --export-traces instead of onto pytest-opentelemetry being enabled: pytest-opentelemetry doesn't actually do anything until you pass --export-traces (see the conftest sketch after this list).
  • [ ] Get chrisguidry/pytest-opentelemetry#34 merged, because the trace-per-suite is really not great with hundreds of long-running tests, especially as both Venator and otel-desktop-viewer have really poor support for zooming in on spans (it should be improved in Venator 1.1).
  • [ ] Maybe check with infra whether they'd be interested in a cleaned-up version of the odoo-otel module.
  • [x] Overwrite the http. attributes in the trace propagation through git: currently it copies what it receives, so on the dummy-central side we might get e.g. http.target=/xmlrpc/2/object and http.user_agent=Python-xmlrpc.
  • [x] Create a span (and maybe set an attribute) with the git op name: on the git repo side it's just calling one of info/refs, git-upload-pack, git-receive-pack, or objects/{s}/{ha}, but we don't really know why it's calling any of those at the moment; that has to be inferred from the surrounding logs / spans (see the git span sketch after this list).
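
For the xmlrpc item, a minimal sketch of what that could look like (not the actual runbot code; traced_proxy and uri are made-up names). ServerProxy grew a headers= parameter in Python 3.8, so the active span's context can be attached to every call the proxy makes:

```python
# Sketch only: `traced_proxy` is a hypothetical helper, not runbot's Environment.
import xmlrpc.client

from opentelemetry.propagate import inject

def traced_proxy(uri):
    carrier = {}
    inject(carrier)  # fills in W3C traceparent / tracestate for the active span
    return xmlrpc.client.ServerProxy(uri, headers=list(carrier.items()))
```

Since the headers are captured when the proxy is built, the proxy has to be created after the per-test span exists, which is presumably why env needs to pass the id down to the Environment in the first place.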
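
For the logging item, one possible approach (an assumption, not necessarily what ends up being used here) is opentelemetry-instrumentation-logging, which stamps the active trace and span ids onto stdlib log records:

```python
# Sketch: LoggingInstrumentor adds otelTraceID / otelSpanID to every log record;
# set_logging_format=True also rewrites the root handler's format to print them.
import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

LoggingInstrumentor().instrument(set_logging_format=True)
logging.getLogger(__name__).info("this line can now be correlated with the active span")
```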
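
The make_repo / github session / dummy-central items are all the same propagation pattern: inject the current context into outgoing headers, and extract it on the receiving side before opening a span. A minimal sketch with stand-in names (session and handle_request are not the actual runbot objects):

```python
# Sketch of W3C trace context propagation between the processes involved.
import requests

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

# caller side (test suite / mergebot): outgoing headers carry the active context
session = requests.Session()
inject(session.headers)

# callee side (e.g. dummy-central): parent new spans on the incoming context so
# the whole chain shows up as a single trace
def handle_request(headers):
    ctx = extract(headers)
    with tracer.start_as_current_span("handle", context=ctx):
        ...  # actual request handling
```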
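
For the --export-traces hook, a conftest.py sketch (assuming pytest-opentelemetry is installed, since it is the plugin that registers the flag):

```python
# conftest.py sketch: only wire up the suite's own tracing plumbing when
# --export-traces was passed, since pytest-opentelemetry is a no-op without it.
def pytest_configure(config):
    if not config.getoption("--export-traces", default=False):
        return
    # tracing-specific setup would go here: propagating the trace context into
    # the Environment, the github session, dummy-central, ...
```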
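
Finally, for the git op items, a sketch of wrapping each git operation driven by the test helpers in a named span that sets its own attributes (rather than relying on the copied http. values); run_git and the git.operation attribute name are made up:

```python
# Sketch: name the span after the high-level git operation so the raw
# smart-HTTP calls (info/refs, git-upload-pack, ...) get an explaining parent.
import subprocess

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def run_git(repo_path, op_name, *args):
    with tracer.start_as_current_span(f"git.{op_name}") as span:
        span.set_attribute("git.operation", op_name)
        subprocess.run(["git", "-C", repo_path, op_name, *args], check=True)
```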

xmo-odoo avatar Jan 18 '24 13:01 xmo-odoo

Technically not quite done, but I don't think the conftest stuff can be done (or would be useful, possibly), and I don't think it's worth waiting for the more external actions.

xmo-odoo avatar Feb 24 '25 14:02 xmo-odoo

Note: enabling tracing seems to have relatively little effect on runtime when running the test suite locally:

  • with pytest-opentelemetry disabled, 4751.33s user 802.19s system 497% cpu 18:36.41 total
  • with tracing disabled, 5699.38s user 927.39s system 540% cpu 20:25.90 total
  • with tracing enabled, 6568.21s user 1141.04s system 599% cpu 21:25.13 total

Note that the various CPU times are not ultra representative, as they don't include dummy-central and postgres, and may not include the Odoo child process either (not really sure). The point is mostly that the increase in wallclock is pretty reasonable: the ~20mn increase in user+system from opentelemetry overhead (from about 6627s with tracing disabled to about 7709s with it enabled) is entirely absorbed by an increase in CPU load, resulting in just a ~5% increase in wallclock (20:25.90 to 21:25.13). Obviously that could translate to the system being less responsive / closer to the edge, which is not great, but given the increase in debugging comfort of being able to trace what the hell is happening, with the correct ordering, I think it's pretty cool.

It would be useful to track down why running with tracing disabled incurs more overhead (relative to disabling pytest-opentelemetry entirely) than enabling tracing adds on top of that, though.

xmo-odoo avatar Feb 25 '25 08:02 xmo-odoo