`ops[tracing]` follow up
Listing all the FIXMEs I'm removing from #1612
- [x] use in machine charms -- Dima needs to read up on this
- example charm using cos-agent instead of direct relations with COS: https://github.com/canonical/kafka-operator/blob/main/src/charm.py#L62
- at first glance: that's metrics and logs, but not tracing
- [x] update PYPI trusted publishing config
- pypi.org
- test.pypi.org
- [ ] jhack support
- [x] charm name from metadata
- [ ] rewrite the vendored cert lib in pure Python
- [ ] rewrite the vendored tracing lib in pure Python
- [ ] drop pydantic dependency
- [ ] update the docs wrt. the above
- [ ] clean up / fix bugs in the tracing charm lib (see below)
- [x] docs: document the low-level API, `ops.tracing.set_destination()`
- [ ] docs: document that the URL is the full URL, with the URL path
- [ ] docs: document that (url, ca) are saved in the database and should be updated when they change upstream
- [ ] docs: what depending on `ops[tracing]` does, incl. pydantic and the Rust toolchain build dependency
- [ ] docs: document that `collect-metrics` is incompatible with `ops[tracing]`
- [ ] docs: may want to split the current one page into the two/three use cases:
- how to trace your charm (from scratch)
  - code example for the `tracer = ...` object
  - code example for custom spans
  - code example for custom events
  - code example for
- [maybe?] how to trace your charm libs
- (mostly following the above, but simpler)
- explanation how that works when a charm uses the lib
- a paragraph on what happens when an old charm uses the lib
- how to migrate from charm-tracing charm lib
- tbd
- [ ] #1774 (single tox.ini)
- [x] #1729
- https://charmhub.io/topics/canonical-observability-stack/reference/juju-topology
- https://matrix.to/#/!wJiiHsLipVywuWOyNi:ubuntu.com/$k_Ff1ldo_cI0-S7FDm6admK4NRlI0_nYBQj00DfB_qM?via=ubuntu.com&via=matrix.org
- [ ] should we add the event's `.snapshot()` attributes as well? for relation events etc... that could be very handy. -- @PietroPasotti
- [x] minor cleanup:
  - `observer_path.split('/')[-1].split('[')[0]` vs `observer.__class__.__qualname__`
  - the relation data is set twice on the same relation
    - first it's cleared
    - then the exporter protocol is set
    - that's done in init
    - which means it's done on every single hook

@PietroPasotti ☝ this may be something you want to fix upstream
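One cheap mitigation for the init-time double write would be to touch relation data only when the stored value actually differs. This is just a sketch under assumed names: `databag` stands in for the relation data mapping that ops exposes; the real lib would call this from its setup path.

```python
def set_if_changed(databag, key: str, value: str) -> bool:
    """Write ``key`` only if the stored value differs; return True if written.

    Avoids a relation-data write (and the resulting relation-changed
    events on the remote side) on every single hook.
    """
    if databag.get(key) == value:
        return False
    databag[key] = value
    return True
```

With this guard, the clear-then-set dance in init becomes a no-op on every hook where the value is already correct.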
Re: re-coding databag operations in pure Python:
# Databag operations can be trivially re-coded in pure Python:
# - we'd get rid of pydantic dependency
# - we'd avoid init-time relation listing and databag update (tracing)
# - it's trivial to read the relation data bags, see below.
#
#
# certificate_transfer:
# - client doesn't write to the databag
# - client reads this from the databag:
# ```
# {"certificates": list[str]}
# ```
#
#
# tracing:
# - client leader sets own app data on every relation:
# ```
# {"receivers": ["otlp_http"]}
# ```
#
# - client reads this from remote_app_data:
# ```
# {"receivers": [
# {
# "protocol": {"name": "otlp_http", "type": "http"},
# "url": "http//somewhere:4318/v1/traces",
# },
# ...,
# ]}
# ```
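Reading the tracing databag in pure Python could look roughly like this. A sketch only: Juju databag values are JSON strings, the schema is the one quoted above, and the helper name is an assumption.

```python
import json


def parse_receivers(raw: str) -> list[tuple[str, str]]:
    """Extract (protocol name, url) pairs from the remote app databag's
    ``receivers`` value (a JSON-encoded list, per the schema above)."""
    try:
        receivers = json.loads(raw)
    except json.JSONDecodeError:
        return []
    pairs = []
    for r in receivers:
        if not isinstance(r, dict) or "url" not in r:
            continue
        protocol = r.get("protocol")
        if isinstance(protocol, dict) and protocol.get("name"):
            pairs.append((protocol["name"], r["url"]))
    return pairs
```

No pydantic, no init-time relation listing: the charm calls this lazily, only when it needs an export URL.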
From Pietro:
As soon as we have a trace ID, would you mind logging it (preferably in a way that's compatible with the existing log message) so we can pick it up with jhack?
https://discourse.charmhub.io/t/how-to-instrument-your-charm-with-tracing-telemetry/11012#p-26285-mapping-events-to-traces-with-jhack-tail-1
Regarding the comment in the previous issue about getting the Juju team's buy-in on Juju Topology, I tend to agree with Ben - this has been an agreed convention for observability for over three years at this point and customers, as well as other product teams, are relying on it to function this way to do filtering, aggregation and alerting.
I would hope/expect as part of this initiative, that all traces by default end up being tagged with all five labels that make up juju topology.
Here's the current state for trace data, as set by charm tracing charm lib:
https://github.com/canonical/tempo-coordinator-k8s-operator/blob/e6e18fe53023fb968ebf43dec67a18b851143a13/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py#L754-L763
Which is a bit different from log data, as set by the loki push api charm lib:
https://github.com/canonical/tempo-coordinator-k8s-operator/blob/e6e18fe53023fb968ebf43dec67a18b851143a13/lib/charms/loki_k8s/v1/loki_push_api.py#L2485-L2492
In short:
- `juju_xxx` tags are consistent
- other tags are not
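For reference, the five topology labels could be produced by a trivial helper like the one below. The `juju_*` key names follow the juju-topology convention linked above; how the dict is then attached (e.g. as OpenTelemetry Resource attributes) is left open.

```python
def juju_topology_attributes(
    model: str, model_uuid: str, application: str, unit: str, charm_name: str
) -> dict[str, str]:
    """The five Juju topology labels, suitable as trace resource attributes."""
    return {
        "juju_model": model,
        "juju_model_uuid": model_uuid,
        "juju_application": application,
        "juju_unit": unit,
        "juju_charm": charm_name,
    }
```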
I've removed test grouping / parallelisation from the integration tests.
Here are some ideas on how to expose groups in a more reasonable way, if we want to use them:
https://github.com/dimaqq/operator/actions/runs/14744282224
Code: https://github.com/dimaqq/operator/pull/43
Guts:
```yaml
- run: |
    # Report selected tests
    uvx tox -e integration -q \
      -- \
      --collect-only -q \
      --test-group-random-seed=0 \
      --test-group-count=4 \
      --test-group=${{ matrix.test-group }} |
      sed -r 's/\x1B\[[0-9;]*[a-zA-Z]//g' |
      awk 'BEGIN { print "## Selected tests\n```" } { print } END { print "```" }' >> $GITHUB_STEP_SUMMARY
```
We may need to handle more errors in the trace data exporter.
Scenario:
- charm -- tempo, charm -- cert, tempo -- cert: HTTPS export URL
- remove-app cert: switch to HTTP URL
Log:
```
2025-05-01T04:03:16.221Z [container-agent] 2025-05-01 04:03:16 ERROR juju-log charm-tracing:26: Exception while exporting Span batch.
2025-05-01T04:03:16.221Z [container-agent] Traceback (most recent call last):
2025-05-01T04:03:16.221Z [container-agent]   File "/var/lib/juju/agents/unit-hexanator-0/charm/venv/lib/python3.8/site-packages/ops_tracing/_export.py", line 141, in do_export
2025-05-01T04:03:16.221Z [container-agent]     with urllib.request.urlopen(  # noqa: S310
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
2025-05-01T04:03:16.221Z [container-agent]     return opener.open(url, data, timeout)
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 531, in open
2025-05-01T04:03:16.221Z [container-agent]     response = meth(req, response)
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
2025-05-01T04:03:16.221Z [container-agent]     response = self.parent.error(
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 569, in error
2025-05-01T04:03:16.221Z [container-agent]     return self._call_chain(*args)
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
2025-05-01T04:03:16.221Z [container-agent]     result = func(*args)
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
2025-05-01T04:03:16.221Z [container-agent]     raise HTTPError(req.full_url, code, msg, hdrs, fp)
2025-05-01T04:03:16.221Z [container-agent] urllib.error.HTTPError: HTTP Error 400: Bad Request
2025-05-01T04:03:16.221Z [container-agent]
2025-05-01T04:03:16.221Z [container-agent] During handling of the above exception, another exception occurred:
2025-05-01T04:03:16.221Z [container-agent]
2025-05-01T04:03:16.221Z [container-agent] Traceback (most recent call last):
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 556, in _get_chunk_left
2025-05-01T04:03:16.221Z [container-agent]     chunk_left = self._read_next_chunk_size()
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 523, in _read_next_chunk_size
2025-05-01T04:03:16.221Z [container-agent]     return int(line, 16)
2025-05-01T04:03:16.221Z [container-agent] ValueError: invalid literal for int() with base 16: b''
2025-05-01T04:03:16.221Z [container-agent]
2025-05-01T04:03:16.221Z [container-agent] During handling of the above exception, another exception occurred:
2025-05-01T04:03:16.221Z [container-agent]
2025-05-01T04:03:16.221Z [container-agent] Traceback (most recent call last):
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 573, in _readall_chunked
2025-05-01T04:03:16.221Z [container-agent]     chunk_left = self._get_chunk_left()
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 558, in _get_chunk_left
2025-05-01T04:03:16.221Z [container-agent]     raise IncompleteRead(b'')
2025-05-01T04:03:16.221Z [container-agent] http.client.IncompleteRead: IncompleteRead(0 bytes read)
2025-05-01T04:03:16.221Z [container-agent]
2025-05-01T04:03:16.221Z [container-agent] During handling of the above exception, another exception occurred:
2025-05-01T04:03:16.221Z [container-agent]
2025-05-01T04:03:16.221Z [container-agent] Traceback (most recent call last):
2025-05-01T04:03:16.221Z [container-agent]   File "/var/lib/juju/agents/unit-hexanator-0/charm/venv/lib/python3.8/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 362, in _export_batch
2025-05-01T04:03:16.221Z [container-agent]     self.span_exporter.export(self.spans_list[:idx])  # type: ignore
2025-05-01T04:03:16.221Z [container-agent]   File "/var/lib/juju/agents/unit-hexanator-0/charm/venv/lib/python3.8/site-packages/ops_tracing/_export.py", line 85, in export
2025-05-01T04:03:16.221Z [container-agent]     self.do_export(*rv)
2025-05-01T04:03:16.221Z [container-agent]   File "/var/lib/juju/agents/unit-hexanator-0/charm/venv/lib/python3.8/site-packages/ops_tracing/_export.py", line 153, in do_export
2025-05-01T04:03:16.221Z [container-agent]     resp = e.fp.read()[:1000]
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 466, in read
2025-05-01T04:03:16.221Z [container-agent]     return self._readall_chunked()
2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 580, in _readall_chunked
2025-05-01T04:03:16.221Z [container-agent]     raise IncompleteRead(b''.join(value))
2025-05-01T04:03:16.221Z [container-agent] http.client.IncompleteRead: IncompleteRead(0 bytes read)
```
Eventually this stabilises.
Perhaps we should recognise and silence such errors? Or recognise and log them in some concise format?
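One possible shape for the "recognise and log concisely" option: map the exception types seen in the log above to a single warning line. A sketch only; the function name and message wording are assumptions, not the current `ops_tracing` API.

```python
import http.client
import urllib.error


def summarise_export_error(exc: Exception) -> str:
    """One concise line for known-transient trace export failures,
    instead of a multi-chained traceback in the unit log."""
    # HTTPError first: it is a subclass of URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return f"trace export failed: HTTP {exc.code} from the collector"
    if isinstance(exc, http.client.IncompleteRead):
        return f"trace export failed: connection dropped mid-response ({len(exc.partial)} bytes read)"
    if isinstance(exc, urllib.error.URLError):
        return f"trace export failed: {exc.reason}"
    return f"trace export failed unexpectedly: {exc!r}"
```

The exporter would catch these around `urlopen`, call `logger.warning(summarise_export_error(e))`, and re-raise (or count) anything unexpected.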
Advise charming teams to test against grafana-agent, not tempo.
On the subjects of:
- log and trace data correlation
- jhack support
It would be enough if we:
- add something like `[traceid, spanid]` to each log line that's sent to Juju
- there's at least one log line per dispatch
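A sketch of how such a suffix could be injected with stdlib logging. The id widths follow the W3C trace-context convention (32 hex chars for the trace id, 16 for the span id); the `get_ids` hook is an assumption standing in for a lookup of the current OpenTelemetry span context.

```python
import logging
from typing import Callable, Optional, Tuple


class TraceIdFilter(logging.Filter):
    """Append ``[traceid=... spanid=...]`` to every record while a trace is active."""

    def __init__(self, get_ids: Callable[[], Optional[Tuple[int, int]]]):
        super().__init__()
        self._get_ids = get_ids  # returns (trace_id, span_id) ints, or None

    def filter(self, record: logging.LogRecord) -> bool:
        ids = self._get_ids()
        if ids is not None:
            trace_id, span_id = ids
            record.msg = f"{record.msg} [traceid={trace_id:032x} spanid={span_id:016x}]"
        return True  # never drop records, only annotate them
```

Because the filter sits on the logger, every log line of the dispatch gets the ids, which is what jhack would key on.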
Log line "Emitting Juju event ..." -- check if the root trace is already set up at that point.
There's also "Emitting deferred event" and "Emitting custom event" in ops/framework.py
Pietro's current code:
```python
span = _tracer.start_span(root_span_name, attributes={"juju.dispatch_path": dispatch_path})
# all these shenanigans are to work around the fact that the opentelemetry tracing API is built
# on the assumption that spans will be used as context managers.
# Since we don't (as we need to close the span on framework.commit),
# we need to manually set the root span as current.
ctx = set_span_in_context(span)
# log a trace id, so we can pick it up
root_trace_id = hex(span.get_span_context().trace_id)[2:]  # strip 0x prefix
logger.debug(f"Starting root trace with id={root_trace_id!r}.")
```
Today, charm logs are not delivered to COS, because Juju controller cannot be related to Loki.
There's experimental https://github.com/canonical/loki-k8s-operator/blob/main/lib/charms/loki_k8s/v0/charm_logging.py though that still needs to be thought through (push in-line? batch? a helper process? etc.).
So, today, traceId is only useful to jhack.
The tool invocation is something like `jhack tail -t`.
One idea: we could emit extra logs at TRACE level, so that by default (DEBUG) they are not emitted/captured.
Some folks do collect logs from production, and then process them with jhack offline.
Re: format, presumably `traceId: "42424242"` with this capitalisation, with or without quotes, fixed or variable width hex representation of the trace id, etc.
Prior art: https://www.reddit.com/r/devops/comments/1ggg14y/correlating_logs_and_traces_in_grafana/
Upstream docs: https://grafana.com/docs/grafana/latest/explore/trace-integration/
Improve ops tracing integration tests / a little simpler setup: https://github.com/canonical/pyroscope-k8s-operator/pull/109/files#diff-fe4b41d4a56a8bea9ef29cdcb16196e3fb1bee5eda7f04cacbfe7da1acfe6fc8
I've created separate issues for future work.