operator icon indicating copy to clipboard operation
operator copied to clipboard

ops[tracing] follow up

Open dimaqq opened this issue 1 year ago • 14 comments

Listing all the FIXME's I'm removing from #1612

  • [x] use in machine charms -- Dima needs to read this up
    • example charm using cos-agent instead of direct relations with COS: https://github.com/canonical/kafka-operator/blob/main/src/charm.py#L62
    • at first glance: that's metrics and logs, but not tracing
  • [x] update PYPI trusted publishing config
    • pypi.org
    • test.pypi.org
  • [ ] jhack support
  • [x] charm name from metadata
  • [ ] rewrite the vendored cert lib in pure Python
  • [ ] rewrite the vendored tracing lib in pure Python
  • [ ] drop pydantic dependency
  • [ ] update the docs wrt. the above
  • [ ] clean up / fix bugs in the tracing charm lib (see below)
  • [x] docs: document the low-level API, ops.tracing.set_destination()
  • [ ] docs: document that the URL is the full URL, with the URL path
  • [ ] docs: document that (url, ca) are saved in the database and should be updated when they change upstream
  • [ ] docs: what depending on ops[tracing] does, incl. pydantic and rust toolchain build dep
  • [ ] docs: document that collect-metrics is incompatible with ops[tracing]
  • [ ] docs: may want to split the current one page into the two/three use cases:
    • how to trace your charm (from scratch)
      • code example for tracer = ... object
      • code example for custom spans
      • code example for custom events
    • [maybe?] how to trace your charm libs
      • (mostly following the above, but simpler)
      • explanation how that works when a charm uses the lib
      • a paragraph on what happens when an old charm uses the lib
    • how to migrate from charm-tracing charm lib
      • tbd
  • [ ] #1774 (single tox.ini)
  • [x] #1729
    • https://charmhub.io/topics/canonical-observability-stack/reference/juju-topology
    • https://matrix.to/#/!wJiiHsLipVywuWOyNi:ubuntu.com/$k_Ff1ldo_cI0-S7FDm6admK4NRlI0_nYBQj00DfB_qM?via=ubuntu.com&via=matrix.org
  • [ ] should we add the event's .snapshot() attributes as well? for relation events etc... that could be very handy. -- @PietroPasotti
  • [x] minor cleanup:
    • observer_path.split('/')[-1].split('[')[0] vs observer.__class__.__qualname__

dimaqq avatar Mar 14 '25 02:03 dimaqq

Image

  • the relation data is set twice on the same relation
    • first it's cleared
    • then the exporter protocol is set
  • that's done in init
  • which means that's done on every single hook

dimaqq avatar Mar 27 '25 04:03 dimaqq

@PietroPasotti ☝ maybe it's something you may want to fix upstream

dimaqq avatar Mar 27 '25 04:03 dimaqq

Re: re-coding databag operations in pure Python:

  # Databag operations can be trivially re-coded in pure Python:
  # - we'd get rid of pydantic dependency
  # - we'd avoid init-time relation listing and databag update (tracing)
  # - it's trivial to read the relation data bags, see below.
  #
  #
  # certificate_transfer:
  # - client doesn't write to the databag
  # - client reads this from the databag:
  # ```
  # {"certificates": list[str]}
  # ```
  #
  #
  # tracing:
  # - client leader sets own app data on every relation:
  # ```
  # {"receivers": ["otlp_http"]}
  # ```
  #
  # - client reads this from remote_app_data:
  # ```
  # {"receivers": [
  #     {
  #         "protocol": {"name": "otlp_http", "type": "http"},
  #         "url": "http//somewhere:4318/v1/traces",
  #     },
  #     ...,
  # ]}
  # ```

dimaqq avatar Apr 09 '25 15:04 dimaqq

From Pietro:

As soon as we have a trace ID, would you mind logging it (preferably in a way that's compatible with the existing log message) so we can pick it up with jhack?

https://discourse.charmhub.io/t/how-to-instrument-your-charm-with-tracing-telemetry/11012#p-26285-mapping-events-to-traces-with-jhack-tail-1

dimaqq avatar Apr 15 '25 09:04 dimaqq

Regarding the comment in the previous issue about getting the Juju team's buy-in on Juju Topology, I tend to agree with Ben - this has been an agreed convention for observability for over three years at this point and customers, as well as other product teams, are relying on it to function this way to do filtering, aggregation and alerting.

I would hope/expect as part of this initiative, that all traces by default end up being tagged with all five labels that make up juju topology.

simskij avatar Apr 17 '25 12:04 simskij

Here's the current state for trace data, as set by charm tracing charm lib:

https://github.com/canonical/tempo-coordinator-k8s-operator/blob/e6e18fe53023fb968ebf43dec67a18b851143a13/lib/charms/tempo_coordinator_k8s/v0/charm_tracing.py#L754-L763

Which is a bit different from log data, as set by the loki push api charm lib:

https://github.com/canonical/tempo-coordinator-k8s-operator/blob/e6e18fe53023fb968ebf43dec67a18b851143a13/lib/charms/loki_k8s/v1/loki_push_api.py#L2485-L2492

In short:

  • juju_xxx tags are consistent
  • other tags are not

dimaqq avatar Apr 28 '25 11:04 dimaqq

I've removed test grouping / parallelisation from the integration tests.

Here are some ideas how to expose groups in a more reasonable way, if we want to use them:

https://github.com/dimaqq/operator/actions/runs/14744282224

Image

Code: https://github.com/dimaqq/operator/pull/43

Guts:

      - run: |
          # Report selected tests
          uvx tox -e integration -q \
            -- \
            --collect-only -q \
            --test-group-random-seed=0 \
            --test-group-count=4 \
            --test-group=${{ matrix.test-group }} |
          sed -r 's/\x1B\[[0-9;]*[a-zA-Z]//g' |
          awk 'BEGIN { print "## Selected tests\n```" } { print } END { print "```" }' >> $GITHUB_STEP_SUMMARY

dimaqq avatar Apr 30 '25 01:04 dimaqq

We may need to handle more errors in the trace data exporter.

Scenario:

  • charm -- tempo, charm -- cert, tempo -- cert: HTTPS export URL
  • remove-app cert: switch to HTTP URL

Log:

│ 2025-05-01T04:03:16.221Z [container-agent] 2025-05-01 04:03:16 ERROR juju-log charm-tracing:26: Exception while exporting Span batch.                                                                                                                                                                                                                                                                                                                                                                                                                           │
│ 2025-05-01T04:03:16.221Z [container-agent] Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/var/lib/juju/agents/unit-hexanator-0/charm/venv/lib/python3.8/site-packages/ops_tracing/_export.py", line 141, in do_export                                                                                                                                                                                                                                                                                                                                                                                 │
│ 2025-05-01T04:03:16.221Z [container-agent]     with urllib.request.urlopen(  # noqa: S310                                                                                                                                                                                                                                                                                                                                                                                                                                                                       │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen                                                                                                                                                                                                                                                                                                                                                                                                                                                  │
│ 2025-05-01T04:03:16.221Z [container-agent]     return opener.open(url, data, timeout)                                                                                                                                                                                                                                                                                                                                                                                                                                                                           │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 531, in open                                                                                                                                                                                                                                                                                                                                                                                                                                                     │
│ 2025-05-01T04:03:16.221Z [container-agent]     response = meth(req, response)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response                                                                                                                                                                                                                                                                                                                                                                                                                                            │
│ 2025-05-01T04:03:16.221Z [container-agent]     response = self.parent.error(                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 569, in error                                                                                                                                                                                                                                                                                                                                                                                                                                                    │
│ 2025-05-01T04:03:16.221Z [container-agent]     return self._call_chain(*args)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain                                                                                                                                                                                                                                                                                                                                                                                                                                              │
│ 2025-05-01T04:03:16.221Z [container-agent]     result = func(*args)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default                                                                                                                                                                                                                                                                                                                                                                                                                                       │
│ 2025-05-01T04:03:16.221Z [container-agent]     raise HTTPError(req.full_url, code, msg, hdrs, fp)                                                                                                                                                                                                                                                                                                                                                                                                                                                               │
│ 2025-05-01T04:03:16.221Z [container-agent] urllib.error.HTTPError: HTTP Error 400: Bad Request                                                                                                                                                                                                                                                                                                                                                                                                                                                                  │
│ 2025-05-01T04:03:16.221Z [container-agent]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      │
│ 2025-05-01T04:03:16.221Z [container-agent] During handling of the above exception, another exception occurred:                                                                                                                                                                                                                                                                                                                                                                                                                                                  │
│ 2025-05-01T04:03:16.221Z [container-agent]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      │
│ 2025-05-01T04:03:16.221Z [container-agent] Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 556, in _get_chunk_left                                                                                                                                                                                                                                                                                                                                                                                                                                             │
│ 2025-05-01T04:03:16.221Z [container-agent]     chunk_left = self._read_next_chunk_size()                                                                                                                                                                                                                                                                                                                                                                                                                                                                        │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 523, in _read_next_chunk_size                                                                                                                                                                                                                                                                                                                                                                                                                                       │
│ 2025-05-01T04:03:16.221Z [container-agent]     return int(line, 16)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             │
│ 2025-05-01T04:03:16.221Z [container-agent] ValueError: invalid literal for int() with base 16: b''                                                                                                                                                                                                                                                                                                                                                                                                                                                              │
│ 2025-05-01T04:03:16.221Z [container-agent]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      │
│ 2025-05-01T04:03:16.221Z [container-agent] During handling of the above exception, another exception occurred:                                                                                                                                                                                                                                                                                                                                                                                                                                                  │
│ 2025-05-01T04:03:16.221Z [container-agent]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      │
│ 2025-05-01T04:03:16.221Z [container-agent] Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 573, in _readall_chunked                                                                                                                                                                                                                                                                                                                                                                                                                                            │
│ 2025-05-01T04:03:16.221Z [container-agent]     chunk_left = self._get_chunk_left()                                                                                                                                                                                                                                                                                                                                                                                                                                                                              │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 558, in _get_chunk_left                                                                                                                                                                                                                                                                                                                                                                                                                                             │
│ 2025-05-01T04:03:16.221Z [container-agent]     raise IncompleteRead(b'')                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        │
│ 2025-05-01T04:03:16.221Z [container-agent] http.client.IncompleteRead: IncompleteRead(0 bytes read)                                                                                                                                                                                                                                                                                                                                                                                                                                                             │
│ 2025-05-01T04:03:16.221Z [container-agent]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      │
│ 2025-05-01T04:03:16.221Z [container-agent] During handling of the above exception, another exception occurred:                                                                                                                                                                                                                                                                                                                                                                                                                                                  │
│ 2025-05-01T04:03:16.221Z [container-agent]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      │
│ 2025-05-01T04:03:16.221Z [container-agent] Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/var/lib/juju/agents/unit-hexanator-0/charm/venv/lib/python3.8/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 362, in _export_batch                                                                                                                                                                                                                                                                                                                                                         │
│ 2025-05-01T04:03:16.221Z [container-agent]     self.span_exporter.export(self.spans_list[:idx])  # type: ignore                                                                                                                                                                                                                                                                                                                                                                                                                                                 │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/var/lib/juju/agents/unit-hexanator-0/charm/venv/lib/python3.8/site-packages/ops_tracing/_export.py", line 85, in export                                                                                                                                                                                                                                                                                                                                                                                     │
│ 2025-05-01T04:03:16.221Z [container-agent]     self.do_export(*rv)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/var/lib/juju/agents/unit-hexanator-0/charm/venv/lib/python3.8/site-packages/ops_tracing/_export.py", line 153, in do_export                                                                                                                                                                                                                                                                                                                                                                                 │
│ 2025-05-01T04:03:16.221Z [container-agent]     resp = e.fp.read()[:1000]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 466, in read                                                                                                                                                                                                                                                                                                                                                                                                                                                        │
│ 2025-05-01T04:03:16.221Z [container-agent]     return self._readall_chunked()                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   │
│ 2025-05-01T04:03:16.221Z [container-agent]   File "/usr/lib/python3.8/http/client.py", line 580, in _readall_chunked                                                                                                                                                                                                                                                                                                                                                                                                                                            │
│ 2025-05-01T04:03:16.221Z [container-agent]     raise IncompleteRead(b''.join(value))                                                                                                                                                                                                                                                                                                                                                                                                                                                                            │
│ 2025-05-01T04:03:16.221Z [container-agent] http.client.IncompleteRead: IncompleteRead(0 bytes read)

Eventually this stabilises.

Perhaps we should recognise and silence such errors? Or recognise and log them in some concise format?

dimaqq avatar May 01 '25 04:05 dimaqq

Advise charming teams to test against grafana-agent, not tempo.

dimaqq avatar May 09 '25 09:05 dimaqq

On the subjects of:

  • log and trace data correlation
  • jhack support

It would be enough if we:

  • add smth like [traceid, spanid] to each log line that's sent to Juju
  • there's at least one log line per dispatch

Log line "Emitting Juju event ..." -- check if the root trace is already set up at that point.

There's also "Emitting deferred event" and "Emitting custom event" in ops/framework.py

Pietro's current code:

span = _tracer.start_span(root_span_name, attributes={"juju.dispatch_path": dispatch_path})

        # all these shenanigans are to work around the fact that the opentelemetry tracing API is built
        # on the assumption that spans will be used as contextmanagers.
        # Since we don't (as we need to close the span on framework.commit),
        # we need to manually set the root span as current.
        ctx = set_span_in_context(span)

        # log a trace id, so we can pick it u
root_trace_id = hex(span.get_span_context().trace_id)[2:]  # strip 0x prefix
        logger.debug(f"Starting root trace with id={root_trace_id!r}.")

dimaqq avatar Jun 10 '25 07:06 dimaqq

Today, charm logs are not delivered to COS, because Juju controller cannot be related to Loki.

There's experimental https://github.com/canonical/loki-k8s-operator/blob/main/lib/charms/loki_k8s/v0/charm_logging.py though that still needs to be thought through (push in-line? batch? a helper process? etc.).

So, today, traceId is only useful to jhack.

The tool invocation is smth like jhack tail -t.

dimaqq avatar Jun 10 '25 07:06 dimaqq

One idea: we could emit extra logs at TRACE level, so that by default (DEBUG) they are not emitted/captured.

Some folks do collect logs from production, and then process them with jhack offline.

dimaqq avatar Jun 10 '25 07:06 dimaqq

Re: format, presumably tracedId: "42424242" with this capitalisation, with or without quotes, fixed or variable width traceid hex representation, etc.

Prior art: https://www.reddit.com/r/devops/comments/1ggg14y/correlating_logs_and_traces_in_grafana/

Upstream docs: https://grafana.com/docs/grafana/latest/explore/trace-integration/

dimaqq avatar Jun 10 '25 07:06 dimaqq

Improve ops tracing integration tests / a little simpler setup: https://github.com/canonical/pyroscope-k8s-operator/pull/109/files#diff-fe4b41d4a56a8bea9ef29cdcb16196e3fb1bee5eda7f04cacbfe7da1acfe6fc8

dimaqq avatar Jun 24 '25 07:06 dimaqq

I've created separate issues for future work.

dimaqq avatar Aug 12 '25 14:08 dimaqq