Hydro Observability meta-issue

Open MingweiSamuel opened this issue 2 months ago • 8 comments

Previous iteration of this issue: https://github.com/hydro-project/hydro/issues/928

Purpose & Background

We would like to integrate observability directly into Hydro to provide users with a powerful but easy, batteries-included monitoring experience. Our north star is observability powerful enough to diagnose subtle bugs and performance issues in production Hydro deployments, and ideally it should just work with minimal configuration.

Our observability story should cover a few key categories of information:

  • Tracing - the logical flow of execution across time [and location, for global tracing]
  • Logging - unstructured messages at instants in time
  • Metrics - structured data corresponding to key properties of the system at instants in time
  • Profiling - the complete CPU (or memory) usage distribution across code
  • Crash reporting - info about where/when/why the process panics, crashes, or returns a Result::Err

Tracing

Traces provide context as to where events happen (where events could be log messages or metrics).

Tracing works with spans, which have a start and an end. Spans form a hierarchical tree, where each span may have multiple child spans between its start and end. For local tracing, this is used to represent [important] function calls and form a function call stack. Across machines, function calls are replaced by inter-machine RPCs or request-responses, but the call-stack structure is the same.

This unfortunately maps very poorly to dataflow graphs. Most dataflow systems flatten the span hierarchy to just a few levels, e.g. corresponding to the machine, graph, subgraph, and operator. I propose to follow this paradigm for now.

The other option is to match the trace to the path of a datum through the system. This works for simple linear chains of operators, ~~but fails for joins, tees, etc. In the most extreme case, each datum would carry its entire provenance in a list of spans.~~ I believe this is possible to do with a custom tracing system, and will be later work (see comment below).

Logging

Logs are unstructured, human-readable text messages emitted at instants in time. Logging is relatively uncomplicated, but we should make sure it records tracing context properly. log is the standard facade in Rust, and integrates with tracing and pretty much anything else.
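
As a small illustration (not Hydro-specific; assumes the log, tracing, and tracing-subscriber crates, with tracing-subscriber's default tracing-log feature, which bridges log records into tracing):

```rust
// Minimal sketch: both `log` and `tracing` emissions flow to one subscriber.
fn main() {
    // The fmt subscriber; with the default `tracing-log` feature enabled,
    // this also installs a bridge so `log` records become tracing events.
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::INFO)
        .init();

    log::info!("emitted via the `log` facade");
    tracing::info!(count = 3, "emitted via `tracing`, with structured fields");
}
```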

Metrics

Metrics are numeric properties (structured data) of the system at instants in time. A few standard types are counters (monotonically increasing integers), gauges (values that can go up and down), and histograms (distributions of observed values).
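
For illustration, the three kinds look like this with the metrics crate's macro facade (0.22-style API; the metric names here are made up):

```rust
// Sketch of the three standard metric kinds via the `metrics` crate facade.
// Any recorder/exporter (Prometheus, OTel, etc.) can be plugged in behind it.
fn record_batch(batch_len: usize, queue_depth: usize, handle_secs: f64) {
    // Counter: monotonically increasing.
    metrics::counter!("items_processed_total").increment(batch_len as u64);
    // Gauge: a value that can go up and down.
    metrics::gauge!("queue_depth").set(queue_depth as f64);
    // Histogram: a distribution of observed values.
    metrics::histogram!("batch_handle_seconds").record(handle_secs);
}
```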

The AWS OSS metrique crate provides an alternative model of unit-of-work metrics, where multiple metrics are grouped together in a standard Rust struct and emitted as a single unit. This maps well to a request-response model but not to dataflow. However, we should eventually support metrique at the Hydro user level and leave it up to the user to define what their unit of work is. (The simplest implementation option is to carry metrique struct instances within the dataflow.)

At the DFIR level, I plan to expose an API to inspect runtime telemetry, similar to Tokio's RuntimeMetrics. At the hydro_deploy level, I plan to read the DFIR and Tokio metrics and export them in a pluggable/configurable way.
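
As a rough sketch of the shape I have in mind (the DFIR-side names are hypothetical and do not exist yet; the Tokio side uses the real Handle::metrics()):

```rust
// Hypothetical DFIR metrics snapshot; none of these names exist in DFIR today.
pub struct DfirMetricsSnapshot {
    pub subgraph_runs: u64,                 // times subgraphs were scheduled and ran
    pub items_processed: u64,               // items moved across handoffs
    pub busy_duration: std::time::Duration, // time spent running subgraphs
}

fn report(dfir: &DfirMetricsSnapshot) {
    // Tokio's real runtime metrics (must be called from within the runtime).
    let tokio_metrics = tokio::runtime::Handle::current().metrics();
    println!(
        "workers={} subgraph_runs={} items={}",
        tokio_metrics.num_workers(),
        dfir.subgraph_runs,
        dfir.items_processed,
    );
}
```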

Profiling

Profiling is done outside the process (DFIR), at the OS/kernel level, and therefore requires separate hydro_deploy-level mechanisms. CPU profiling and flamegraph visualization are already well supported in hydro_deploy, with the caveat that they only cover the entire execution of a program. For long-running programs we may want a way to create "live" (over-time) flamegraphs as the system runs.

Crash reporting

With crash reporting, we should ensure stack traces are emitted and can be mapped, manually or automatically, back to DFIR and Hydro dataflow operators.

Plan

Our plan is to keep telemetry flexible and handle its movement primarily within Hydro Deploy. Thanks to the DFIR runtime and Hydro Deploy, we have easy high-level control over how we want to handle metrics. This is in contrast with conventional services, which require more ergonomics around metric handling and exporting.

Hydro Deploy will have pluggable/configurable telemetry export targets. In the Rust process, when we run a DFIR graph, we can also easily report metrics from DFIR and Tokio periodically. Separately, outside of the Rust process, Hydro Deploy can record/export profiling data (CPU profiling already works) and record/process/export crash data if the process crashes or panics.

architecture-beta
    group deploy(cloud)[Deployment]
    service egress(internet)[Hydro Deploy observability] in deploy

    group w0(server)[Server 0] in deploy
    service w0_crash(blank)[Crash panic reporter] in w0
    service w0_profile(blank)[Profile recorder] in w0

    group w0_proc(blank)[Rust Tokio process] in w0

    service w0_tokio_metrics(blank)[Tokio metrics] in w0_proc
    service w0_exporter(blank)[Metrics reporter] in w0_proc

    group w0_dfir(blank)[DFIR] in w0_proc
    service w0_dfir_metrics(blank)[DFIR metrics] in w0_dfir

    w0_tokio_metrics:L --> R:w0_exporter
    w0_dfir_metrics:T --> B:w0_exporter
    w0_crash:T --> B:egress
    w0_exporter:T --> R:egress
    w0_profile:T --> L:egress
    w0_profile:R -- L:w0_crash
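
The in-process "Metrics reporter" box above could be as simple as a periodic task; a sketch, where the exporter trait and the DFIR snapshot hook are hypothetical:

```rust
use std::time::Duration;

// Hypothetical pluggable export target; hydro_deploy would configure which
// implementation gets used.
trait MetricsExporter: Send {
    fn export(&mut self, line: String);
}

async fn report_loop(mut exporter: Box<dyn MetricsExporter>) {
    let mut ticker = tokio::time::interval(Duration::from_secs(10));
    loop {
        ticker.tick().await;
        let tokio_metrics = tokio::runtime::Handle::current().metrics();
        // A real implementation would also read the DFIR-level metrics here.
        exporter.export(format!("tokio.num_workers={}", tokio_metrics.num_workers()));
    }
}
```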

Tracing within DFIR

I plan to implement tracing within the DFIR dataflow graph as a three-level hierarchy of process -> subgraph -> operator. This will involve somewhat advanced use of the de facto standard tracing crate. Logging within user operators will have the span context automatically attached. In the future, user metrics (from the metrics or metrique crates) will as well.
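
Roughly the shape of the span hierarchy, sketched directly with the tracing crate (span and field names are illustrative, not the final instrumentation):

```rust
// Sketch of the process -> subgraph -> operator span hierarchy.
fn run_operator_batch() {
    let process_span = tracing::info_span!("dfir_process", process_id = 0u64);
    let _process = process_span.enter();

    let subgraph_span = tracing::info_span!("dfir_subgraph", subgraph_id = 3u64);
    let _subgraph = subgraph_span.enter();

    let operator_span = tracing::info_span!("dfir_operator", op = "map", loc = "src/main.rs:42");
    let _operator = operator_span.enter();

    // Anything emitted here (user logs, and later metrics) automatically
    // carries all three spans as context.
    tracing::info!("processing a batch");
}
```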

User-code logging

Logging will be done with the standard log or tracing crates, and we should also properly handle [e]println!() calls.

Telemetry export targets

For initial versions, I plan to export to local files on the developer's (i.e. leader) machine (as we currently do with CPU profiles) via the OpenTelemetry format. After this, the first priority is to support AWS CloudWatch via OTel and/or EMF. In the future, arbitrary targets can be added per customer needs (Datadog, Prometheus, GCP Cloud Monitoring, Azure Monitor, etc.).
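
For the pluggable configuration, something along these lines (entirely hypothetical types, just to show the intended granularity):

```rust
// Hypothetical hydro_deploy-level telemetry export configuration.
pub enum TelemetryTarget {
    /// Write OpenTelemetry (OTLP) files to the developer/leader machine,
    /// similar to how CPU profiles are handled today.
    LocalOtelFile { dir: std::path::PathBuf },
    /// Ship to AWS CloudWatch via OTel and/or EMF-formatted records.
    CloudWatch { region: String, namespace: String },
    /// Escape hatch for future targets (Datadog, Prometheus, etc.).
    Custom { endpoint: String },
}

pub struct TelemetryConfig {
    pub targets: Vec<TelemetryTarget>,
    /// How often the in-process reporter flushes metrics.
    pub flush_interval: std::time::Duration,
}
```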

Alternatives

For tracing, creating spans for the full topological history of data adds significant performance overhead. However, if users do want this level of per-datum tracing, user-level code that utilizes metrique or similar may be able to provide this functionality. This leaves it up to the user to decide whether the performance cost is worth it, per pipeline.

Logging is relatively standardized, and there are not really any accepted alternatives to log/tracing/println!().

For metrics, existing solutions include libraries like metrique and metrics, and format standards like OpenTelemetry and AWS EMF. Our plan uses standard telemetry concepts which translate well to any of these libraries/formats, so there is no reason to lock in to a specific library or format; we can instead translate from Rust data into different formats as needed.

Profiling is already implemented and will be kept as-is unless customer needs push us in a new direction.

The plan for crash reporting remains basic for now, with few decisions made.

MingweiSamuel · Oct 08 '25 22:10

Tracing

  • Provenance is a very powerful differentiator for dataflow debugging. And I suspect for Hydro-style programs it would scale OK for lots of use cases (as opposed to queries over giant tables). It might be opt-in for users due to cost, but I think we should make it very easy to use.
  • On a related but lower-function note, can we be more specific about the integration into distributed tracing tools like OpenTelemetry or Zipkin? Do we benefit from their ability to assemble events across distributed services? Does our dataflow model allow us to do better than that?
  • In both cases I imagine we want a compilation flag that wraps each dataflow operator with logging/provenance hooks that either pass state or log state so it can be assembled cumulatively across the flow (either live or post-hoc from log reconstruction).

Metrics

  • I don't understand the discussion of why metrique is a misfit for dataflow, and how you propose to use it later in the doc despite that misfit.

jhellerstein · Oct 10 '25 19:10

When we start adding log support, we'll need to be careful about the captured spans. Since the staged code is copy-pasted as-is, the spans actually resolve in the trybuild, since the macro expansion happens after the q! proc macro. We already have this issue with dbg!.

I think the workaround will probably be to have q! special-case certain macro calls to eagerly expand them and capture the original spans (serialization might be tricky if they use proc_macro::Span...).

shadaj · Oct 10 '25 19:10

> The other option is to match the trace to the path of a datum through the system. This works for simple linear chains of operators, but fails for joins, tees, etc. In the most extreme case, each datum would carry its entire provenance in a list of spans.

It's not clear to me how that would actually work. When does the span data get emitted in this case?

> We plan to implement tracing within the DFIR dataflow graph as a three level hierarchy of process -> subgraph -> operator.

Traces in otel usually require a root trace id, what is the trace id in this model? Is it just a single trace that has spans that get repeatedly re-entered? or will it emit a new trace-id on each dfir loop iteration?

luckyworkama · Oct 10 '25 21:10

The crash reporting/tracing process is a sidecar; how does this interact with containerized deployments? Do you think multiple processes inside the single container, or multiple containers on the same host? I think stuff like tracing needs to be inside the same container, otherwise resolving addresses becomes very weird.

luckyworkama · Oct 10 '25 21:10

@jhellerstein

Alright, I will look into provenance a bit more to see how possible it is to make that work now; maybe it is more possible than I am giving it credit for.

Literature:

  • https://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper37.pdf
  • https://cseweb.ucsd.edu/~kyocum/pubs/SoCC-2013-CR.pdf
  • https://www.vldb.org/pvldb/vol9/p216-interlandi.pdf
  • https://cloud.google.com/dataflow/docs/guides/lineage

For integrations (OTel vs Zipkin vs Datadog, etc.), my understanding is that each vendor has a different format for emitting traces/logs/metrics, but everyone supports the OTel formats, so we will use that for now and see what limitations we run into.

Clarified the metrique point above: it may fit for users in that they can define their own unit of work, but for Hydro overall it's not clear what the unit of work should be.


@shadaj I think you may be worrying about proc_macro::Spans, which are different from tracing::Spans. The latter are more configurable/customizable. (Though it's maybe not a bad idea to include the former's source line number info in the latter.)


@luckyworkama

> When does the span data get emitted in this case?

Any events, i.e. log emissions or metric emissions, would contain the tracing data as context, is what I'm imagining.

> Traces in otel usually require a root trace id, what is the trace id in this model? Is it just a single trace that has spans that get repeatedly re-entered? or will it emit a new trace-id on each dfir loop iteration?

Not sure; the data conveyed is the same regardless, so whichever seems to work best with monitoring tools. We may have to go back to the drawing board if none are useful.

For containerized deployments, I imagine multiple processes within the same container for sure.

Before containers, any crash reporting would be done via the SSH client in hydro_deploy. Result::Err reporting can always be done from within the DFIR process.

MingweiSamuel · Oct 20 '25 18:10

So I believe we could get full ("record-level") tracing support for provenance/lineage by carrying a u64 ID alongside each item in the dataflow, in the hot path [src].

The user would want to enable/disable this without having to change their business code, but the codegen would have to change to support it (hiding the extra data from the user).
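
Roughly, the codegen would thread the tag through the hot path while keeping user closures unchanged; an illustrative sketch:

```rust
// Illustrative only: a map operator that threads a provenance ID alongside
// each item. Real DFIR codegen would hide the tag from user code.
type TraceId = u64;

fn map_tagged<A, B>(
    input: impl Iterator<Item = (TraceId, A)>,
    mut f: impl FnMut(A) -> B,
) -> impl Iterator<Item = (TraceId, B)> {
    // The user-provided closure `f` only ever sees `A`, never the tag.
    input.map(move |(id, item)| (id, f(item)))
}
```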

MingweiSamuel · Oct 21 '25 18:10

Full Tracing

I spent a few days studying how full record-level tracing/provenance/lineage could work in DFIR. The post covers my dive into how the tracing crate works and how we can do better to avoid even more overhead.

Full post moved here

MingweiSamuel · Oct 23 '25 21:10

Looking into how best to instrument execution of subgraphs. tokio-metrics provides comprehensive, general instrumentation of async tasks, but testing it in DFIR shows 15% longer runtime on some microbenchmarks. tokio-metrics records more properties than we care about [right now], and uses atomics and Arc to support multiple threads, whereas we could track things directly in SubgraphData. Based on this, I think it is best to implement custom instrumentation of subgraphs in DFIR, with tokio-metrics as inspiration.
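
For a sense of what custom instrumentation could look like, a sketch of single-threaded counters that could live in SubgraphData (fields are illustrative):

```rust
use std::time::{Duration, Instant};

// Sketch: per-subgraph counters without atomics or Arc, since each subgraph
// is polled from a single place.
#[derive(Default)]
struct SubgraphStats {
    runs: u64,            // times this subgraph was scheduled and ran
    total_busy: Duration, // cumulative time spent running it
}

impl SubgraphStats {
    fn record_run<R>(&mut self, run: impl FnOnce() -> R) -> R {
        let start = Instant::now();
        let out = run();
        self.runs += 1;
        self.total_busy += start.elapsed();
        out
    }
}
```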

MingweiSamuel · Oct 29 '25 23:10