
feat(dfir_lang): full per-record tracing

Open MingweiSamuel opened this issue 1 month ago • 0 comments

From #2178

Full Tracing

I spent a few days studying how full record-level tracing/provenance/lineage could work in DFIR. This issue covers how the tracing crate works and how DFIR can do better by avoiding much of its overhead.

Quick summary: on the graph reachability benchmark, a toy model of full tracing adds ~15% overhead, while using the tracing crate adds ~60% (and does not even record lineage by default).

How tracing works

tracing has two relevant core concepts: Spans, which represent a period of time and associated context and can be entered and exited, and Subscribers, which collect trace data as it is emitted.

Each span conceptually goes through a lifecycle of being registered, entered & exited, and then closed. However, spans are essentially stateless u64 IDs -- instead each of these span events is sent to a globally-registered subscriber which is responsible for the bookkeeping. Internally, the subscriber will store a slab of spans (and associated context data) and stack of currently-entered spans.

How DFIR tracing would be different

Although a tracing::Span is mainly a u64 ID, each one also carries a reference to some metadata (name, verbosity level, file/module info, etc.) and a reference to the subscriber, bringing each span to a total size of five u64s (40 bytes). In DFIR we can stick with just a single u64 ID. We know all the dataflow operators and graph structure at compile time (codegen time), so we can pre-compute/pre-generate the metadata. We would also have a single "one true subscriber" living inside the Dfir runtime instance, eliminating any need to track different subscribers.

Because DFIR runs within a Dfir runtime instance on a single thread, we have no need for a global subscriber and no contention across threads while handling spans. We can create a new span simply by incrementing an integer counter. We have no need to manage a stack of spans. We push to a local edge list buffer to record the "follows from" structure of lineage, and export that data outside of the hot path.
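To make the counter-plus-edge-buffer idea concrete, here is a minimal sketch. The names SpanId, Tracer, new_span, new_span_from, and drain_edges are hypothetical and exist only for illustration; none of them are taken from DFIR or from #2208.

```rust
/// Hypothetical span ID: a bare u64 with no metadata or subscriber pointer.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct SpanId(pub u64);

/// Hypothetical per-runtime tracer. It is owned by one thread, so it needs
/// no locks, no atomics, and no stack of entered spans.
#[derive(Debug, Default)]
pub struct Tracer {
    next_id: u64,
    /// Buffered "follows from" edges, stored as (parent, child) pairs.
    edges: Vec<(SpanId, SpanId)>,
}

impl Tracer {
    /// Creates a span with no parents by incrementing the counter.
    pub fn new_span(&mut self) -> SpanId {
        let id = SpanId(self.next_id);
        self.next_id += 1;
        id
    }

    /// Creates a span that follows from every span in `parents`.
    pub fn new_span_from(&mut self, parents: &[SpanId]) -> SpanId {
        let child = self.new_span();
        self.edges.extend(parents.iter().map(|&parent| (parent, child)));
        child
    }

    /// Hands the buffered edges to an exporter, outside of the hot path.
    pub fn drain_edges(&mut self) -> Vec<(SpanId, SpanId)> {
        std::mem::take(&mut self.edges)
    }
}
```

On the hot path, the only work is an integer increment and at most a few Vec pushes; everything else (metadata lookup, serialization) can happen when drain_edges is called.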

DFIR operators will need to handle the span IDs in various ways. Simple linear operators have no need to modify the span ID and can therefore pass it unchanged alongside the real data. union() and/or tee()s will need to create new spans to track which fork the data came from/went down (I am not 100% sure at this time which, or if both, should create new spans). Joins will take the span IDs from the individual joined records and create a new span which follows from both.
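A rough sketch of the two clearest cases, a linear operator and a join, might look like the following. It operates on plain Vecs rather than real DFIR operators, and traced_map, traced_join, and Tracer are hypothetical names invented for this illustration.

```rust
use std::collections::HashMap;
use std::hash::Hash;

type SpanId = u64;

/// Hypothetical tracer: an ID counter plus a (parent, child) edge buffer.
#[derive(Default)]
struct Tracer {
    next_id: SpanId,
    edges: Vec<(SpanId, SpanId)>,
}

impl Tracer {
    /// Allocates a new span that follows from each of `parents`.
    fn new_span_from(&mut self, parents: &[SpanId]) -> SpanId {
        let child = self.next_id;
        self.next_id += 1;
        self.edges.extend(parents.iter().map(|&p| (p, child)));
        child
    }
}

/// Linear operator: transforms the payload and passes the span ID through
/// unchanged, so it never touches the tracer.
fn traced_map<T, U>(input: Vec<(SpanId, T)>, f: impl Fn(T) -> U) -> Vec<(SpanId, U)> {
    input.into_iter().map(|(span, x)| (span, f(x))).collect()
}

/// Join: every output record gets a fresh span that follows from the spans
/// of both joined input records.
fn traced_join<K, A, B>(
    tracer: &mut Tracer,
    left: Vec<(SpanId, (K, A))>,
    right: Vec<(SpanId, (K, B))>,
) -> Vec<(SpanId, (K, A, B))>
where
    K: Hash + Eq + Clone,
    A: Clone,
    B: Clone,
{
    // Build a hash table over the left side, keeping each record's span.
    let mut table: HashMap<K, Vec<(SpanId, A)>> = HashMap::new();
    for (span, (k, a)) in left {
        table.entry(k).or_default().push((span, a));
    }
    // Probe with the right side and emit one new span per match.
    let mut out = Vec::new();
    for (rspan, (k, b)) in right {
        if let Some(matches) = table.get(&k) {
            for (lspan, a) in matches {
                let span = tracer.new_span_from(&[*lspan, rspan]);
                out.push((span, (k.clone(), a.clone(), b.clone())));
            }
        }
    }
    out
}
```

union() and tee() are left out here because, as noted above, it is still open which of them should allocate new spans.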

In #2208 I created some mock benchmarks to see how tracing may perform. For graph reachability (arguably the most meaningful benchmark) the hot-path overhead was about 15%. In contrast, the overhead for using tracing::Span was much higher at 60%. Partial tracing (only tracing a fraction of records) does not seem worth it, as it still has a 10%+ overhead even when tracking 0% of records, because every record must still check and branch on whether its span ID exists.
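The cost of partial tracing is easiest to see in a sketch. Assuming untraced records carry None in place of a span ID (a hypothetical representation, not necessarily the one benchmarked in #2208), a join-style operator would need something like this:

```rust
type SpanId = u64;

/// Hypothetical tracer for the partial-tracing variant.
#[derive(Default)]
struct Tracer {
    next_id: SpanId,
    edges: Vec<(SpanId, SpanId)>,
}

impl Tracer {
    /// Combines the lineage of two records, either of which may be untraced.
    fn join_spans(&mut self, left: Option<SpanId>, right: Option<SpanId>) -> Option<SpanId> {
        // This check runs for every record pair, even when 0% are sampled.
        if left.is_none() && right.is_none() {
            return None;
        }
        let child = self.next_id;
        self.next_id += 1;
        // Record an edge from each parent that actually carries a span.
        self.edges.extend(left.into_iter().chain(right).map(|p| (p, child)));
        Some(child)
    }
}
```

Even when both inputs are None, the branch (and the wider Option-carrying record) stays on the hot path, which is consistent with the overhead not dropping to zero at a 0% sampling rate.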

The fork-join microbenchmark shows much higher overhead of ~60-65% for tracing because (1) every fork/join needs new tracing info and (2) the denominator is small since no meaningful work is being done.

Implementation in DFIR and Hydro

At the dfir_lang level, the whole codegen would need to be changed to support the span ID metadata alongside the actual data. Operators with multiple inputs/outputs would need bigger changes. This would be quite a big lift, modifying the codegen of every operator and possibly needing to create modified Stream and Sink traits. We also should ensure we have a way to disable lineage for super performance-critical pipelines (ideally with generics, to avoid maintaining multiple code paths).
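One possible shape for the "disable with generics" idea is a trait with an associated per-record tag type, sketched below. Lineage, NoLineage, FullLineage, and cross_sum are all hypothetical names; the real design would have to fit dfir_lang's codegen and whatever Stream and Sink traits it ends up using.

```rust
/// Hypothetical abstraction over "lineage on" and "lineage off".
trait Lineage {
    /// Per-record tag carried alongside the real data.
    type Tag: Copy;
    /// Tags a record entering the graph.
    fn source(&mut self) -> Self::Tag;
    /// Tags a record derived from two parents.
    fn merge(&mut self, a: Self::Tag, b: Self::Tag) -> Self::Tag;
}

/// Lineage disabled: the tag is zero-sized and every method is a no-op.
struct NoLineage;

impl Lineage for NoLineage {
    type Tag = ();
    fn source(&mut self) -> Self::Tag {}
    fn merge(&mut self, _a: (), _b: ()) -> Self::Tag {}
}

/// Lineage enabled: u64 span IDs plus a (parent, child) edge buffer.
#[derive(Default)]
struct FullLineage {
    next_id: u64,
    edges: Vec<(u64, u64)>,
}

impl Lineage for FullLineage {
    type Tag = u64;
    fn source(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        id
    }
    fn merge(&mut self, a: u64, b: u64) -> u64 {
        let child = self.source();
        self.edges.push((a, child));
        self.edges.push((b, child));
        child
    }
}

/// A toy two-input operator written once and used in both modes.
fn cross_sum<L: Lineage>(
    lineage: &mut L,
    left: &[(L::Tag, i64)],
    right: &[(L::Tag, i64)],
) -> Vec<(L::Tag, i64)> {
    let mut out = Vec::new();
    for &(lt, l) in left {
        for &(rt, r) in right {
            out.push((lineage.merge(lt, rt), l + r));
        }
    }
    out
}
```

With NoLineage the tag is the unit type, so ((), i64) has the same size as a bare i64 and the no-op methods can be inlined away, while the operator body itself is maintained only once.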

At the DFIR graph/context level, span IDs would need to be generated (counted) and edges stored and later exported.

Between machines, tracing information can be passed over the network alongside data to track global lineage. Separate "machine IDs" can augment the local span IDs and be stored off of the hot path to avoid any ID collisions while avoiding overhead. All the tracing info can then be exported and aggregated by hydroscope or other tracing visualizers to re-create the lineage of any piece of data throughout the graph.
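As an illustration of how machine IDs could be attached at export time rather than on the hot path, consider the sketch below. GlobalSpanId, ExportedLog, and aggregate are hypothetical, and the split into local_edges and remote_edges is an assumption about how cross-machine parents might be recorded, not a description of hydroscope's input format.

```rust
type SpanId = u64;
type MachineId = u32;

/// Hypothetical globally unique span ID, built only during export.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct GlobalSpanId {
    machine: MachineId,
    span: SpanId,
}

/// Hypothetical edge log exported by one machine.
struct ExportedLog {
    machine: MachineId,
    /// Edges between two local spans, stored as bare (parent, child) IDs.
    local_edges: Vec<(SpanId, SpanId)>,
    /// Edges whose parent arrived over the network from another machine.
    remote_edges: Vec<(GlobalSpanId, SpanId)>,
}

/// Merges per-machine logs into one global lineage edge list.
fn aggregate(logs: &[ExportedLog]) -> Vec<(GlobalSpanId, GlobalSpanId)> {
    let mut edges = Vec::new();
    for log in logs {
        // Qualify a local span ID with the machine that produced it.
        let local = |span| GlobalSpanId { machine: log.machine, span };
        edges.extend(log.local_edges.iter().map(|&(p, c)| (local(p), local(c))));
        edges.extend(log.remote_edges.iter().map(|&(p, c)| (p, local(c))));
    }
    edges
}
```

Two machines can both hand out local span 0 without colliding, because the machine ID is only combined with the span ID when logs are aggregated.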

MingweiSamuel, Oct 31 '25 20:10