Spans and flame graphs

Open emilk opened this issue 1 year ago • 5 comments

Currently all our data is associated with a single instant in time: they are events.

There are, however, many things that require data to span a time range, such as audio and video.

Spans are also useful for flame graphs, which are a way to visualize a call graph:

[image: flame graph]

Such a flame graph is useful for profiling, but also for observability, i.e. understanding how the pieces of a system are connected.

Implementation

An easy way to implement this is to use a special enum Span { Begin, End } component.
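A minimal sketch of how such Begin/End markers could be paired back into spans, assuming each marker is logged with a time on a single entity. The names SpanMarker and pair_markers are hypothetical and not part of any existing rerun API.

```python
from enum import Enum
from typing import Optional


class SpanMarker(Enum):
    """Hypothetical stand-in for the proposed `Span { Begin, End }` component."""

    BEGIN = "begin"
    END = "end"


def pair_markers(
    markers: list[tuple[int, SpanMarker]],
) -> list[tuple[Optional[int], Optional[int]]]:
    """Turn time-sorted (time, marker) pairs into (start, end) spans.

    Nested spans are matched with a stack. A missing start or end is
    reported as None, which gives a half-open span.
    """
    open_starts: list[int] = []
    spans: list[tuple[Optional[int], Optional[int]]] = []
    for time, marker in markers:
        if marker is SpanMarker.BEGIN:
            open_starts.append(time)
        elif open_starts:
            # An End closes the most recently opened Begin.
            spans.append((open_starts.pop(), time))
        else:
            # An End without a recorded Begin.
            spans.append((None, time))
    # Begins that never saw an End are still open.
    spans.extend((start, None) for start in open_starts)
    return spans
```

Spans come out in the order they are closed, so inner spans appear before the spans that enclose them.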

We then have a special time range query that is aware of spans (i.e. querying for data in the time range [#10, #20] must include spans that cover [#0, #30]).
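The span-aware query boils down to an interval overlap test. The sketch below assumes spans are already available as closed (start, end) ranges; Span and spans_in_range are illustrative names only, not an existing query API.

```python
from typing import NamedTuple


class Span(NamedTuple):
    name: str
    start: int
    end: int


def spans_in_range(spans, query_start, query_end):
    """Return the spans that overlap the closed range [query_start, query_end]."""
    hits = []
    for span in spans:
        # Two closed ranges overlap if each one starts before the other ends.
        if span.start <= query_end and span.end >= query_start:
            hits.append(span)
    return hits


spans = [
    Span("outer", 0, 30),   # fully encloses the query range
    Span("early", 2, 5),    # ends before the query range
    Span("inner", 12, 18),  # fully inside the query range
    Span("late", 25, 40),   # starts after the query range
]
print([s.name for s in spans_in_range(spans, 10, 20)])  # ['outer', 'inner']
```

A query that only compared start times against [#10, #20] would miss "outer", which is the case the special query has to handle.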

Together with a special Flame Graph Space View we have a pretty good start.

Threads and processes

For multi-threaded or multi-process data we must have one flame graph per thread:

[image: puffin_egui]

We should also be able to record relationships between threads. For instance, we want to be able to see that thread A is blocked waiting on thread B and C (see also https://github.com/EmbarkStudios/puffin/issues/174).

This also means log events should come with a ProcessId and ThreadId component.
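As a rough illustration, assuming a log event is a plain Python object, the process and thread identity could be captured at log time and then used to split events into one lane per thread. LogEvent and group_by_thread are hypothetical names for this sketch.

```python
import os
import threading
from dataclasses import dataclass, field


@dataclass
class LogEvent:
    """Hypothetical log event carrying the proposed ProcessId/ThreadId components."""

    message: str
    # Captured when the event is created, i.e. on the logging thread.
    process_id: int = field(default_factory=os.getpid)
    thread_id: int = field(default_factory=threading.get_ident)


def group_by_thread(events):
    """Group events by (process_id, thread_id): one bucket per flame graph."""
    lanes = {}
    for event in events:
        lanes.setdefault((event.process_id, event.thread_id), []).append(event)
    return lanes
```

Note that Python may recycle thread identifiers after a thread exits, so a real implementation would likely need something more stable than threading.get_ident().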

API

For Python, a with block makes sense, as does a function decorator:

@rr.span
def my_function(images):
    for image in images:
        with rr.span(f"image {image.name}"):
            process(image)

…with optional recording argument

In Rust and C++ we would need to use macros, similar to e.g. puffin and loguru.

See also

  • https://github.com/rerun-io/rerun/issues/2852
  • https://github.com/rerun-io/rerun/issues/2963
  • https://github.com/rerun-io/rerun/issues/4622

emilk, Jan 02 '24 11:01

It probably makes a lot of sense to both take inspiration from OpenTelemetry and make sure we're ultimately compatible with it. In this case it's worth looking at the tracing package: https://opentelemetry-python.readthedocs.io/en/latest/api/trace.html

nikolausWest, Jan 02 '24 12:01

> Implementation
>
> An easy way to implement this is to use a special enum Span { Begin, End } component.
>
> We then have a special time range query that is aware of spans (i.e. querying for data in the time range [#10, #20] must include spans that cover [#0, #30]).
>
> Together with a special Flame Graph Space View we have a pretty good start.

I'm not sure I understand why we need to introduce a new/special component for this?

Since the timepoint we have today is effectively a start timepoint, an alternative implementation I had in mind was to introduce a second, optional timepoint for every log event, which specifies the end timepoint of the event (which therefore becomes a span rather than an event at this point). If the end timepoint isn't specified, then we only look at the start timepoint and consider the event to be instantaneous, as we do today. Otherwise it's a span.

This allows spans to cover different time units quite naturally (e.g. "this event spanned 278ms wall-clock time (log_time), 90 simulation ticks (sim_tick) and was instantaneous on the frame timeline (frame_nr)"). Then I don't think we need to change anything query-wise? Haven't thought about it enough to be sure though.
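To make the idea concrete, here is a small sketch in which every event has a start timepoint and an optional end timepoint per timeline. TimedEvent is a hypothetical type, and the nanosecond unit for log_time is an assumption of the example.

```python
from dataclasses import dataclass, field


@dataclass
class TimedEvent:
    """Hypothetical log event: a start timepoint plus an optional end timepoint."""

    start: dict[str, int]  # timeline name -> start time
    end: dict[str, int] = field(default_factory=dict)  # timeline name -> end time

    def duration(self, timeline: str) -> int:
        # A timeline without an end timepoint is treated as instantaneous.
        end = self.end.get(timeline, self.start[timeline])
        return end - self.start[timeline]

    def is_span(self, timeline: str) -> bool:
        return self.duration(timeline) > 0


event = TimedEvent(
    start={"log_time": 1_000_000_000, "sim_tick": 100, "frame_nr": 42},
    end={"log_time": 1_278_000_000, "sim_tick": 190},  # no end on frame_nr
)
print(event.duration("log_time"))  # 278000000 (278 ms in nanoseconds)
print(event.duration("sim_tick"))  # 90
print(event.is_span("frame_nr"))   # False
```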

teh-cmc, Jan 03 '24 07:01

That is another way of implementing it for sure, but it is quite useful to be able to distinguish an event from a span, and it is also useful to be able to express half-open spans (spans with just a start or just an end).

I envision a flame-graph-like view where log events (e.g. text and images) are shown as single points inside the span that contains them.

emilk, Jan 03 '24 10:01

Maybe we should separate these concepts as:

  • Event: data + a time point
  • Duration event: data + start and end time
  • Span: an operation (unit of work) + a start and end time.
    • A hierarchical set of spans makes up a trace
      • A trace can be visualized as a flame graph
    • Multiple events can be produced within a span
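The separation above could be modelled roughly as follows. This is only a sketch of the three concepts under the stated definitions; all type and function names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Event:
    """Data + a time point."""

    data: Any
    time: int


@dataclass
class DurationEvent:
    """Data + a start and end time."""

    data: Any
    start: int
    end: int


@dataclass
class Span:
    """An operation (unit of work) + a start and end time."""

    operation: str
    start: int
    end: int
    children: list["Span"] = field(default_factory=list)  # nested spans
    events: list[Event] = field(default_factory=list)  # events produced inside


def flame_rows(root: Span, depth: int = 0):
    """Flatten a trace into (depth, operation, start, end) rows, one per flame graph bar."""
    rows = [(depth, root.operation, root.start, root.end)]
    for child in root.children:
        rows.extend(flame_rows(child, depth + 1))
    return rows
```

Here a trace is simply the root Span together with its descendants, and flame_rows yields the bars a flame graph view would draw, with depth as the vertical position.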

nikolausWest, Jan 03 '24 14:01

Some kind of flame graph support would be great, because it would allow you to create a Common Trace Format (CTF) data loader to analyze a program's runtime behavior together with the generated output data.

mwopfner, Apr 02 '25 13:04