
We can't accurately calculate the critically blocking path for async services. (Should we add `previous_sibling_span_id` as an optional span attribute?)

Open magicmark opened this issue 1 year ago • 15 comments

When analyzing the distributed trace for a page, we often want to understand “what is responsible for making this page slow”?

Specifically – What are the critically blocking spans? This can be quite tricky…

Example

Consider the following simplified trace for a webpage:

(screenshot: simplified trace for the webpage, with concurrent sibling spans foo and bar)

We can see here that improving the performance of foo won’t immediately fix the overall TTFB since we’d still be blocked on bar.

If we want to draw a “red line” through the trace representing the critically blocking spans, it’s easy to do so here:

(screenshot: the same trace with a red line drawn through the critically blocking spans)

All fine and dandy.

Now how about this one? Good luck :)

(screenshot: a trace with many concurrent sibling foo spans under a single server span)

All of these foo spans are sibling spans. They all share the same parent (a server span of the request into services/abc).

Consider:

  • Is foo5 blocked on foo2? Or does it just look that way? Maybe it’s actually blocked on foo3, but we don’t have an annotation for the work happening in between.
  • Does foo7 block on foo4 or foo6? Or both?!
  • Can you draw the red line of the critical path of blocking spans?

In a Node.js environment, or in Python with asyncio, the above scenario is very real.

Can we solve this today?

No.

(See the edit history for this comment for my initial thoughts - I've since learned a lot about this problem!)

Tracing UIs may use a heuristic algorithm to calculate and visualize the critical path - but with the information currently encoded, this is only safe to do for synchronous services.

For async services, we don't have enough information to accurately work out the critical path where there are concurrent spans. See https://github.com/jaegertracing/jaeger-ui/issues/2371 for a minimal repro of this issue demonstrated with Jaeger.

Why is this the case?

  • We only encode dependency information between spans if they're in a parent->child relationship.
  • In contrast, we don’t* store the dependency relationships between bar, baz and qux below
def foo():
  bar_results = bar()
  baz_results = baz(bar_results)
  return qux(baz_results)
  • (*we store the time_start and time_end for each span, but that’s not enough to draw a direct relationship)
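
To make that concrete, here's a sketch (bar and baz are just hypothetical helpers, not real instrumentation) of two programs that would export identical sibling span timings even though the causal relationships are completely different:

import asyncio

# Hypothetical leaf operations; imagine each one is wrapped in its own span.
async def bar():
    await asyncio.sleep(0.1)
    return "bar result"

async def baz(arg):
    await asyncio.sleep(0.2)

# Version A: baz causally depends on bar's result.
async def version_a():
    result = await bar()                        # span bar: ~0.0s -> ~0.1s
    await baz(result)                           # span baz: ~0.1s -> ~0.3s

# Version B: bar and baz are independent; baz just happens to start at ~0.1s.
async def version_b():
    bar_task = asyncio.create_task(bar())       # span bar: ~0.0s -> ~0.1s
    await asyncio.sleep(0.1)
    baz_task = asyncio.create_task(baz(None))   # span baz: ~0.1s -> ~0.3s
    await asyncio.gather(bar_task, baz_task)

# Both versions produce the same parent/child IDs and the same start/end
# timestamps, so a backend can't tell whether baz was actually blocked on bar.
asyncio.run(version_a())
asyncio.run(version_b())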

The ask

I propose that we add something new to the OTel spec (previous_sibling_span_id?) to solve this problem. (The implementation of this is trickier - I have a few thoughts here, but wanted to get alignment on the problem itself first.)
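
To illustrate the shape of the data I'm thinking of (not a concrete design proposal), here's roughly what it could look like if you recorded it by hand today as a plain attribute - the attribute name is the hypothetical one from this issue, and a configured tracer provider is assumed:

from opentelemetry import trace

tracer = trace.get_tracer("sibling_example")

with tracer.start_as_current_span("parent"):
    with tracer.start_as_current_span("bar") as bar_span:
        pass  # ... do the bar work ...
    bar_id = format(bar_span.get_span_context().span_id, "016x")

    # baz is causally ordered after bar, so record that explicitly.
    with tracer.start_as_current_span(
        "baz", attributes={"previous_sibling_span_id": bar_id}
    ):
        pass  # ... do the baz work, which consumes bar's results ...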

Thanks!

magicmark avatar Jun 12 '24 23:06 magicmark

FWIW, assuming this functionality doesn't yet exist, I could imagine some fancy build-time sourcemap-like solution that looks at the AST and stores the ordering of relationships between sibling function calls. We would then upload this data for a web UI viewer to use and draw the red line relationships for us.

Prior art: Uploading JS sourcemaps to make stack traces from production readable.

Is anyone aware of any projects to do this, or anything else that solves this problem?

EDIT: maybe https://opentelemetry.io/docs/concepts/signals/traces/#span-links is what I want? Can we encode this in span links? Specifically, the ask would be rephrased as "can we get the SDK to auto-populate span links data?"
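
e.g. roughly like this, done by hand with the existing links API (the relationship attribute name is made up; a configured tracer provider is assumed):

from opentelemetry import trace

tracer = trace.get_tracer("sibling_links_example")

with tracer.start_as_current_span("parent"):
    with tracer.start_as_current_span("bar") as bar_span:
        pass  # ... bar work ...

    # baz declares "I started because bar finished" via a span link.
    blocked_on_bar = trace.Link(
        bar_span.get_span_context(),
        attributes={"relationship": "previous_sibling"},  # made-up attribute
    )
    with tracer.start_as_current_span("baz", links=[blocked_on_bar]):
        pass  # ... baz work ...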

Although, per https://github.com/jaegertracing/jaeger/issues/4516, it seems like even if we did that, it's unclear whether it's useful yet with existing open source viewers.

magicmark avatar Jun 12 '24 23:06 magicmark

FWIW I think this is probably more the domain of telemetry backends than OTel itself. I'm aware of at least one (ServiceNow Observability, formerly Lightstep) that has robust critical path calculations in its trace waterfall.

As far as how a backend would do it, I suppose it depends on how pedantic you want to get. In the case of many sibling spans, there is still a critical path, even if fixing it doesn't make the wall-clock time materially faster due to siblings of a similar duration. What you'd want out of that backend is also some way to calculate siblings with a similar duration and highlight them in addition to the critical path.
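
Concretely, the "similar duration" part could be as simple as a tolerance check, something like this sketch (the 10% threshold and the (name, start, end) tuples are arbitrary choices, not anything a real backend does):

# Flag "near-critical" siblings: any sibling whose duration is within some
# tolerance of the sibling that is actually on the critical path.
def near_critical_siblings(siblings, critical_sibling, tolerance=0.10):
    crit_duration = critical_sibling[2] - critical_sibling[1]
    return [
        s
        for s in siblings
        if s is not critical_sibling
        and abs((s[2] - s[1]) - crit_duration) <= tolerance * crit_duration
    ]

siblings = [("foo2", 0.0, 0.29), ("foo3", 0.0, 0.22), ("foo4", 0.0, 0.30)]
print(near_critical_siblings(siblings, siblings[2]))  # [('foo2', 0.0, 0.29)]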

cartermp avatar Jun 15 '24 04:06 cartermp

Thanks @cartermp!

Would be really curious how existing UIs calculate the critical path - the data currently provided in spans doesn't seem sufficient to do so accurately, since we don't encode causal relationships between sibling spans - only t_start and t_end, which isn't enough (hence why I'm wondering if this needs to be part of the spec itself)

magicmark avatar Jun 15 '24 05:06 magicmark

For non-async traces, critical paths are pretty easy to calculate (just walk backwards!) but yes, because there's nothing between peer spans saying that they're peers, you need to "draw" the whole trace and make that observation yourself.
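
Roughly, that walk-backwards idea looks like this (a toy sketch with a made-up Span structure, not any particular backend's implementation):

from dataclasses import dataclass, field

# Toy span model just for this sketch - not the OTel data model.
@dataclass
class Span:
    name: str
    start: float
    end: float
    children: list = field(default_factory=list)

def critical_path(span):
    """Walk backwards from span.end: whichever child finishes last before the
    cursor is what we were blocked on; recurse into it, then continue the walk
    from that child's start time. Returns every span on the critical path."""
    path = [span]
    cursor = span.end
    while True:
        blockers = [c for c in span.children if c.end <= cursor]
        if not blockers:
            break
        blocker = max(blockers, key=lambda c: c.end)
        path.extend(critical_path(blocker))
        cursor = blocker.start
    return path

# A synchronous trace: bar, baz and qux run back-to-back under foo.
foo = Span("foo", 0.0, 2.0, [
    Span("bar", 0.1, 0.4),
    Span("baz", 0.5, 1.0),
    Span("qux", 1.1, 1.9),
])
print([s.name for s in critical_path(foo)])  # ['foo', 'qux', 'baz', 'bar']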

I think the issue here is that you cannot calculate a "true and accurate" critical path. It's really that there's a critical path with a block in the middle of it made up of some sibling spans who may or may not be a part of that critical path in another trace. And practically speaking, even if they were consistently placed within a trace, it's often the case that reducing the latency of one of those operations wouldn't materially impact the total latency because a sibling of a similar duration would take its place.

My view at least is that this calls for two things: a best-effort critical path calculation and a way to visualize places within that path that are impacted by siblings such that they are not necessarily going to truly be a part of that path in other traces.

cartermp avatar Jun 15 '24 22:06 cartermp

Jaeger has critical path support; you might be interested in taking a look at the implementation:

https://github.com/jaegertracing/jaeger-ui/pull/1582

Interesting side note: systemd-analyze also has a critical path analysis; you might be interested in that implementation as well: https://www.freedesktop.org/software/systemd/man/latest/systemd-analyze.html

svrnm avatar Jun 17 '24 08:06 svrnm

@magicmark can you describe a situation where sibling spans are blocking each other?

tedsuo avatar Jun 18 '24 20:06 tedsuo

@tedsuo

@magicmark can you describe a situation where sibling spans are blocking each other?

Consider the following (synchronous) code:

@trace
def foo():
  bar()
  baz()
  qux()

Assume that foo, bar, baz and qux are all annotated to be traced -- to directly answer your question, and make sure we're on the same page, this is the base case in my mind of "blocking sibling spans"

i.e. you'd get a trace like this:

[foo..................]
   [bar]
        [baz..]
               [qux...] 

In this synchronous example, it's easy to work out the critical path and see what blocks what.

But now what about this?

@trace
async def foo():
  await asyncio.gather(bar(), baz())
  qux()

Now it starts to get a bit tricky; you'd see something like this:

[foo..................]
   [bar.......]
   [baz..]
               [qux...] 

Now imagine that bar and baz themselves are async methods with their own nested parallelism, and you get a trace like what I posted in the OP. And it becomes very difficult to understand when looking at a large app which functions are the best candidates for optimization by looking at traces.

magicmark avatar Jun 21 '24 17:06 magicmark

Sorry....I think I take this all back actually!

As has been pointed out, Jaeger shows the critical path (marked by the black line) out of the box when I download the latest version and run it locally. I wrote some sample code to model the trace in the OP, and it totally works:

(screenshot: Jaeger UI rendering the repro trace below, with the critical path marked by a black line)
repro code fwiw if anyone is interested
import asyncio

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    SimpleSpanProcessor,
    ConsoleSpanExporter,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("async_example")

async def make_fake_network_request(resource_name, length):
    with tracer.start_as_current_span(resource_name):
        await asyncio.sleep(length)

async def foo5():
    await make_fake_network_request("foo5", 0.12)

async def foo2():
    await make_fake_network_request("foo2", 0.17)
    await foo5()

async def foo6():
    await make_fake_network_request("foo6", 0.12)

async def foo3():
    await make_fake_network_request("foo3", 0.1)
    await foo6()

async def foo7():
    await make_fake_network_request("foo7", 0.3)

async def foo4():
    await make_fake_network_request("foo4", 0.2)
    await foo7()

async def foo1():
    await make_fake_network_request("foo1", 0.1)
    await asyncio.gather(foo2(), foo3(), foo4())

@tracer.start_as_current_span("main")
async def main():
    await asyncio.sleep(0.1)
    await foo1()

asyncio.run(main())

...I think... this is exactly what I'm looking for? Still not sure exactly how this works and how it accounts for the different possible interpretations of execution order (see OP), but I'll dig in more to understand.

(Thanks @GLVSKiriti!)

magicmark avatar Jun 21 '24 22:06 magicmark

That said....this is an interesting trace which I'll dig into more

(screenshot: Jaeger UI rendering the modified trace, still attributing the critical path through foo4)

I changed the code such that foo7 actually blocks on foo3, but it still shows that it's blocking on foo4 🤔

import asyncio

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    SimpleSpanProcessor,
    ConsoleSpanExporter,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("async_example")

async def make_fake_network_request(resource_name, length):
    with tracer.start_as_current_span(resource_name):
        await asyncio.sleep(length)

async def foo5():
    await make_fake_network_request("foo5", 0.12)

async def foo2():
    await make_fake_network_request("foo2", 0.17)
    await foo5()

async def foo6():
    await make_fake_network_request("foo6", 0.12)

async def foo3():
    await make_fake_network_request("foo3", 0.1)
    await asyncio.sleep(0.1)
    await foo7()


async def foo7():
    await make_fake_network_request("foo7", 0.3)

async def foo4():
    await make_fake_network_request("foo4", 0.2)
    # await foo7()

async def foo1():
    await make_fake_network_request("foo1", 0.1)
    await asyncio.gather(foo2(), foo3(), foo4())

@tracer.start_as_current_span("main")
async def main():
    await asyncio.sleep(0.1)
    await foo1()

asyncio.run(main())

magicmark avatar Jun 21 '24 22:06 magicmark

Yeah, the way most critical path systems work is that they walk backwards from the end of the trace and (sometimes via recursion) figure out what's most blocking. Uber has a good post on it here: https://www.uber.com/blog/crisp-critical-path-analysis-for-microservice-architectures/

However, as you're finding, when async spans or "interesting groups of siblings" are concerned, it's not always the same path from trace to trace and there may be some inconsistencies.

cartermp avatar Jun 21 '24 23:06 cartermp

@cartermp thanks! So I haven't fully dug into it yet, but I'm wondering if I'm just running into what I originally suspected, which is that the trace data fundamentally doesn't provide enough information to accurately work out the critical path.

Working backwards makes sense, but in an async world you effectively have to "guess" which was the blocking previous span? And sometimes we get it wrong, as seen above? (Minor discrepancies between different traces for the same page aside, this seems like a reproducible issue.)

magicmark avatar Jun 22 '24 00:06 magicmark

To confirm, this is my minimal repro:

(screenshot: Jaeger UI rendering the minimal repro, drawing the critical path through get /foo2)

This visualization is wrong ^

get /foo3 is not blocked by get /foo2

Produced by this code:

async def make_fake_network_request(resource_name, length):
    with tracer.start_as_current_span(f'get /{resource_name}'):
        await asyncio.sleep(length)

async def foo1():
    await make_fake_network_request("foo1", 0.1)
    await asyncio.sleep(0.1)
    await foo3()

async def foo2():
    await make_fake_network_request("foo2", 0.2)

async def foo3():
    await make_fake_network_request("foo3", 0.3)

@tracer.start_as_current_span("main")
async def main():
    await asyncio.gather(foo1(), foo2())

asyncio.run(main())
full code (including imports)
import asyncio

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    SimpleSpanProcessor,
    ConsoleSpanExporter,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("async_example")

async def make_fake_network_request(resource_name, length):
    with tracer.start_as_current_span(f'get /{resource_name}'):
        await asyncio.sleep(length)

async def foo1():
    await make_fake_network_request("foo1", 0.1)
    await asyncio.sleep(0.1)
    await foo3()

async def foo2():
    await make_fake_network_request("foo2", 0.2)

async def foo3():
    await make_fake_network_request("foo3", 0.3)

@tracer.start_as_current_span("main")
async def main():
    await asyncio.gather(foo1(), foo2())

asyncio.run(main())

^ It seems that when the algorithm walks backwards, it can only do so heuristically? In this case, it highlights the wrong path.
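
To illustrate the kind of heuristic I mean, here's a toy version of a backwards "who finished most recently before me" rule applied to (approximate) timings from the repro - my own sketch, not Jaeger's actual algorithm:

# Approximate (start, end) times from the repro, in seconds from t=0.
spans = {
    "get /foo1": (0.0, 0.1),
    "get /foo2": (0.0, 0.2),
    "get /foo3": (0.2, 0.5),
}

def guess_predecessor(target, spans):
    """One plausible heuristic: the sibling that finished most recently
    before `target` started."""
    start = spans[target][0]
    earlier = {n: t for n, t in spans.items() if n != target and t[1] <= start}
    return max(earlier, key=lambda n: earlier[n][1]) if earlier else None

# Prints "get /foo2", even though foo3 is really ordered after foo1 plus an
# un-instrumented asyncio.sleep - the span timings alone can't tell us that.
print(guess_predecessor("get /foo3", spans))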

Hence this GitHub issue on the spec itself - I don't think we provide enough data to do this reliably?

(I could imagine that by providing extra data in spans (specifically previous_sibling_span_id) and making the SDK do this work, this visualization would be possible.)
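
For example, a very rough sketch of "making the SDK do this work": a custom SpanProcessor that stamps each new span with the id of the sibling that most recently ended in the same execution context. The attribute name is the hypothetical one from this issue, and a real implementation would need much more care around threads, context propagation, and spans that end out of order:

import contextvars

from opentelemetry import trace
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider

# Hypothetical attribute name from this proposal - nothing in the spec today.
_PREV_SIBLING_KEY = "previous_sibling_span_id"

# Remembers the (parent_id, span_id) of the last span that ended in the
# *current* execution context, so a span started later in the same task sees
# it, but a concurrently running task (whose context was forked earlier) does not.
_last_ended = contextvars.ContextVar("last_ended_span", default=None)

class PreviousSiblingProcessor(SpanProcessor):
    def on_start(self, span, parent_context=None):
        prev = _last_ended.get()
        if prev is not None and span.parent is not None and span.parent.span_id == prev[0]:
            span.set_attribute(_PREV_SIBLING_KEY, format(prev[1], "016x"))

    def on_end(self, span):
        parent_id = span.parent.span_id if span.parent else None
        _last_ended.set((parent_id, span.get_span_context().span_id))

provider = TracerProvider()
provider.add_span_processor(PreviousSiblingProcessor())
trace.set_tracer_provider(provider)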

magicmark avatar Jun 22 '24 01:06 magicmark

Are you mostly interested in intra-process (inside a single service, no RPCs) analysis? Would profiling be a better solution here?

austinlparker avatar Jul 23 '24 20:07 austinlparker

@austinlparker yes and no - in our current real life situation, yes, we're mostly focused on the network calls emitted by one particular service.

But we might in the future also be interested in the critical path of spans throughout a whole set of asynchronous services. I do think it's reasonable to try to solve this with OTel and add previous_sibling_span to the spec.

(Profiles are kinda noisy and also don't clearly show you the critical path - e.g. async callbacks in Node don't carry stack information about when they were scheduled or who the parent was.)

magicmark avatar Jul 23 '24 21:07 magicmark

hey @magicmark, following up on this: can you provide some more detail on how the solution you envision (adding a previous_sibling_span to the spec) would solve this particular issue? Can you show, using your existing example, what would be different and how a tool like Jaeger could utilize this information?

svrnm avatar Sep 30 '24 09:09 svrnm