opentelemetry-js-contrib icon indicating copy to clipboard operation
opentelemetry-js-contrib copied to clipboard

@opentelemetry/instrumentation-express handler does not run in correct context

Open aabmass opened this issue 1 year ago • 5 comments

What version of OpenTelemetry are you using?

├── @opentelemetry/[email protected]
└─┬ @opentelemetry/[email protected]
  ├── @opentelemetry/[email protected] deduped
  └── @opentelemetry/[email protected]

What version of Node are you using?

v18.19.1

What did you do?

Run the included express sample https://github.com/open-telemetry/opentelemetry-js-contrib/tree/main/examples/express

What did you expect to see?

Child spans of the handler should be children under the request handler span

What did you see instead?

Child spans created under the request handler fall under the top level server span (GET /run_test), which I believe is created by the @opentelemetry/instrumentation-http. I.e. they appear as siblings of the server span, next to middleware. This is really confusing because you can't tell if the span belongs to a middleware or the handler.

This screenshot demonstrates the problems image

Additional context

It looks like the wrapped handler is not executed with the created span set in context:

https://github.com/open-telemetry/opentelemetry-js-contrib/blob/9df30ea0fd822a69203b818b8fbe34e1e1c8bced/plugins/node/opentelemetry-instrumentation-express/src/instrumentation.ts#L308

Maybe this is intentional for 3-arg handlers that accept a next() function, as the next layer would be running in the parent's span context. If that is considered incorrect, you could probably solve this by wrapping the next() function to restore the parent context before executing the next layer to keep everything as siblings.

aabmass avatar Mar 18 '24 17:03 aabmass

If I am understanding this issue correctly, I believe this was discussed on #1657 and we were torn on how best to approach a change here. As mentioned in this comment, the result of changing the nesting of spans significantly changes the visualization in a trace, which caused some concern. As traces get longer and spans become more nested, it becomes nearly impossible to view those spans meaningfully. For example, here is what the example app looked like in Zipkin after making a change.

zipkin current and proposed behavior

This is worth further discussion, though, so I appreciate the issue being raised. Please also let me know if I am misunderstanding the issue described here.

JamieDanielson avatar Mar 21 '24 20:03 JamieDanielson

@JamieDanielson thanks for the context. I think the naive change would create the nested "proposed" result. I agree it's not ideal and not necessarily even more correct since we should be able to flatten out the callback hell.

Maybe we could hack around this by explicitly setting the parent span to the one created by @opentelemetry/instrumentation-http to break the nesting.

Do you know if/how we already solve this for other callback (non-promise) APIs like many of node's internal APIs that we instrument? I know we have de-coupled OTel context from the AsyncLocalStorage impl but it is used in Node in practice–I'm not an expert here but I think AsyncResources is supposed to let async frameworks manage context correctly (see here). This old issue may be relevant https://github.com/expressjs/express/issues/4396 too.

aabmass avatar Mar 26 '24 15:03 aabmass

Going to use this issue to collect some thoughts and throw some ideas around wrt how the express instrumentation currently handles span hierarchy and context propagation. I'll probably be at the SIG tomorrow if others want to maybe have a discussion about this and figure out the best path forward.

Request handlers

Request handlers are currently not instrumented properly and always produce bogus spans of little value. I don't know what the historical reason for that is but the current status is pretty awful. They also don't report errors at all at the moment.

I think in the case of request handlers we definitely want to propagate the context. It would be a big change in how traces are displayed but it wouldn't dramatically increase nesting if it stays a direct child span of the root express server span, and overall I really don't think anyone would find this makes anything less clear.

Middleware

They're properly instrumented for the most part, but also don't propagate context. I'd argue we'd also want them to do that, so that for instance an authentication middleware that queries a database actually creates that query span as a child of the middleware span. Since we agreed nesting middleware renders things unreadable at scale, I like the proposed solution of patching next to restore the parent context. We'd mostly keep the current way traces are displayed with a readability boost for complex middleware that call into other stuff.

Router spans

Similarly to route handlers spans these are bogus. Do we want the route handler to be a child of the router or keep things as is ? Don't really have an opinion here apart from "we should actually measure their real duration at least".

raphael-theriault-swi avatar Mar 26 '24 16:03 raphael-theriault-swi

I see this was in the SIG agenda but I had a conflict. Was there any resolution?

aabmass avatar Mar 27 '24 16:03 aabmass

We agreed that context should probably be propagated for both but need to make sure it actually wouldn't break existing uses. I'll probably implement the proposed changes to have a way to compare both in practice too.

raphael-theriault-swi avatar Mar 28 '24 20:03 raphael-theriault-swi