aspnetcore Blazor Server / SignalR / WebSockets takes distributed tracing to strange new places

Blazor Server / SignalR / WebSockets takes distributed tracing to strange new places

Open rynowak opened this issue 3 years ago • 25 comments

Describe the bug

The long-lived circuits of Blazor server make distributed tracing not work as expected.

Since each circuit is effectively a long-lived request ... a lot of activity (pun intended) ends up getting traced under the same activity. Basically all outgoing requests/traces from inside of a Blazor server circuit (browser tab) will have the same parent span, which causes them to all get grouped together as a single "root cause" in tracing UIs.

To Reproduce

Create a Blazor server application. Use a button to fire off HTTP requests to another ASP.NET Core endpoint. Hook all of this up to your favorite distributed tracing system and enjoy.

All of the operations you start from a Blazor circuit will end inside the same single logical operation from a tracing POV.

Feb 02 '21 05:02 rynowak

/cc @davidfowl @noahfalk

Feb 02 '21 05:02 rynowak

@rynowak thanks for reporting for duty.

I imagine this is not only true for Blazor but for Websockets/SignalR in general.

Is there something we can do here to "disconnect" the distributed tracing?

Is it really correct if we somehow disconnect the Blazor Server session from the original request that started it?

Feb 02 '21 08:02 javiercn

Thanks for contacting us. We're moving this issue to the Next sprint planning milestone for future evaluation / consideration. We will evaluate the request when we are planning the work for the next milestone. To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

Feb 02 '21 08:02 ghost

Is there something we can do here to "disconnect" the distributed tracing?

I'm currently working around this by setting Activity.Current = null, and it seems to do the trick. I would imagine that a framework level fix would create a new root activity for each operation that comes through the circuit handler. You can use tags/baggage to attach things like a circuit ID so that correlation is possible.

So it is can be controlled ....

Is it really correct if we somehow disconnect the Blazor Server session from the original request that started it?

That's a great question 😁

I think the major downside of the current behavior is that a really really long-lived circuit could include thousands of descendant spans. Think about @danroth27 giving the counter demo for 10 hours as part of a charity livestream. In the worst case this will break downstream tooling, or in the best case just give you data that's hard to work with. I'm really not sure.

I might also be the first person to notice this, so I figured it was worth a discussion 🤷

Feb 02 '21 16:02 rynowak

@rynowak you bring up good points.

I'm not super familiar with the activity id APIs, but In general I believe the fix here is to "detach it" from the original request activity and then to start a new one per received message on the circuit as you suggest, and then tag it with the circuit info

Feb 02 '21 16:02 javiercn

cc @tarekgh

Thanks for raising this @rynowak. I'll try to take a little time in the next day or two to repro this and understand where the Activities are currently being created so I have more context. I may also find alternative suggestions on how the issue can be resolved. I imagine as .NET developers use distributed tracing more frequently it will behoove us to have good guidance and understanding of common patterns that show up.

You can use tags/baggage to attach things like a circuit ID so that correlation is possible

Links are a new concept added in the 5.0 release of DiagnosticSource.dll that may also be useful here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/overview.md#links-between-spans

Feb 03 '21 01:02 noahfalk

Can we get a debugger dump which we can see where the Activity objects get created. That may be faster I guess if possible. I am guessing there is some activities started and never stopped.

Feb 03 '21 01:02 tarekgh

It's an issue we need to address in signalr and then we can have a broader discussion about what behavior we want in general.

Feb 03 '21 02:02 davidfowl

I didn't have any background on how the Blazor app front-end and back-end communicate, but given @davidfowl's comment + a little debugging it appears to be SignalR protocol over WebSocket, is that accurate?

In repro app there was a short series of paired Activity start/stops but then 8th Activity start is when Kestrel receives the HTTP request that initiates the WebSocket and that one is long lived by design. Additional activity start/stop pairs get recorded re-entrantly as the server sends messages to itself when the button is clicked. In terms of what to do about it...

We probably want to define how a W3C trace id should be embedded in a SignalR request and then implement SignalR client and server code to transmit/receive those ids. This would allow us to have traces that start at the front-end and track fine-grained interactions such as clicking a button, similar to what is already possible using more traditional HTTP request/response patterns.
We need to figure out what happens if multiple layers of a protocol stack are sending different trace ids. I haven't seen prior art on this but maybe I haven't looked in the right places yet. I think we have a few options:

Avoid the issue by not sending a trace id on the HTTP GET that initiates the WebSocket
On the server when the nested SignalR protocol provides a new trace-id we could clear the HTTP level Activity and start the new one. Ideally we'd also restore the HTTP Activity after responding. There is a bug doing that today but I assume it will get fixed or we could work around it by manually setting Activity.Current. There may also be some value in adding a link between the HTTP level activity and the SignalR one.

Feb 04 '21 09:02 noahfalk

[Blazor Server] appears to be SignalR protocol over WebSocket, is that accurate?

Yup

Feb 04 '21 17:02 danroth27

On the server when the nested SignalR protocol provides a new trace-id we could clear the HTTP level Activity and start the new one. Ideally we'd also restore the HTTP Activity after responding. There is a bug doing that today but I assume it will get fixed or we could work around it by manually setting Activity.Current. There may also be some value in adding a link between the HTTP level activity and the SignalR one.

Related to https://github.com/dotnet/aspnetcore/issues/18711

cc @BrennanConroy

Feb 05 '21 01:02 davidfowl

I'm gonna take a stab at this. I think it's more than the activity though that's related here. I don't think we should be propagating the execution context from the request to the hub execution. It's especially broken for long polling scenarios where there are multiple requests throughout the execution of the hub.

Feb 08 '21 13:02 davidfowl

Ryan wrote:

I might also be the first person to notice this, so I figured it was worth a discussion 🤷

I do remember someone else mentioning this in the context of SignalR a while back, I'm not having any luck finding the issue again unfortunately.

Feb 10 '21 17:02 ericsampson

I'm not sure how useful it's going to be to have a separate activity ID for every message flowing across the websocket connection. The traces will then just say things like "Within circuit abcdef, following receipt of event 4728, we made HTTP requests to the following URLs...". Does the "event 4728" information help the person reading the trace?

So far it hasn't been a design goal to capture more information about each user activity within a Blazor Server circuit. If this was an area of growing customer demand, we could start pulling more information out of the rendertree and navigation manager so (for example) we could say things like "Within circuit abcdef, while on URL /counter, in reaction to a click event on a button element with class btn counter rendered by CounterComponent, we made HTTP requests to the following URLs...". But I don't personally know if or how this sort of detail can be communicated usefully to an upstream tracing technology. Also we'd have to use something like async locals to correlate actions to their causal events.

Another argument would be that tracing technologies could offer more value by moving beyond thinking in terms of short HTTP round trips and have first-class support for long-running connections (WebSockets in general, not just Blazor Server). I know that would only address the UX concern about how info is presented in the tracing UI, so it's still applicable to think about how Blazor Server can provide more human-readable context information as per the previous paragraph.

Feb 19 '21 13:02 SteveSandersonMS

So far it hasn't been a design goal to capture more information about each user activity within a Blazor Server circuit

I think its fine to decide that this isn't a priority for you right now if you aren't seeing enough signal that customers care. My guess is that customers will care about this in proportion to how often apps using SignalR are sluggish and how easy it is to correlate actions on one machine to specific code that runs on another tier. For example if there is always a nice 1:1 correspondence between pushing a button in a Blazor front-end and OnButtonClicked() method running on a single back-end + all the expensive work happens inside OnButtonClicked() then the value of instrumenting the SignalR protocol is fairly low. On the other hand if those SignalR communications are bouncing across relay servers and lots of different code on different machines is being invoked in response then there are numerous places where performance issues could creep in. It would be very hard to track it down without timestamps and causal links on the messages being sent and received at each tier.

If we don't want to do the work of sending identifiers on each message I do think there are lower level amounts of work that would still be quite helpful:

Starting a trace at the point each request is processed on the back-end. This is often the level of support that is present for web clients and servers. The operator of the service never gets any information about what happens client side but they do see that a request to a specific endpoint was generated and they can track that part of the work. If SignalR is only used in client->server communications and not in server<->server communications this could be a pretty comfortable place to be.
Suppress generating a trace-id that wraps the entire WebSocket connection. In theory a sophisticated tracing tool could break it all apart and extract useful value, but in practice improvements to long circuit scenarios are unlikely to a priority at the tooling end for a long time to come. Avoiding emitting a trace id on the web-socket will mean that tracing tools create unique IDs on the server for small chunks of work like database requests or further HTTP messages. This allows users to at least diagnose performance issues in the downstream components like databases and other micro-services even if they can't correlate it back to specific UI actions or incoming server requests.

But I don't personally know if or how this sort of detail can be communicated usefully to an upstream tracing technology

For a server<->server SignalR communication many of the supporting pieces are already in place and I could walk you through what remains. For client<->server we presumably also need one of the tracing uploaders like OpenTelemetry or ApplicationInsights to support running in the blazor client environment. I'm happy to get the right people in touch but I suspect its more a question or priorities at this point.

"Within circuit abcdef, following receipt of event 4728, we made HTTP requests to the following URLs...". Does the "event 4728" information help the person reading the trace?

It helps if there is instrumentation on the other end of the communication to tie it to. For server<->server usage there would probably be some of that instrumentation in place. For Blazor client<->server it would be there only if you added it (which isn't hard) and a telemetry uploader supported transmitting it (this part is likely a bit more work).

Hope that helps? I'm glad to chat online if that is easier, I know there is a lot here : )

Feb 20 '21 09:02 noahfalk

We're having the same problem.

Setup:

Blazor server side app, .NET 5
Public app-facing API, .NET Core.
Several internal APIs, .NET Core.

Example:

Blazor app calls public API several times on startup, like authentication and GET /config.
Blazor app then calls GET /Appointments.
User may then POST /Appointments or other user actions.

If we are interested in debugging POST /Appointments in Azure Application Insights end-to-end transaction, then we see ALL distributed tracing for ALL Blazor requests to the public APIs, including all internal API calls, SQL commands, Redis requests etc, for each and every Blazor http request since the user visited the web app.

This becomes extremely verbose and makes debugging even for short user sessions with Application Insights end-to-end transaction impractical. I would expect to be able to explicitly start activities in Blazor to logically group related Blazor requests and to opt-out of the Blazor circuit's root activity.

A workaround we are experimenting with right now is setting Activity.Current = null in several key places, such as upon creating a new circuit and upon resolving specific early scoped dependencies (per circuit).

Mar 24 '21 09:03 angularsen

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to have better expectation regarding different types of issues you can read our Triage Process.

Jul 20 '21 17:07 ghost

Workarounds are now listed here: https://github.com/dotnet/runtime/issues/41072#issuecomment-879395715

Jul 20 '21 23:07 angularsen

Are those all documented?

Jul 21 '21 12:07 ericsampson

Thanks for contacting us.

We're moving this issue to the .NET 7 Planning milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue, during our next planning meeting(s). If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues. To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

Nov 23 '21 19:11 ghost

I cannot tell if there is some way to hook in Activity/ActivitySource (OpenTelemetry) into SignalR/Blazor server interactions.

There seems to be no way to hook into the pipeline for the websocket to see interactions no matter how noisy they may be. There is a desire to see tracing when a user clicks on a button and what outgoing HTTP, DB interactions, etc. happen just like an a regular HTTP API.

I recognize the problem isn't as easy as a lot is going on on the websocket. I was just hoping this problem was being worked on. This seems like the more relevant issue to follow. Thanks!

Feb 15 '22 15:02 adamhathcock

What is the recommended interim solution for websockets tracing to avoid this issue? Right now, entire connection and messages communicated are traced under one trace id and I can't find a proper way to break that in to smaller spans.

Jul 14 '22 13:07 hadaavidi

https://github.com/dotnet/aspnetcore/issues/29846#issuecomment-771775926

You can create individual spans for each websocket message by making your own activity for them.

Jul 14 '22 13:07 davidfowl

Excuse my limited knowledge on this.

For suggested solution above, do we have to indicate Activity.Current = Null after initial websocket connection established?

After initial connection is established for websockets, how can we create new activity for each message? Those subsequent messages won't go through middleware pipeline, right?

Jul 14 '22 13:07 hadaavidi

It stinks I can't add some kind of hook to instrument things. Blazor Server blurs the lines a lot so knowing where the appropriate spot can be tricky but I'll go down this road for now.

If anyone has any guidance to share on instrumenting Blazor Server, I'd be interested!

Sep 21 '22 07:09 adamhathcock

I just want to add that the behavior is different when using Azure SignalR Service. Every invocation of a request on the server will result in a new trace. It makes it a bit more confusing when the tracing will differ on the developer's machine (not using AzSignalR) vs. in production (using AzSignalR).

Sep 27 '22 08:09 SplitThePotCyrus

We've discussed this within the team today and have concluded that the best course of action for now will be to document how this can be achieved. For that:

@davidfowl will follow up with @BrennanConroy to confirm where exactly in SignalR to configure / set the activity ID.
@MackinnonBuck will identify how exactly to configure the activity ID in Blazor Server.

Dec 02 '22 21:12 mkArtakMSFT

Jun 29 '23 22:06 ghost

Is there any progress on this? Are there any docs how to use tracing with a blazor server app? Regarding e.g.

A workaround we are experimenting with right now is setting Activity.Current = null in several key places, such as upon creating a new circuit and upon resolving specific early scoped dependencies (per circuit).

How to get these workarounds right? Where to set Activity.Current?

Wouldn't it be pragmatic not to create a new Activity for every WebSocket message but rather for every navigation using NavigationManager? I guess this is somewhat opinionated so it might not be a framework-wide solution, but my gut feeling is that it would be better than having a single Activity per Circuit lifetime:

It's less likely that Spans live for hours, because users who use an app for a longer period of time might still navigate to another page.
It's easy to understand what the Span lifetime means.
It's somewhat similar to how it would work with a traditional request-based webapp ("one Span per page visit").

I think it'd be great to have some docs regarding Blazor Server and tracing, maybe with examples how to

disable the Circuit-long Span altogether (I guess that's what Activity.Current = null in the right place achieves)
end the current Span and start a new one based on custom logic, e.g. in NavigationManager.LocationChanged
- bonus: add the circuit id as a tag to the new span so that it's still possible to correlate

Aug 24 '23 13:08 georg-jung

+1 Would really appreciate some guidance here in the meantime. It's a real customer problem today.

Have some long running Blazor Server dashboard pages, and my Application Insights is a mess. One 'Operation Id' is overloaded with so much it's hard to find what I'm looking for in there. As the API see's it, these really are separate operations but the Blazor app insists on stuffing the same traceid header onto every one.

If the default behaviour is not as I expect, I'd like to know how I can conditionally control this on the Blazor Server side. For example something I could do with the injected IHttpClientFactory or HttpClient itself to disconnect it from the ambient 'activity'.

Sep 19 '23 11:09 vivaladan

aspnetcore aspnetcore copied to clipboard

Blazor Server / SignalR / WebSockets takes distributed tracing to strange new places

Describe the bug

To Reproduce

aspnetcore
aspnetcore copied to clipboard