apm icon indicating copy to clipboard operation
apm copied to clipboard

Traceparent header from outside Elastic

Open gregkalapos opened this issue 4 years ago • 25 comments

Problem statement

In the W3C TraceContext specification it's possible that a system is monitored by multiple tracing tools.

Example: A -> B -> C (Service A calls Service B which calls Service C).

Let's say in this case B is monitored by an Elastic APM Agent, but A isn't - let's say A is monitored by another tool which also implements TraceContext. In this case the other tool will create a traceparent header with a traceid which is outside Elastic and a parentid which is from a non-Elastic tool. In B the Elastic APM agent will read this, create a transaction, set the traceid and the parentid based on the traceparent header. Problem is, in Kibana, under traces, only traces are listed where the 1. transaction has a parentid which is null (feel free to correct me, but from what I see that's how it works now), therefore in such a scenario those traces won't show up under traces in Kibana.

Related, recent issue from .NET

Reason I open this issue is that in .NET the situation is more-common, you don't even need another tool for this to happen: In ASP.NET Core, since version 3, there is built-in trace propagation support (details here). In short: the framework itself creates a traceid and it propagates it through HTTP calls. By default it uses a header called Request-Id, which is ok, but it has opt-in support for TraceContext and users can turn on context propagation through the traceparent header. That way in our previous example if A is not monitored at all, but opted-in and B is monitored by Elastic APM, we have the same problem.

Questions

  • Do other agents potentially have an issue with this? From my understanding there is no way to know who set's the traceparent header, it could be easily another tool (that's in fact one of the motivation of the standard), or in case of .NET, the framework itself.
  • I feel the key problem here is that the definition of a trace in Kibana is that it starts with a transaction where the parentid is null. Do others also feel, this is something which is not compatible with the TraceContext standard and we maybe should reiterate this?
  • Related problem is when let's say B isn't monitored by Elastic APM, but A and C are - the parent id in C will point to something which is outside Elastic. Is this an issue for our UI?

gregkalapos avatar Jun 26 '20 18:06 gregkalapos

Reason I open this issue is that in .NET the situation is more-common, you don't even need another tool for this to happen: In ASP.NET Core, since version 3, there is built-in trace propagation support (details here). In short: the framework itself creates a traceid and it propagates it through HTTP calls. By default it uses a header called Request-Id, which is ok, but it has opt-in support for TraceContext and users can turn on context propagation through the traceparent header. That way in our previous example if A is not monitored at all, but opted-in and B is monitored by Elastic APM, we have the same problem.

Looking at the agent code a bit, it seems that when starting a Transaction we're creating a new Activity and setting the ASP.NET Core-created Activity as its parent. Is that right? It seems to me that the Activity created by ASP.NET Core should be used as the basis for Transactions, rather than as the parent.

i.e. if there's an Activity, don't create one, just use it: set the transaction ID to Activity.Current.SpanId, parent ID to Activity.Current.ParentId, trace ID to Activity.Current.TraceId. Also, if Activity.Parent == null (root activity), make a sampling decision and update the trace context flags.

Related problem is when let's say B isn't monitored by Elastic APM, but A and C are - the parent id in C will point to something which is outside Elastic. Is this an issue for our UI?

UI devs please correct me if I'm wrong, but I'm pretty sure this is not an issue. We don't currently require complete lineage in trace events; we just use parent/child relationships to set bounds on timestamps (e.g. child can't come before parent), and otherwise we show all events with the same trace ID. When "B" is missing, it'll just be a gap in the timeline.

The root transaction is a different matter - that one is treated specially, as you point out.

axw avatar Jun 29 '20 03:06 axw

@axw yes, my PR is going the direction you suggest here - I mostly agree with those points. I'll tag you and elaborate on it, for .NET we can solve most of the problems that can occur by that approach. More about this in https://github.com/elastic/apm-agent-dotnet/pull/888 - I don't want to confuse others by .NET here.

Nevertheless the problem when A is being monitored by another tool which sets traceparent and B being monitored by an Elastic APM agent, I think we still have a real problem - which is that the parentid points to something that is not in Elastic APM. (promise last note here on .NET) Furthermore even with my PR and your suggestion, it can still happen that we set transaction.parentid to an id which points to something which is outside Elastic causing the problem that on the UI under traces the trace won't show up.

gregkalapos avatar Jun 29 '20 09:06 gregkalapos

What has been written about the UI so far is true. One additional impact is that the View Full Trace on the transaction view works like the traces list page - it is only functional when a root trace exists per https://github.com/elastic/kibana/issues/33924. I don't think excluding trace root from the traces list is generally as a big of an issue, or easily solved, as the one with view full trace

graphaelli avatar Jun 29 '20 20:06 graphaelli

@gregkalapos sure, will follow up on the PR. I just wanted to understand how much this was a .NET issue vs. a more general one.

IMO, we should not significantly change our UI or solution to accommodate the trace root being monitored by something other than Elastic APM. What would be the alternative? No trace grouping at all? This has further ramifications, as we will rely upon trace grouping (i.e. grouping by trace root) for tail-based trace sampling. The crux of it is that for that features, we need a way of grouping/classifying traces.

I would suggest it's just a consequence users would have to live with if they want to monitor the root with something other than Elastic APM -- reduced features. Traces won't show up in the "traces" list, they won't be able to use tail-based sampling, and anything else that relies upon having an identifier for traces.

(Just my opinion, not trying to push anything here; I'd be interested to hear others' takes.)

axw avatar Jun 30 '20 02:06 axw

There are also situations that look like the trace root is in another system whereas the root is just coming in late or is lost. Having the non-root traces appear in the trace view could be confusing and may significantly increase the number of entries in the list.

I would suggest it's just a consequence users would have to live with if they want to monitor the root with something other than Elastic APM -- reduced features.

I see it similarly. I'm not saying we shouldn't strive to have great integration with other tracing systems but there will always a limit of how much you can do if you have traces in different systems. Although those traces wouldn't appear in the trace view, they do appear in the transactions view of the service.

felixbarny avatar Jun 30 '20 07:06 felixbarny

We discussed this in the last agent weekly;

Summary

There seemed to be an agreement that this is still an important issue, it’s realistic to assume that a big system is monitored by multiple products - and also cases are realistic where the 1. service is monitored by a different tool. Also there can be cases where A, B, and C are services from different companies calling each other - in all those cases the traces list which is at a fairly prominent place on the UI would potentially not contain those traces. All these scenarios were considered in W3C TraceContext and the issue is that we cannot find the root of the trace.

We agreed to not close this issue yet.

Options discussed:

  • Follow transactions based on parentid with the same traceid until the parentid points to something which is not in Elasticsearch (that’d be the other service monitored by the other product) and mark that transaction as trace root.
    • This is impossible to query efficiently in ES (no left joins etc)
    • Could be done as a background job
  • Use the tracestate header (part of W3C TraceContext) with that we’d know which one is the first service monitored by Elastic APM. That way we could just select the 1. transaction marked as “1. in Elastic APM” as trace root (and maybe also show that part of it is in another tool).
  • Force agent with a setting to always start a new trace, so its 1. transaction will be the root of the trace.
  • Somehow provide a way of saying that all transactions of Service X should be in the traces list as well.
    • Or offer some way to mark transactions as trace root
    • One potential issue here is that the deployment can change and if suddenly Elastic APM also monitors a service that was previously not monitored by Elastic, then such settings could potentially be outdated and need to be adapted.

gregkalapos avatar Jul 02 '20 10:07 gregkalapos

We've just run into this problem when switching CDN providers. During testing, I noticed none of the requests going through the CDN are getting sampled. It turned out BunnyCDN automatically adds a traceparent header to the requests, which the agent (node.js in this case) reads and follows the "not sampled" flag.

This seems rather problematic when you consider the header may be set even by end-users. A malicious user could just set the header and basically disappear from most of the analytics. The opposite scenario is also a problem and even mentioned in the spec, if a user can force the request to be sampled by setting the flag in the header, it can lead to DoS.

MartinKolarik avatar Aug 26 '21 18:08 MartinKolarik

I think a relatively easy fix for that would be to introduce a flag that allows to disregard the sampling decision from the incoming traceparent header. This would have to be set on the edge services. I don't think agents can and should auto-detect when to discard the sampling decision. Another question is whether the trace-id should be re-used in that case or not. It may be helpful for log correlation purposes. OTOH, it allows a malicious user to forge trace-id collisions. Ultimately, it's a tradeoff the user has to make. I think it makes sense for our agents to offer multiple sampling strategies.

felixbarny avatar Aug 30 '21 14:08 felixbarny

@felixbarny that would be an unfortunate solution as it would allow you to either have the problems mentioned above or not have distributed tracing - we do rely on the header in our services, actually, so turning it off completely is not a good solution. It would work if there was a clean separation of "edge" and "internal" services but not when you have services that are public, but may also call each other.

One of the ideas I had aftering thinking about this a little was using some kind of authentication within the requests. E.g.:

  • a shared secret is set in the participating services and a HMAC of the header is sent in the request, or the shared secret is sent directly in another header,
  • agents can be configured to disregard the trace parent header if the HMAC/secret doesn't match.

The secret may be set via the standard agent configuration channels including central config from Kibana. This way you can easily accept the headers from trusted services and ignore everything else.

MartinKolarik avatar Aug 30 '21 18:08 MartinKolarik

In the .NET Agent we made some progress on this issue - the problem @MartinKolarik describes sounds very similar to the problem we had in .NET. In that case .NET services without an APM Agent added the traceparent header with sampled=false (similar to BunnyCDN) resulting in unsampled transactions when downstream service with agent receives such calls.

We came up with a setting which is already released in the .NET Agent. Here is the doc of the setting. This is the PR.

Would that be something worth proposing for other agents as well?

gregkalapos avatar Aug 30 '21 21:08 gregkalapos

We came up with a setting which is already released in the .NET Agent. Here is the doc of the setting. This is the PR.

I believe it makes sense to have that as one of the options. It does solve the problem of not sampling things that you want to sample in an easy way. It doesn't solve the opposite DoS scenario though.

MartinKolarik avatar Aug 30 '21 21:08 MartinKolarik

@felixbarny that would be an unfortunate solution as it would allow you to either have the problems mentioned above or not have distributed tracing - we do rely on the header in our services, actually, so turning it off completely is not a good solution.

What I was proposing is to offer multiple sampling strategies. On the services that are on the entry of your system, you would configure the agent to ignore the sampling decision from the incoming traceparent header. All other services are configured to respect the sampling flag or fall back to probabilistic sampling when they initiate a trace (for example for background jobs).

What can complicate things is when there's a web frontend that initiates the trace. In that case, the backend server will likely want to respect the sampling decision of the RUM agent. But that leaves the door open for abuse. For similar reasons, a shared secret does not work in the case a RUM agent is involved: it would also need to know the shared secret which means it will be leaked to any client. If no RUM agent is involved and you're just providing HTTP APIs, for example, I think a better way than using a shared secret is to configure the agents on the entry services to disregard the sampling flag.

felixbarny avatar Aug 31 '21 08:08 felixbarny

On the services that are on the entry of your system, you would configure the agent to ignore the sampling decision from the incoming traceparent header.

The problem is this assumes a clear separation of entry services and the rest. It doesn't work in the following (and imo fairly common) case: you have services A and B, which are both public ("entry") but A also calls B.

MartinKolarik avatar Aug 31 '21 10:08 MartinKolarik

That's right, good point. In your infra, are all public requests routed trough a dedicated load balancer? If so, could you just drop the traceparent header there?

felixbarny avatar Aug 31 '21 11:08 felixbarny

No, in this case, the LB is the CDN (Bunny) and behind that is a heroku-like cloud platform. End users and our own other services both access it via the CDN.

MartinKolarik avatar Aug 31 '21 12:08 MartinKolarik

Seems there are lots of different use cases that depend on the specifics of each particular setup. There's no one-size-fits-all solution.

The different dimensions in determining whether to continue a trace or to resample are:

  • Is a request coming from a service that's monitored with another Elastic APM Agent (proxy for internal request) or from somewhere else?
  • Should the trace be restarted or continued (generate a new trace-id vs reusing the one from the incoming traceparent header)?
  • Should the sampled flag be ignored (resample the trace)?

To accommodate for all of these dimensions, we could introduce a trace_continuation_strategy configuration option.

  • continue_always: the current behavior - every incoming traceparent header determines the sampling decision and the trace-id. This will continue to be the default behavior.
  • restart_always: always ignores the traceparent header of incoming requests. A new trace-id will be generated and the sampling decision will be made based on transaction_sample_rate.
  • restart_non_elastic: ignores the traceparent header if the tracecontext header doesn't contain a es vendor entry. Security-wise, this is not fully bullet proof as a malicious user could still forge a request that looks like it's coming from another Elastic agent. But it avoids unwanted trace continuation of non-malicious clients.
  • resample_always: always ignores the sampling flag of an incoming traceparent header. The trace will continue to use the trace-id from the traceparent header but the sampling decision will be re-evaluated (based on transaction_sample_rate).
  • resample_non_elastic: ignores the sampling flag incoming traceparent headers if the tracecontext header doesn't contain a es vendor entry.

@MartinKolarik would the restart_non_elastic or resample_non_elastic strategy work for your use case?

felixbarny avatar Sep 07 '21 16:09 felixbarny

@felixbarny the mentioned strategies would work for the described problem since the bunny header isn't coming from elastic but there's one more issue I see:

For public APIs/services - even if we ignore intentionally malicious users and go with the restart_non_elastic option as described, any consumers of the API that happen to use elastic APM as well will be sending us valid headers without even knowing about it. Ideally, I'd really like to be able to keep this enabled for communication between our services but ignored for anything else - and "anything else" means other elastic users as well.

MartinKolarik avatar Sep 07 '21 16:09 MartinKolarik

any consumers of the API that happen to use elastic APM as well will be sending us valid headers without even knowing about it

Yeah, good point. The request coming from another Elastic APM monitored service is a rather poor proxy for internal call.

Let's rename restart_non_elastic to restart_external and resample_non_elastic to resample_external. By default, the semantics would be the same i.e. the agent would look at the presence of the es vendor key in tracestate.

But we could add another optional config deployment_name or something like that that can have a short alphanumeric identifier. If configured, agents will send it as part of the tracestate header and only if the value matches with the configured value of the downstream service, it will consider the request as internal and continue the trace.

It's basically your suggestion with the shared secret but we wouldn't be calling it a "secret" as it's just a short identifier and it may be exposed to frontend clients. It should be short (I'm thinking max 5-10 chars) because it will be sent with every request and may cause overhead. Also, the tracestate header has strict size limitations. Another reason is to not give a false sense of this being a security feature (as the key may have to be exposed to frontend clients).

felixbarny avatar Sep 08 '21 06:09 felixbarny

@felixbarny 👍 yes, that should work (as you said, it's basically my original suggestion without considering it a security feature).

MartinKolarik avatar Sep 08 '21 23:09 MartinKolarik

This came up again recently. Applications run via Google Cloud Run automatically get a traceparent header set. AFAICT one cannot disable that traceparent header. As well, "requests are sampled at a maximum rate of 0.1 requests per second for each container instance", which cannot be configured. The result is that an app running via Cloud Run and using an Elastic APM agent cannot control the transaction_sample_rate. As well, I suspect they are subject to the issue mentioned above:

Problem is, in Kibana, under traces, only traces are listed where the 1. transaction has a parentid which is null (feel free to correct me, but from what I see that's how it works now), therefore in such a scenario those traces won't show up under traces in Kibana.

I will probably try a quick experimental feature for the Node.js APM agent to do either (a) a TraceContextIgnoreSampledFalse config var as Greg mentioned is in the .NET agent or (b) a limited first attempt at trace_continuation_strategy that Felix outlined above.

trentm avatar Jan 31 '22 23:01 trentm

I think we can simplify the trace_continuation_strategy when adding support for span links. Instead of differentiating between restart and resample, we would only have restart but also add a span link based on the traceparent header.

The issue is that we don't have support for span links for intake v2, yet. However, I think we can easily add that feature once we support span links.

This is the simplified set of options:

  • continue_always
  • restart_always
  • restart_external: By default, checks if es vendor key is set in tracestate. If project.name is set, propagates p:${project.name} and restarts if the project name from the incoming tracestate header doesn't match the configured one.

felixbarny avatar Feb 01 '22 07:02 felixbarny

  • restart_external: By default, checks if es vendor key is set in tracestate. If project.name is set, propagates p:${project.name} and restarts if the project name from the incoming tracestate header doesn't match the configured one.

@felixbarny project.name is an APM agent config option? (A new name for what you called deployment_name above?)

trentm avatar Feb 01 '22 16:02 trentm

@felixbarny project.name is an APM agent config option? (A new name for what you called deployment_name above?)

Yes. I was trying out project_name vs deployment_name but meant the same thing. No strong opinion on the actual name.

felixbarny avatar Feb 01 '22 17:02 felixbarny

FYI: I have a draft implementation of a subset of this for the Node.js agent here: https://github.com/elastic/apm-agent-nodejs/pull/2555 Currently it doesn't yet implement the restart_external option -- mainly because my potential dogfooder (kibana-ci-stats) doesn't need that option. Also because I don't know anything about span links yet. :)

trentm avatar Feb 01 '22 19:02 trentm

Can we close this, now that #596 is closed?

trentm avatar Apr 29 '22 20:04 trentm

I'm here today because of missing "dependencies" after moving an app to google cloud run, https://discuss.elastic.co/t/dependencies-dont-show-up-in-apm/288244/4 the workaround fixed it for me. I'm on v2.2.0 of apm and apmgin

roffe avatar Nov 07 '22 18:11 roffe

Can we close this, now that https://github.com/elastic/apm/issues/596 is closed?

Yes, I think the issues described here are addressed by that - I'm closing this.

I'm here today because of missing "dependencies" after moving an app to google cloud run, https://discuss.elastic.co/t/dependencies-dont-show-up-in-apm/288244/4 the workaround fixed it for me. I'm on v2.2.0 of apm and apmgin

@roffe it seems the feature which addresses this issue already landed in the Go agent: https://github.com/elastic/apm-agent-go/issues/1221

That's included in v2.2.0. This means you can just flip the new trace_continuation_strategy config to restart_external (or restart - I'm not familiar with your setup...) and you won't need the workaround anymore.

gregkalapos avatar Nov 08 '22 18:11 gregkalapos

This means you can just flip the new trace_continuation_strategy config to restart_external (or restart - I'm not familiar with your setup...) and you won't need the workaround anymore.

How would I go about to do this in ampgin middleware?

roffe avatar Nov 08 '22 18:11 roffe

This means you can just flip the new trace_continuation_strategy config to restart_external (or restart - I'm not familiar with your setup...) and you won't need the workaround anymore.

How would I go about to do this in ampgin middleware?

Options listed here.

gregkalapos avatar Nov 08 '22 18:11 gregkalapos

ah, ok so not configurable with env variables and it seems I need to create create a new middleware. im guessing it's the "In code, using the Tracer Config API" part you are refering to. gotcha 👍🏻

roffe avatar Nov 08 '22 18:11 roffe