semantic-conventions URL query string values should be redacted by default

trafficstars

Fixes #860

Changes

Query string values are now redacted by default due to concerns around leaking sensitive data.

This is not considered a breaking change because the OTel semantic conventions definition of stability specifically allows changing attribute values:

Things not listed in the above are not expected to remain stable via semantic convention and are allowed (or expected) to change. A few examples:

The values of attributes

Apr 24 '24 14:04 trask

I would be more comfortable with keeping the current language and providing an opt-in redaction processor so as not to break existing users.

Apr 25 '24 17:04 austinlparker

I think it's fair to leave the registry docs the same as they were and put the opt-out redaction as an augmented description for http semconv.

Apr 25 '24 18:04 jsuereth

Is there any chance that you provide environmental variable name as an option to disable redaction? Without this I expect couple of different implementations in various languages.

Apr 25 '24 18:04 Kielek

Is there any chance that you provide environmental variable name as an option to disable redaction? Without this I expect couple of different implementations in various languages.

Based on this point from @Kielek a broader proposal: we will run into questions around sensitivity again and again, and right now it is solved selectively via inline comments in the specification (there are other ones for enduser.* and db.* properties), while this needs a broader solution. I suggest that such a solution could be designed in a way that it is decoupled from the attribute requirements and by that "solves" this problem. Here is a rough outline:

The semantic convention is kept as is (here: collect url.query by default)
There are "privacy requirement profiles" that end-users can/must provide when running OpenTelemetry SDKs. There is a "non-production" profile that collects everything and is not applying any redaction/filtering/etc. There is a "production" profile that applies certain level of redaction/filtering, and there may be other profiles for stricter requirements (e.g. in regulated environments, or for regions with stronger privacy legislation, etc.)
Those profiles can be provided via an environment variable as @Kielek suggested.
The filtering is applied before the telemetry leaves via an exporter.

I anticipate that this comes with a set of downsides (SDKs need to implement that filtering,redaction,etc., / those processes may be resource incentive for a busy node / ...), but this is something that in my experience end-user will ask for eventually anyhow (at least from my indirect experience with APM agents). Off-loading this to the collector may be the preferred solution, but depending on the privacy requirements that might not even be "good enough" (again, from my experience with APM agents, end-users in certain environments expect the data to be sanitized before leaving the application)

Apr 25 '24 19:04 svrnm

The semantic convention is kept as is (here: collect url.query by default)

It is important for native instrumentations to redact sensitive data or make users opt into the collection. For example, when writing logs, OTel is just one of possible logging providers and there are no guarantees secrets (if any) will be scrubbed by something in the pipeline.

I can see the world where:

instrumentations should redact by default. It's configurable
OTel SDK/distro has means to disable redaction depending on the profile it was able to detect

This allows instrumentations to protect themselves by having safe default (if my Azure SDK instrumentation leaked a credential, I'm responsible, not the OTel). If the distro has user permission or is brave enough, it can enable collection for all instrumentations.

Apr 26 '24 01:04 lmolkova

The semantic convention is kept as is (here: collect url.query by default)

It is important for native instrumentations to redact sensitive data or make users opt into the collection. For example, when writing logs, OTel is just one of possible logging providers and there are no guarantees secrets (if any) will be scrubbed by something in the pipeline.

I can see the world where:
* instrumentations should redact by default. It's configurable

* OTel SDK/distro has means to disable redaction depending on the profile it was able to detect
This allows instrumentations to protect themselves by having safe default (if my Azure SDK instrumentation leaked a credential, I'm responsible, not the OTel). If the distro has user permission or is brave enough, it can enable collection for all instrumentations.

I disagree: instrumentation libraries and native instrumentations should not carry the responsibility to redact sensitive data or provide users to opt into the collection from an OpenTelemetry perspective. If they use other logging providers or if they provide APIs to other tracing/metric providers, they can (and may be somtimes have to) do their own redaction, but that's not what we (OpenTelemetry) should expect them to do.

I argue that filtering and redaction should be a responsibility of the otel SDK and in no way the instrumentation library or the API should be expected to do the filtering:

The opentelemetry API by default is "noop", that means if it carries the responsibility for filtering and redaction it is either done "for nothing" or the availability of the SDK needs to be checked. I think this situation is comparable to sampling. The API might provide some properties that can help to identify if telemetry needs to be redacted/filtered, but the SDK needs to do the work.
For the reasons stated above the instrumentation library should also not carry that responsibility (again, they can, but they shouldn't be expected to do so). The argument for "who is responsible" works in both directions: if the azure SDK leaks credentials you are responsible, but if all HTTP instrumentation libraries following the otel standard leak credentials, the otel community is responsible.
A weaker argument in the same context is that responsible library authors implementing opentelemetry might be more carless around that sensitive data and either not think about it or expect opentelemetry to treat that data properly.
At the end the responsibility to protect sensitive data lays by the end-user consuming the instrumentation library/native instrumentation and the OpenTelemetry API/SDK in their application. Furthermore they are the only ones who have a clear understanding of their privacy needs, e.g. a purely internal application may be fine with collecting all the telemetry, an app in a staging/pre-prod environment as well, but an application that runs somewhere in production needs more data to be redacted, and if the application is used in regulated environments (medical, governmental, EU:-P, ...) that need is even higher.

To double down on this: I think that sampling and redaction share a lot of requirements, and therefore redaction should live in the Tracing/Logging/Metrics/Profiling SDK. There may be different strategies for redaction as well and there is room for innovation (see probabilistic sampling) and with the development of technology (and here: changes of legislation) requirements may change independent of the semantic conventions and specification

Apr 26 '24 08:04 svrnm

@svrnm are you saying that database queries should also not be redacted by default?

also, this PR is about security (leaking credentials) and not PII data, so I believe it applies to both production and non-production environments

Apr 26 '24 14:04 trask

@svrnm are you saying that database queries should also not be redacted by default?

What I am saying is that the approach we take for sensitive data redaction needs to be approached differently. Right now we have places in the spec that say "redact this data" or "do not collect this data", but no guidance on where and how this data should be treated accordingly. Right now it seems that this is a responsibility of the instrumenting library, so each HTTP instrumentation library will need to manage the redaction of url.query and provide configuration for that. That does not seem right to me and also not to fit into end-user expectations. If my application uses 12 different libraries as dependency, HTTP Client, HTTP Server, DB Client, etc.) I am expected to provide 12 different configurations for data redaction?

also, this PR is about security (leaking credentials) and not PII data, so I believe it applies to both production and non-production environments

I understand, but at the end it is about both. I also understand that this issue here is urgent and might need to be merged regardless to mitigate the security bug, but it it's still not clear then who is responsible for complying with this default?

Apr 26 '24 15:04 svrnm

I anticipate that this comes with a set of downsides (SDKs need to implement that filtering,redaction,etc., / those processes may be resource incentive for a busy node / ...), but this is something that in my experience end-user will ask for eventually anyhow (at least from my indirect experience with APM agents).

Yes, this. The SDK needs to build this in, and we should make a best effort to redact sensitive strings (we should probably have a way for instrumentations to communicate to the SDK what is sensitive, as well), but we cannot simply make a breaking change in one of the most widely used instrumentation paths like this, especially given that I think the majority of the values passed in through query strings in the world today are not sensitive.

I would suggest a staged rollout for these profiles as well; Perhaps initially we print a warning on initialization that query strings are passed without redaction and in the future this will change? Stage 2 would be an opt-in redaction processor, Stage 3 is redaction by default?

Apr 26 '24 15:04 austinlparker

Some rough thinking:

In many cases, being secure, being compliant and being able to observe things have conflict of interests (for example, https://github.com/open-telemetry/semantic-conventions/pull/961#discussion_r1579736370), the perfect solution might not exist, the balance keeps changing due to the changes in the surrounding environments (e.g. it changed a lot when GDPR came).
For people coming from different industry/domain, the definition of "what is a good balance" itself is very subjective.
The default behavior should set most users for success without them doing any extra things (e.g. the default "profile" will do proper redaction for the most common problems).
For users who want to change the default behavior, OpenTelemetry should provide a consistent way instead of let the user struggle in a jungle of 20+ instrumentation libraries where each have different ways of doing redaction.
There are also cases when the user wants to do special customization for a particular library rather than applying a rule to all the instrumentation libraries, or even a particular exporter path rather than all exporters (e.g. I want to redact the user email address when I exporter data to my normal logging backend, meanwhile I need to keep user email address for critical audit logs per requirement from the National Security Agency).
Supporting 4 & 5 will take time due to the size of the scope and the fact that people with different background have different understanding/positions, meanwhile we can decide either to wait for now or do something accepting that we're not perfect (and probably will never be perfect), as long as we know that OpenTelemetry can evolve.

Apr 26 '24 17:04 reyang

The SDK needs to build this in, and we should make a best effort to redact sensitive strings

This could also be a dangerous direction - for example, the pattern keeps changing when it comes "what looks like a credential leak", doing it inside the SDK has serviceability challenges (requiring the service to update dependency and redeploy). Either users don't use it, or they use it and have to do frequent patches in order to keep up with the latest rules, or we'll have to invent a system where these rules can be updated dynamically via some configuration.

Apr 26 '24 17:04 reyang

Leaving the defaults and responsibilities aside for a moment.

I would be interested to learn from vendors/instrumentation authors on

previous users complaints about sensitive data/potential leaks
do they redact in their legacy products or otel-based offerings

My list of anecdotal evidence:

users complaining about secrets in the URLs: https://github.com/microsoft/ApplicationInsights-dotnet/issues/2548, https://github.com/microsoft/ApplicationInsights-dotnet/issues/1877
instrumentations that disable/redact query params by default: Azure SDK on traces (unfortunately inconsistently across languages), ASP.NET Core for HTTP client and server logs

Random links from the CNCF blog https://www.cncf.io/blog/2023/02/07/migrating-from-opentracing-to-opentelemetry/

Oxeye’s research team discovered several scenarios where sensitive data was leaked through tracing and telemetry collection within cloud-native applications. They found one that belonged to a leading online payment services company among the many deployments.

https://www.oxeye.io/resources/how-insecure-application-tracing-and-telemetry-may-lead-to-sensitive-data-and-pii-leakage

Apr 26 '24 17:04 lmolkova

The SDK needs to build this in, and we should make a best effort to redact sensitive strings

This could also be a dangerous direction - for example, the pattern keeps changing when it comes "what looks like a credential leak", doing it inside the SDK has serviceability challenges (requiring the service to update dependency and redeploy). Either users don't use it, or they use it and have to do frequent patches in order to keep up with the latest rules, or we'll have to invent a system where these rules can be updated dynamically via some configuration.

I mean, it could be a regular expression (or even some simple heurestics -- the provided examples of Azure/GCP/AWS calls with tokens in query strings, for example).

I would point out that Datadog offers two options - redact query string, and redact paths with digits (https://docs.datadoghq.com/tracing/configure_data_security/?tab=http#trace-obfuscation) but they are both disabled by default. In addition, as other comments have pointed out, we ship a system for redaction today, it's in the collector, and our official guidance on production deploys is to use a collector.

I am not arguing that we shouldn't have a redaction option, I am arguing that we should not default it to 'on'. I would point out that this proposed new behavior was opposed by comments from the community on the .NET repo where it originated (https://github.com/open-telemetry/opentelemetry-dotnet/pull/5532#issuecomment-2056788462).

edit: another example from the datadog docs, specifically around their library: https://docs.datadoghq.com/tracing/configure_data_security/?tab=http#library. they redact 'suspicious-looking' values through a regex, the default of which is below:

(?:(?:"|%22)?)(?:(?:old[-_]?|new[-_]?)?p(?:ass)?w(?:or)?d(?:1|2)?|pass(?:[-_]?phrase)?|secret|(?:api[-_]?|private[-_]?|public[-_]?|access[-_]?|secret[-_]?|app(?:lication)?[-_]?)key(?:[-_]?id)?|token|consumer[-_]?(?:id|key|secret)|sign(?:ed|ature)?|auth(?:entication|orization)?)(?:(?:\s|%20)*(?:=|%3D)[^&]+|(?:"|%22)(?:\s|%20)*(?::|%3A)(?:\s|%20)*(?:"|%22)(?:%2[^2]|%[^2]|[^"%])+(?:"|%22))|(?:bearer(?:\s|%20)+[a-z0-9._\-]+|token(?::|%3A)[a-z0-9]{13}|gh[opsu]_[0-9a-zA-Z]{36}|ey[I-L](?:[\w=-]|%3D)+\.ey[I-L](?:[\w=-]|%3D)+(?:\.(?:[\w.+/=-]|%3D|%2F|%2B)+)?|-{5}BEGIN(?:[a-z\s]|%20)+PRIVATE(?:\s|%20)KEY-{5}[^\-]+-{5}END(?:[a-z\s]|%20)+PRIVATE(?:\s|%20)KEY(?:-{5})?(?:\n|%0A)?|(?:ssh-(?:rsa|dss)|ecdsa-[a-z0-9]+-[a-z0-9]+)(?:\s|%20|%09)+(?:[a-z0-9/.+]|%2F|%5C|%2B){100,}(?:=|%3D)*(?:(?:\s|%20|%09)+[a-z0-9._-]+)?)

Apr 26 '24 17:04 austinlparker

@lmolkova

I would be interested to learn from vendors/instrumentation authors on

My experience in the past ~3 years in my capacity working for a vendor (Honeycomb) has been a mix:

Vast majority of customers have no issues with this at all. They recognize that debugging information means there may be sensitive information, including possible credentials. SOC II compliance, encryption at rest and in transit, the ability to sign a DPA, and auditable, proper least-privilege controls for their data are all table stakes.
Most customers who operate on data with a degree of sensitivity typically have their own, internal mechanisms to ensure nothing bad leaks. Things still leak of course, and when it happens they ask us to delete the data at the requested level of granularity. Note that the 60-day retention period of data (after which data is deleted permanently) alleviates many of these concerns by default.
Enterprise customers dealing with this kind of stuff -- when they know it's "a thing" -- ask what their options are for data redaction as a last-step measure before egress from their own network at various stages, but it's typically before signing up to instrument beyond a POC. Thus far, the redactionprocessor has been sufficient for all but one customer.
One customer is extremely careful about privacy and has built their own encryption proxy that ensures no real names ever leave their network, and uses another tool to decrypt values in the browser when using Honeycomb.
I have spoken with one customer who was concerned about how their URL representations could leak sensitive information, both in query params and in the URL itself. They acknowledged this is very much a "we have a lot of work to do to clean things up ourselves" situation, and wanted to make sure there was some consumable information about how to redact information, including how to configure specific instrumentations so that sensitive data doesn't get leaked to begin with.

Apr 26 '24 17:04 cartermp

Vast majority of customers have no issues with this at all. They recognize that debugging information means there may be sensitive information, including possible credentials. SOC II compliance, encryption at rest and in transit, the ability to sign a DPA, and auditable, proper least-privilege controls for their data are all table stakes.

I would also add that in five years at Lightstep I don't recall anyone having an issue with this. Any time we did have a customer send PII or other sensitive data, we had ways for them to delete it, and we offered redaction as a part of our collection pipeline for known sensitive keys/strings.

Apr 26 '24 18:04 austinlparker

I mean, it could be a regular expression (or even some simple heuristics -- the provided examples of Azure/GCP/AWS calls with tokens in query strings, for example).

I'd love to be able to do this.

Could we all agree with something like:

this set of query params (other properties) is safe, they should be on by default
this set of query params contains secrets for sure, they should be off by default (and maybe don't even have to be configurable)
For the rest, we want SDK/distro to be able to apply a common default

?

Apr 26 '24 18:04 lmolkova

I mean, it could be a regular expression (or even some simple heuristics -- the provided examples of Azure/GCP/AWS calls with tokens in query strings, for example).

I'd love to be able to do this.

Could we all agree with something like:

this set of query params (other properties) is safe, they should be on by default

this set of query params contains secrets for sure, they should be off by default (and maybe don't even have to be configurable)

For the rest, we want SDK/distro to be able to apply a common default

?

I would support this.

Apr 26 '24 18:04 austinlparker

I mean, it could be a regular expression (or even some simple heuristics -- the provided examples of Azure/GCP/AWS calls with tokens in query strings, for example).

I'd love to be able to do this. Could we all agree with something like:

this set of query params (other properties) is safe, they should be on by default

this set of query params contains secrets for sure, they should be off by default (and maybe don't even have to be configurable)

For the rest, we want SDK/distro to be able to apply a common default

?

I would support this.

I've sent a PR to motivate further discussion of this proposal: #971

Apr 26 '24 18:04 trask

This PR was marked stale due to lack of activity. It will be closed in 7 days.

May 12 '24 03:05 github-actions[bot]

Commenting to unstale

May 13 '24 06:05 svrnm

This PR was marked stale due to lack of activity. It will be closed in 7 days.

May 29 '24 03:05 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

Jun 05 '24 03:06 github-actions[bot]

This PR was marked stale due to lack of activity. It will be closed in 7 days.

Jun 21 '24 03:06 github-actions[bot]

What is blocking this? Making this optional and configurable in a natural way in the .net instrumentation is blocked by this issue being in limbo. https://github.com/open-telemetry/opentelemetry-dotnet-contrib/issues/1954#issuecomment-2231884662

Sep 18 '24 23:09 Wraith2

What is blocking this? Making this optional and configurable in a natural way in the .net instrumentation is blocked by this issue being in limbo. open-telemetry/opentelemetry-dotnet-contrib#1954 (comment)

This PR has lot of objections (judging from the discussions, thumbs-down reactions etc.). A scaled down version is here : https://github.com/open-telemetry/semantic-conventions/pull/971 which has less opposition from what I can tell!

Sep 19 '24 15:09 cijothomas

Superceded by #971

Nov 10 '24 23:11 trask

semantic-conventions semantic-conventions copied to clipboard

URL query string values should be redacted by default

Changes

semantic-conventions
semantic-conventions copied to clipboard