
Add feature flagging semantic conventions

Open beeme1mr opened this issue 2 years ago • 58 comments

Resolves https://github.com/open-telemetry/opentelemetry-specification/issues/2532

Changes

Added semantic conventions for representing feature flags as spans. This can be used to determine the impact a feature has on a request.

Related issues

  • https://github.com/open-telemetry/opentelemetry-specification/issues/2532
  • https://github.com/open-feature/spec/issues/30

beeme1mr avatar May 10 '22 21:05 beeme1mr

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: beeme1mr / name: Michael Beemer (f1aa8732f2059f6051fc7ce0ed542292a11c2602)

This needs an issue in this repository (a reference to an issue in another repository is useful but not sufficient), and it should then follow the issue workflow. Ideally this should have happened before the PR was created so that we know we want it in OpenTelemetry.

tigrannajaryan avatar May 10 '22 22:05 tigrannajaryan

> This needs an issue in this repository (a reference to an issue in another repository is useful but not sufficient), and it should then follow the issue workflow. Ideally this should have happened before the PR was created so that we know we want it in OpenTelemetry.

You can wait a bit and see if there are approvers commenting on this PR who think the PR is a good idea then we can move forward without an issue.

tigrannajaryan avatar May 10 '22 22:05 tigrannajaryan

Hi @tigrannajaryan, sorry for not following protocol. I'm happy to close this PR until it has gone through the proper channels. However, I'm not sure what the proper channel is. The contributing doc that you sent says "significant changes" require an OTEP and that small changes can be made directly via a PR. This change seems to fall somewhere in between those two extremes.

beeme1mr avatar May 11 '22 01:05 beeme1mr

I just noticed that you clarified the PR process. I'll open an issue tomorrow. Thanks!

beeme1mr avatar May 11 '22 01:05 beeme1mr

May I get a short summary regarding why feature-flagging needs to be part of semantic conventions? Is there an example (I read through the linked issue and couldn't find a clear explanation)?

reyang avatar May 11 '22 06:05 reyang

> May I get a short summary regarding why feature-flagging needs to be part of semantic conventions? Is there an example (I read through the linked issue and couldn't find a clear explanation)?

If feature flag keys and values are consistently reported, it will enable splitting traces and metrics by these values for use cases like A/B testing, feature rollouts, and similar. In order for this to work, it must be consistent across all microservices, potentially in multiple languages and/or using multiple feature flag services.

dyladan avatar May 11 '22 13:05 dyladan
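As a sketch of the splitting use case described above (the record format and helper function here are hypothetical; the `feature_flag.flag_key` and `feature_flag.evaluated_variant` attribute names are the ones discussed in this thread), consistently named attributes make it straightforward to group a metric by variant:

```python
# Hypothetical example: split request latencies by evaluated flag variant.
# Assumes each telemetry record carries the proposed feature_flag.*
# span attributes; not an official OpenTelemetry API.
from collections import defaultdict

def mean_latency_by_variant(records, flag_key):
    """Group latencies by variant for one flag and average them."""
    buckets = defaultdict(list)
    for rec in records:
        attrs = rec["attributes"]
        if attrs.get("feature_flag.flag_key") == flag_key:
            buckets[attrs["feature_flag.evaluated_variant"]].append(rec["latency_ms"])
    return {variant: sum(vals) / len(vals) for variant, vals in buckets.items()}

requests = [
    {"latency_ms": 120, "attributes": {"feature_flag.flag_key": "new-checkout",
                                       "feature_flag.evaluated_variant": "on"}},
    {"latency_ms": 80,  "attributes": {"feature_flag.flag_key": "new-checkout",
                                       "feature_flag.evaluated_variant": "off"}},
    {"latency_ms": 100, "attributes": {"feature_flag.flag_key": "new-checkout",
                                       "feature_flag.evaluated_variant": "on"}},
]
print(mean_latency_by_variant(requests, "new-checkout"))  # {'on': 110.0, 'off': 80.0}
```

Because the attribute names are identical regardless of which feature flag vendor produced them, this kind of aggregation works across services written in different languages.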

> Hi @tigrannajaryan, sorry for not following protocol. I'm happy to close this PR until it has gone through the proper channels. However, I'm not sure what the proper channel is. The contributing doc that you sent says "significant changes" require an OTEP and that small changes can be made directly via a PR. This change seems to fall somewhere in between those two extremes.

@beeme1mr no problem, we need the contributing documentation to be more clear. I think it is fine to keep this PR open.

tigrannajaryan avatar May 11 '22 14:05 tigrannajaryan

> If feature flag keys and values are consistently reported, it will enable splitting traces and metrics by these values for use cases like A/B testing, feature rollouts, and similar. In order for this to work, it must be consistent across all microservices, potentially in multiple languages and/or using multiple feature flag services.

If the user has System X calling into Cloud Provider Y, which in turn calls back to System Z, and they use different A/B testing / feature flag systems, what would be the expectation? Or is this PR specific to OpenFeature and its compatible systems, not other systems?

reyang avatar May 12 '22 01:05 reyang

> If feature flag keys and values are consistently reported, it will enable splitting traces and metrics by these values for use cases like A/B testing, feature rollouts, and similar. In order for this to work, it must be consistent across all microservices, potentially in multiple languages and/or using multiple feature flag services.

> If the user has System X calling into Cloud Provider Y, which in turn calls back to System Z, and they use different A/B testing / feature flag systems, what would be the expectation? Or is this PR specific to OpenFeature and its compatible systems, not other systems?

They would just report span attributes consistent with the feature flag system they're using. The attributes are general enough to be used with any feature flag system. I'm not sure I see what the problem is.

dyladan avatar May 12 '22 12:05 dyladan

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar May 21 '22 03:05 github-actions[bot]

This is not stale

dyladan avatar May 24 '22 19:05 dyladan

Marking this as a draft until #2532 is resolved.

beeme1mr avatar May 24 '22 19:05 beeme1mr

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Jun 01 '22 03:06 github-actions[bot]

Not stale

dyladan avatar Jun 01 '22 16:06 dyladan

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Jun 10 '22 03:06 github-actions[bot]

Still waiting for an official response on #2532

beeme1mr avatar Jun 15 '22 12:06 beeme1mr

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Jun 29 '22 03:06 github-actions[bot]

I removed the Stale label to give more time for people to express their support at https://github.com/open-telemetry/opentelemetry-specification/issues/2532

tigrannajaryan avatar Jun 29 '22 13:06 tigrannajaryan

@beeme1mr

One thing that may be worth adding is an evaluation_reason or something similar. This would identify why a particular flag value was returned (for instance, a targeting rule was matched, a random split is configured, or a static flag value was assigned):

      - id: evaluation_reason
        type: string
        requirement_level: recommended
        brief: A string indicating why a particular variant was selected.
        examples: ["STATIC", "TARGETING_MATCH", "PSEUDORANDOM_SPLIT"]

Such a concept exists in OpenFeature as well as some vendors, for example: https://docs.launchdarkly.com/sdk/concepts/evaluation-reasons

toddbaert avatar Jul 11 '22 20:07 toddbaert
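A minimal sketch of how such a reason might travel alongside the other attributes (everything here is illustrative: the fully qualified `feature_flag.evaluation_reason` name is an assumption, since only the `evaluation_reason` id appears in the snippet above, and the evaluation logic is a toy stand-in for a real provider):

```python
# Hypothetical flag evaluation that returns a variant plus the reason
# it was selected. Attribute names are proposed/illustrative, not final.
def evaluate_flag(flag_key, user, targeted_users):
    if user in targeted_users:
        variant, reason = "on", "TARGETING_MATCH"
    else:
        variant, reason = "off", "STATIC"
    return {
        "feature_flag.flag_key": flag_key,
        "feature_flag.evaluated_variant": variant,
        "feature_flag.evaluation_reason": reason,
    }

attrs = evaluate_flag("debug-mode", "alice", {"alice"})
print(attrs["feature_flag.evaluation_reason"])  # TARGETING_MATCH
```

The reason string makes it visible "at a glance" in telemetry that targeting fired, without cross-referencing the flag configuration.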

> @beeme1mr
>
> One thing that may be worth adding is an evaluation_reason or something similar. This would identify why a particular flag value was returned (for instance, a targeting rule was matched, a random split is configured, or a static flag value was assigned):
>
>       - id: evaluation_reason
>         type: string
>         requirement_level: recommended
>         brief: A string indicating why a particular variant was selected.
>         examples: ["STATIC", "TARGETING_MATCH", "PSEUDORANDOM_SPLIT"]
>
> Such a concept exists in OpenFeature as well as some vendors, for example: docs.launchdarkly.com/sdk/concepts/evaluation-reasons

I see a couple problems with this (although none that can't be overcome).

  1. I can't see what obvious value that provides. Knowing that a flag is true because it targets specific users is possible from just the key and the data stored in your feature flag system.
  2. I think this might be tough to implement in a way that makes sense across all tools/vendors. In order for the reason to be meaningful, each reason would have to have a definition and all vendors would have to agree on them. There is also a lot of potential crossover between different reasons. For example, if I split 50% of my traffic based on the hash of the user's email, is that TARGETING_MATCH or is that PSEUDORANDOM_SPLIT?

I think in this case I would advocate to get the basic functionality already in the PR merged, then see what users ask for when they start actually consuming telemetry data.

dyladan avatar Jul 11 '22 20:07 dyladan

> 1. I can't see what obvious value that provides. Knowing that a flag is true because it targets specific users is possible from just the key and the data stored in your feature flag system.

This is true, but it might be tedious to cross reference such configuration when viewing telemetry data. The reason can help make it obvious that some specific targeting occurred "at a glance", perhaps in a situation when a static value was expected (maybe somebody forgot to remove some specific targeting that was added for debug purposes).

Vendors thought this information was important enough to include it in their API, so I think it could be important for the semantic conventions.

> 2. I think this might be tough to implement in a way that makes sense across all tools/vendors. In order for the reason to be meaningful, each reason would have to have a definition and all vendors would have to agree on them. There is also a lot of potential crossover between different reasons. For example, if I split 50% of my traffic based on the hash of the user's email, is that TARGETING_MATCH or is that PSEUDORANDOM_SPLIT?

I'm not sure that there needs to be a precise consensus on the possible values of this field across all providers for it to be valuable, but I see your point. Perhaps a lack of consistency here is a dealbreaker :shrug:

> I think in this case I would advocate to get the basic functionality already in the PR merged, then see what users ask for when they start actually consuming telemetry data.

Fine with me!

toddbaert avatar Jul 11 '22 20:07 toddbaert

> 1. I can't see what obvious value that provides. Knowing that a flag is true because it targets specific users is possible from just the key and the data stored in your feature flag system.

> This is true, but it might be tedious to cross reference such configuration when viewing telemetry data. The reason can help make it obvious that some specific targeting occurred "at a glance", perhaps in a situation when a static value was expected (maybe somebody forgot to remove some specific targeting that was added for debug purposes).

¯\_(ツ)_/¯ the value there seems like a stretch to me, but maybe I'm wrong.

> Vendors thought this information was important enough to include it in their API, so I think it could be important for the semantic conventions.

This is definitely not reason enough on its own. There are all sorts of things available from host/process/library APIs that we don't include in semconv.

Like I said, I think all of these things can be overcome, but it is a non-breaking addition and can always be included later if it becomes important.

dyladan avatar Jul 11 '22 20:07 dyladan

Hi @tigrannajaryan and @SergeyKanzhelev, do you have any additional questions or concerns I can address?

beeme1mr avatar Jul 20 '22 14:07 beeme1mr

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Jul 29 '22 03:07 github-actions[bot]

Removing the Stale label. It would be good to see the questions I raised here answered by this PR.

tigrannajaryan avatar Jul 29 '22 14:07 tigrannajaryan

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Aug 17 '22 03:08 github-actions[bot]

I am still not convinced the feature-flag data in the proposed form is useful; however, I won't block the PR, and if a sufficient number of other approvers find it useful it can get merged.

tigrannajaryan avatar Aug 17 '22 15:08 tigrannajaryan

Would you envisage feature flag attributes appearing as a "resource" attribute rather than just a tracing span attribute?

We sometimes run experiments with the same source code running side by side in two deployments (with randomly distributed A/B-style user traffic), but often with different compilation options (e.g. -O3 vs. -O2), or perhaps running on JVM 11 vs. JVM 14. Usually this is for internal regression analysis/comparison of performance.

I could imagine creating feature_flag.flag_key=jvm and feature_flag.evaluated_variant=jdk14 attributes for such a situation and adding them to the resource, but I don't know if this is what is intended.

For the JDK, it seems like process.runtime.name and process.runtime.version are probably sufficient, but there are other situations where we may fix the value of a feature flag for a particular deployment, so that everything generated by that entity will have that feature flag set to the fixed value, while our comparison deployment will have the feature flag set to off.

Personally for our case, I thought something like a service.variant (or perhaps just a suffixed service.version) would be appropriate given that service.* is already associated with Resources, but this PR caught my attention since it has some similarity in the semantics.

evantorrie avatar Aug 19 '22 21:08 evantorrie

Hi @evantorrie, you could certainly use this proposal for analyzing A/B tests. For example, you could select which Fibonacci algorithm to run using a feature flag. The results could be added to the trace as a dedicated span. In many tools, you can search for a specific span name and value to see the full trace. This would allow you to compare the various states to see how the flag impacted the request.

> Would you envisage feature flag attributes appearing as a "resource" attribute rather than just a tracing span attribute?

From my understanding, resource attributes are meant to be set when the system starts and should be immutable. A feature flag can be dynamically changed during runtime. The value can also be different based on evaluation context (e.g. enabled for a specific user).

beeme1mr avatar Aug 23 '22 13:08 beeme1mr
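To illustrate the distinction being drawn here (a hypothetical sketch, not the OpenTelemetry API: the helper, flag name, and rollout logic are all invented for illustration, while the attribute names are the ones used earlier in the thread): resource attributes are fixed for the life of the process, whereas a context-dependent flag can evaluate differently on every request, which is why span-level attributes fit dynamic flags better.

```python
# Hypothetical sketch contrasting process-lifetime resource attributes
# with per-request flag evaluation. A deterministic pseudorandom split
# on the user id means two requests in the SAME process can see
# different variants, so the flag value belongs on the span.
import zlib

RESOURCE_ATTRIBUTES = {"service.name": "checkout", "service.version": "1.4.2"}

def span_attributes(user_id, flag_key="new-checkout", rollout_percent=50):
    # crc32 gives a stable bucket in [0, 100), unlike Python's hash(),
    # which is randomized per process for strings.
    bucket = zlib.crc32(user_id.encode()) % 100
    variant = "on" if bucket < rollout_percent else "off"
    return {
        "feature_flag.flag_key": flag_key,
        "feature_flag.evaluated_variant": variant,
    }
```

For evantorrie's deployment-pinned case, where every request in a deployment sees the same fixed value, a resource-level attribute (or something like the suggested service.variant) arguably fits better; the span-level convention targets the dynamic, per-evaluation case.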