
[Metric API] Instrument validation error handling

cijothomas opened this issue 9 months ago • 10 comments

Noticed an inconsistency in the Metric API regarding the init and try_init methods for instrument initialization. The documentation for init says that it panics when validation fails, while try_init is expected to return a Result, allowing for error handling.

However, it appears that neither method behaves as documented. Specifically, the SDK implementation seems to suppress the Err variant: it logs an error and invariably returns Ok, continuing to provide a valid instrument. This should potentially be classified as a bug.
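
For reference, this is roughly what the documented contract promises (a minimal sketch, assuming the current counter builder API; it is not what the SDK actually does today):

```rust
use opentelemetry::{global, KeyValue};

fn main() {
    let meter = global::meter("example");

    // Documented: init() panics if the instrument name fails validation.
    let counter = meter.u64_counter("requests_total").init();
    counter.add(1, &[KeyValue::new("status", "ok")]);

    // Documented: try_init() returns Err on validation failure so the caller can react.
    match meter.u64_counter("invalid name!").try_init() {
        Ok(counter) => counter.add(1, &[]),
        Err(err) => eprintln!("instrument creation failed: {err}"),
    }
}
```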

Proposing the following changes to address this:

  • Modify the SDK implementation to return a NoOp instrument instead of an actual instrument when validation fails.

Additionally, need some input/suggestions on the following:

  1. Considering that the Err variant is never returned, should we remove the try_init method?
  2. Alternatively, should we retain both init and try_init, ensuring they adhere to the current documentation—where init panics on invalid input, and try_init returns a Result? This would place the onus on the user to either handle the results or accept the possibility of a panic.

cijothomas · May 01 '24 19:05

Proposal: (briefly covered this in community call today)

  1. Rename init to build()
  2. There is no need for a try_build()
  3. build() to always return an Instrument. I am unsure of the value of returning a Result, as there isn't anything the user can do about it at runtime!
  4. If there is a validation error (invalid characters in the name, etc.), still return an instrument, but with no-op behavior, and emit an ERROR level log (see the sketch below).
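
To make item 4 concrete, here is a standalone sketch of the pattern (not actual SDK code; the name check is a simplified stand-in for the real validation rules):

```rust
// Standalone illustration of "always hand back an instrument, fall back to a
// no-op on validation failure"; not the actual SDK types or validation rules.
trait Counter {
    fn add(&self, value: u64);
}

struct RealCounter;
impl Counter for RealCounter {
    fn add(&self, value: u64) {
        println!("recorded {value}");
    }
}

struct NoopCounter;
impl Counter for NoopCounter {
    fn add(&self, _value: u64) {
        // Measurements are silently dropped.
    }
}

fn build_counter(name: &str) -> Box<dyn Counter> {
    // Simplified stand-in for the spec's instrument-name validation.
    let valid = !name.is_empty()
        && name.chars().all(|c| c.is_ascii_alphanumeric() || "._-/".contains(c));
    if valid {
        Box::new(RealCounter)
    } else {
        eprintln!("ERROR: invalid instrument name {name:?}; returning a no-op counter");
        Box::new(NoopCounter)
    }
}

fn main() {
    build_counter("requests.total").add(1); // real counter, measurement recorded
    build_counter("bad name!").add(1);      // no-op counter, ERROR emitted at creation
}
```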

@lalitb @TommyCpp @utpilla Does the above proposal look good? This is an important blocker for moving the metrics API to RC and then stable.

cijothomas · Oct 02 '24 00:10

I am unsure of the value of returning a Result, as there isn't anything the user can do about it at runtime

If there is a validation error (invalid characters in the name, etc.), still return an instrument, but with no-op behavior, and emit an ERROR level log

It seems we are deciding for users that if the instrument builder finds an error, we will log the error and return a no-op instrument.

I see value in providing both try_build and build so that users can choose as they like. In general, as a library, we shouldn't assume how users want to handle errors.

I am good if we start with build as described above and leave try_build as a follow-up should the need arise. It won't be a breaking change.

TommyCpp · Oct 02 '24 01:10

What is the expected usage pattern for a try_build that returns a Result? If a user gets an Err result, what are they expected to do with it? Show an example so we can use that to discuss.

The spec is intentionally explicit that whether a no-op instrument or a real instrument is returned is unspecified and left up to implementations, so we can decide the behavior and stick with it.

(we have to solve this before stable as this is a public API.)

cijothomas · Oct 02 '24 01:10

If a user gets an Err result, what are they expected to do with it? Show an example so we can use that to discuss.

What if I don't want to log this as an ERROR but instead as a WARN? Users may have alerts based on the number of ERROR logs, and having metric init failures counted there doesn't make sense.
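
For example, something along these lines (a sketch only, assuming the proposed build()/try_build() naming and that the error type implements Display):

```rust
use opentelemetry::global;

fn main() {
    let meter = global::meter("example");

    // Hypothetical try_build(): the caller decides how loud the failure is.
    let counter = match meter.u64_counter("invalid name!").try_build() {
        Ok(counter) => counter,
        Err(err) => {
            // Log at WARN (or bump an internal "telemetry setup issue" metric)
            // instead of the SDK forcing an ERROR log on us, then fall back.
            tracing::warn!("instrument creation failed: {err}; using fallback name");
            meter.u64_counter("fallback_counter").build()
        }
    };
    counter.add(1, &[]);
}
```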

(we have to solve this before stable as this is a public API.)

Let's say we make the build API behave as suggested. Why would it be a breaking change for us to add a try_build method to the builders? It doesn't change the build method or existing use cases.

TommyCpp · Oct 02 '24 01:10

What if I don't want to log this as an ERROR but instead as a WARN? Users may have alerts based on the number of ERROR logs, and having metric init failures counted there doesn't make sense.

As per the spec wording, we should notify with an error. It does not literally mean a log severity of error, but I don't see anything wrong with error severity, as this is clearly a user error. If they have alerts based on Error logs, then perfect - they'll know something is off, and the error log will tell them "opentelemetry-sdk is ignoring your instrument as it is invalid due to xyz", so they can fix it. (Though Warning may be sufficient, I think any time there is data loss, Error makes more sense, IMHO.)

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#instrument-name If the instrument name does not conform to this syntax, the Meter SHOULD emit an error notifying the user about the invalid name. It is left unspecified if a valid instrument is also returned.

cijothomas · Oct 02 '24 01:10

Let's say we make the build API behave as suggested. Why would it be a breaking change for us to add a try_build method to the builders? It doesn't change the build method or existing use cases.

You are right! It is not a breaking change to add a new try_build() method. This means we should remove try_build() before RC/Stable. Also, we cannot change the behavior of build() after stable: if build() was previously returning a no-op instrument in case of an invalid name, we cannot change that behavior, as it would be considered a breaking change. (The vice versa may be considered okay.)

cijothomas · Oct 02 '24 01:10

This means we should remove try_build() before RC/Stable.

Yeah, I think that makes sense and I am aligned.

This means we should remove try_build() before RC/Stable.

We won't, right? It will output ERROR logs and return a no-op instrument. If someday users open issues saying they want custom behavior when instrument creation fails, we can then offer try_build, which will return a Result<T, E> instead of a no-op.

TommyCpp · Oct 02 '24 01:10

if they have alerts based on Error logs, then perfect - they'll know something is off, and the error log will tell them "opentelemetry-sdk is ignoring your instrument as it is invalid due to xyz", so they can fix it.

Think about it this way: if I have a paging alarm on the number of logs >= ERROR, and someone pushes a change that breaks a new instrument, I don't think it is worth waking the on-call at 2 A.M. Personally I don't think that's the correct way to set up alarms (metrics would be better), but I have seen it before.

TommyCpp · Oct 02 '24 01:10

if they have alerts based on Error logs, then perfect - they'll know something is off, and the error log will tell them "opentelemetry-sdk is ignoring your instrument as it is invalid due to xyz", so they can fix it.

Think about it this way: if I have a paging alarm on the number of logs >= ERROR, and someone pushes a change that breaks a new instrument, I don't think it is worth waking the on-call at 2 A.M. Personally I don't think that's the correct way to set up alarms (metrics would be better), but I have seen it before.

This is not a problem specific to instrument name validation. Every time OTel emits an internal log, users can hit this issue, which is why we need to give users the ability to decide what to do for each internal event. https://github.com/open-telemetry/opentelemetry-rust/pull/2128#issuecomment-2387423439 They can choose to ignore all logs with a "target" starting with opentelemetry. Or they can provide a callback, and the event name can be used to determine what they want to do. E.g., Instrument-Name-Invalid may be ignorable for some, but Export-Failed may not be!
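
For instance, assuming internal logs are emitted via tracing as discussed in #2128, a user could drop everything from OTel targets with a per-layer filter (a sketch, not the only way to do it):

```rust
use tracing_subscriber::{filter, fmt, layer::SubscriberExt, util::SubscriberInitExt, Layer};

fn main() {
    // Drop every event whose target starts with "opentelemetry" (SDK internal
    // logs) while keeping application logs; a callback-style hook could instead
    // inspect the event name to keep, say, export failures but ignore
    // instrument-name validation errors.
    let drop_otel_internal =
        filter::filter_fn(|meta| !meta.target().starts_with("opentelemetry"));

    tracing_subscriber::registry()
        .with(fmt::layer().with_filter(drop_otel_internal))
        .init();

    tracing::error!(target: "opentelemetry_sdk", "filtered out");
    tracing::info!("application logs still go through");
}
```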

Think about it this way: if I have a paging alarm on the number of logs >= ERROR, and someone pushes a change that breaks a new instrument, I don't think it is worth waking the on-call at 2 A.M.

I see it differently. If their application will not get a particular metric that used to work before, then it is worth waking up the on-call person. What if this instrument is their SLO metric? Anyway, this is not something we should dictate. OTel can do internal logging, and users can decide whether they want to react to OTel logs, ignore them completely, or do something else with them.

cijothomas · Oct 02 '24 01:10

I see it differently. If their application will not get a particular metric that used to work before, then it is worth waking up the on-call person. What if this instrument is their SLO metric?

As you point out, the importance of different metrics can vary widely, so treating every metric validation failure the same may not be ideal in every case. In this case, I don't even think a log filter can help, as the call site is the same (need to look into #2128 more to be sure). So it's possible someone will find try_build useful. But it's generally a niche use case and we don't need to resolve it beforehand.

TommyCpp · Oct 02 '24 01:10