eventing icon indicating copy to clipboard operation
eventing copied to clipboard

How to handle multiple request schemes in metrics

Open Cali0707 opened this issue 1 year ago • 5 comments

Problem Sometimes, an event which is dispatched from the imc dispatcher might dispatch with mixed schemes. If there are one or more subscribers with https, then those events will be dispatched with https scheme. If there are one or more subscribers with http, then those events will be dispatched with http scheme. This can cause issues when it comes to recording metrics, because if we record each metric once for http and once for https then our event count will be wrong for the dispatcher (since some events will be recorded twice). Some ideas proposed by @maschmid to fix this are:

  1. Rather than have event_scheme="http" or event_scheme="https", tag the metrics with event_scheme_http=true/false and event_scheme_https=true/false.
  2. Do event_scheme="https" only if ALL subscribers are https, otherwise report http.

Persona: Which persona is this feature for? Administrators trying to observe the system

Exit Criteria The metric counts are always correct, even in TLS permissive

Time Estimate (optional): How many developer-days do you think this may take to resolve? 1

Additional context (optional) Add any other context about the feature request here.

Cali0707 avatar Apr 25 '24 14:04 Cali0707

My personal vote here is to go with option 1 from above, as that way:

  1. the event count will be correct
  2. I can create other aggregate views of information to see if e.g. https dispatches fail more frequently than http dispatches

Cali0707 avatar Apr 25 '24 14:04 Cali0707

cc @pierDipi @creydr @Leo6Leo any thoughts on what approach we should take here? One of the options above, some other option?

Cali0707 avatar Apr 25 '24 14:04 Cali0707

To me the problem is not how we tag metrics but maybe how it is implemented, I don't think the proposed tags in 1 are idiomatic, also queries similar to 2

I can create other aggregate views of information to see if e.g. https dispatches fail more frequently than http dispatches

seem hard to express in promql as we couldn't easily aggregate using sum by (event_scheme)

Question: where is the metrics problem, is it on ingress or is it on egress ? On ingress, when an event comes in event_scheme is based on the channel's ingress scheme, if it's on egress we need to count and emit metrics for each subscriber separately rather than as aggregate

pierDipi avatar Apr 29 '24 12:04 pierDipi

@pierDipi this metrics problem is on the egress for IMC. So, I guess the fix is to change how we count and emit metrics in the imc-dispatcher?

Cali0707 avatar Apr 29 '24 15:04 Cali0707

/assign

pierDipi avatar May 08 '24 08:05 pierDipi