venice icon indicating copy to clipboard operation
venice copied to clipboard

[tc][fc][dvc] Add client availability/latency metrics in Otel

Open m-nagarajan opened this issue 9 months ago • 1 comments

Problem Statement

Add client availability/latency metrics in Otel on top of the existing tehuti metrics under existing Otel configs in VeniceMetricsConfig

Solution

2 solutions can be considered:

  1. name of the metrics be venice.client.<> and add a dimension venice.client.type with values thin_client, fast_client, da_vinci_client, etc. This requires a metric with the same name across clients (eg: venice.client.call_count) to have the same set of dimensions.
  2. name of the metrics be venice.thin_client.<> , venice.fast_client.<>, venice.da_vinci_client.<>, etc with no new dimension for the type.

This draft PR follows the first approach to be able to group things together easily:

venice.client.call_count
venice.client.call_time
venice.client.key_count

All these metrics are currently in a single class BasicClientStats.java, this PR resuses this class by also adding unhealthy_ latency_metrics to it (from ClientStats.java). This means DVC will be emitting this metric as well from now on.

Right now, I am exploring whether 2nd option might be a better alternative or is it okay to continue with 1st. Will update the desc and code accordingly which will be a simple change.

Other existing tehuti metric like request, request_key_count, success_request_ratio, success_request_key_ratio will be derived metrics in OTel based on venice.client.key_count metric and its dimension venice.response.status_code_category (success/fail)

Added setOtelAdditionalMetricsReader in VeniceMetricsConfig to be able to pass inInMemoryMetricReader from io.opentelemetry:opentelemetry-sdk-testing to be able to test the otel metrics.

Code changes

  • [x] Added new code behind a config. If so list the config names and their default values in the PR description.
  • [ ] Introduced new log lines.
    • [ ] Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • [ ] Code has no race conditions or thread safety issues.
  • [ ] Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • [ ] No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • [ ] Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • [ ] Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

TBD

  • [x] New unit tests added.
  • [ ] New integration tests added.
  • [x] Modified or extended existing tests.
  • [ ] Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • [ ] No. You can skip the rest of this section.
  • [x] Yes. Clearly explain the behavior change and its impact. Clients support emitting these new metrics in OTel

m-nagarajan avatar Apr 11 '25 23:04 m-nagarajan

[common] is an invalid commit summary prefix tag. Please fix prior to merging.

Valid tags are those of the deployables and user libraries affected by the change (i.e. including all those calling into the modified venice-common code).

Thanks.

FelixGV avatar Apr 17 '25 22:04 FelixGV