
Expose traces via OpenTelemetry to enable distributed tracing for production use cases

Open · salliewalecka opened this issue 3 years ago · 1 comment

Feature Request

If this is a feature request, please fill out the following form in full:

Describe the problem the feature is intended to solve

Running TF Serving at scale often requires debugging latency through the system. When latency metrics are exposed with fixed buckets, granular details can be lost (e.g. overhead due to poor tuning of the system). Providing traces via OpenTelemetry can increase the observability of the system as part of a larger architecture. This observability is especially critical in the ML domain, where payload sizes can be large, and in recommender systems, where RPS is high and target latencies are low.

Describe the solution

Expose traces of the higher-level functions of TF Serving via an open source standard such as OpenTelemetry.

Describe alternatives you've considered

There are no alternatives that would give the same observability without requiring us to maintain a separate fork of this repository. We already use the metrics TF Serving exposes and the TensorBoard profiler. Furthermore, TensorBoard profiling output is not in a format consumable by tracing SaaS products.

Additional context

Related to: https://github.com/tensorflow/serving/issues/1955

salliewalecka avatar Apr 18 '22 19:04 salliewalecka

Is there any update on this request (now >18 months old)? Another Google project, Kubernetes, released OpenTelemetry tracing support over a year ago: https://kubernetes.io/blog/2022/12/01/runtime-observability-opentelemetry/.

It would be nice to know if there are any timelines/plans to add similar support to TensorFlow Serving.

evantorrie avatar Dec 02 '23 02:12 evantorrie