
tracing: Transition to OpenTelemetry from OpenTracing and OpenCensus

moderation opened this issue 5 years ago • 60 comments

Title: Plan to transition to OpenTelemetry

Description: Envoy currently supports OpenTracing (OTr) and OpenCensus (OC) [0]. In May 2019 it was announced that these projects were merging into the new OpenTelemetry (OTel) project [1]. The original plan was to move the legacy project repos to read-only by the end of 2019. That hasn't happened, but according to the OTel maintainers they are aiming for beta releases in March [2].

Should Envoy:

  • keep OTr and OC and add OTel (eventually deprecating the legacy projects)?
  • replace OTr and OC with OTel?

https://github.com/envoyproxy/envoy/pull/9955 is planning on adding to the OC capability. The OC service / agent repo says it's in maintenance mode and points to OTel/opentelemetry-collector.

Relevant Links:

  • [0]: config.trace.v3.Tracing.Http
  • [1]: OpenTracing, OpenCensus Merge into a Single New Project, OpenTelemetry
  • [2]: OpenTelemetry Monthly Update: January 2020

moderation avatar Feb 07 '20 02:02 moderation

This probably also depends on the C++ OTel library, so that it can be used in Envoy. https://github.com/open-telemetry/opentelemetry-cpp is still in active development and has no releases yet.

yvespp avatar Aug 06 '20 12:08 yvespp

Getting closer - RC1 - https://opensource.googleblog.com/2020/10/opentelemetrys-first-release-candidates.html

moderation avatar Oct 23 '20 02:10 moderation

Announced at re:invent December 2020 - AWS Distro for OpenTelemetry - https://aws-otel.github.io/

moderation avatar Dec 03 '20 17:12 moderation

As a member of the OpenTelemetry community, I thought I would share that the OpenTelemetry Tracing specification 1.0.1 has been released.

Long term support includes:

  • API: 3 year support guarantee
  • Plugin Interfaces: 1 year support guarantee
  • Constructors: 1 year support guarantee

Announcement blog: https://medium.com/opentelemetry/opentelemetry-specification-v1-0-0-tracing-edition-72dd08936978

Link to specifications: https://github.com/open-telemetry/opentelemetry-specification

gramidt avatar Mar 18 '21 16:03 gramidt

@mattklein123 - Thoughts on next steps for this? I'm happy to coordinate with members of the OpenTelemetry community to help with the implementation.

gramidt avatar Mar 22 '21 19:03 gramidt

@gramidt in general the tracing work tends to be driven by vendors and those with a vested interest. We are more than happy to see this work done so please go for it once resources are found! Thank you.

mattklein123 avatar Mar 22 '21 19:03 mattklein123

Sounds like a plan, @mattklein123! Thank you for the prompt response.

gramidt avatar Mar 22 '21 19:03 gramidt

Is there any progress on this issue?

CoderPoet avatar May 05 '21 06:05 CoderPoet

Hi @CoderPoet -

I'm not aware of any progress here. Is this an immediate priority for you/your team?

gramidt avatar May 06 '21 12:05 gramidt

@gramidt Did you kick off any conversations to coordinate any work for this? We are interested to see OTLP support from Envoy at least for tracing data so we can position the Otel collector in the export path.

rakyll avatar Jun 24 '21 21:06 rakyll

One more thing...

There are two main areas when it comes to OpenTelemetry support:

  1. Exporting OTLP spans from Envoy
  2. Instrumenting Envoy with OpenTelemetry and removing OpenTracing

We can tackle (1) and (2) independently. (2) is possible without having to reinstrument Envoy by linking the OpenTelemetry <-> OpenTracing bridge. Eventually, we can get to (2) to remove the OpenTracing dependency. I think Envoy users can benefit from (1) immediately, and we should prioritize it.

rakyll avatar Jun 28 '21 20:06 rakyll

@rakyll - I have had some conversations both internally and externally in the community, but no progress has been made that I am aware of. A recent message from an employee of Adobe says that they may have someone who would be interested in making the contribution for exporting OTLP spans from Envoy.

gramidt avatar Jun 29 '21 19:06 gramidt

Has it been considered to break up (2) above into an OpenTracing shim vs. a full native OpenTelemetry implementation? One problem that would solve is context propagation in environments that currently use only TraceContext (or composite propagators that don't include B3), without having to wait for full OpenTelemetry instrumentation. One could export spans via OTLP (1), but if context propagation still depends on B3 headers, which are becoming less common in new environments, the full value of tracing will not be achieved.

danielgblanco avatar Jul 02 '21 11:07 danielgblanco

Hi @gramidt,

Has there been any update on this in regards to resourcing the work? We also have a similar interest in the topic, for the same reasons as @rakyll, and were wondering if there is any planned roadmap / expected timeline at this stage.

Thanks!

Tenaria avatar Sep 16 '21 04:09 Tenaria

@Tenaria - Sadly, I have not heard of any progress on this.

gramidt avatar Sep 16 '21 12:09 gramidt

@gramidt sorry for a naive question: do we intend to use https://github.com/open-telemetry/opentelemetry-cpp? Last time I checked, it takes a similar approach to the current OC implementation. If we only export OTLP spans, would that mean converting OC spans to OTLP?

dio avatar Sep 16 '21 20:09 dio

Hi @gramidt, I'd be interested in taking on this work!

I've been discussing the potential approach with Harvey (@htuch), and it looks like it may make sense to add a Tracer that uses the OpenTelemetry protos and C++ API. The only potential issue is the stability of the C++ SDK, and depending on how stable it is (I see there was a 1.0.0 release recently), it may make sense to use Envoy's gRPC service instead of the SDK for exporting (similar to what @itamarkam did for the OpenTelemetry logger extension in https://github.com/envoyproxy/envoy/pull/15105).
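
For reference, the logger from that PR sends data to the collector through Envoy's gRPC client rather than the SDK; its configuration looks roughly like the fragment below (a sketch from memory, with a hypothetical cluster name opentelemetry_collector, so double-check the PR/docs for the exact field names):

access_log:
  - name: envoy.access_loggers.open_telemetry
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.open_telemetry.v3.OpenTelemetryAccessLogConfig
      common_config:
        log_name: otel_access_log
        transport_api_version: V3
        grpc_service:
          envoy_grpc:
            cluster_name: opentelemetry_collector  # cluster pointing at an OTLP/gRPC collector
      body:
        string_value: "%REQ(:METHOD)% %REQ(:PATH)% %RESPONSE_CODE%"

A tracer built the same way would reuse the OTLP protos and Envoy's async gRPC client for export instead of depending on the SDK's stability.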

Either way, still interested in tackling this :)

AlexanderEllis avatar Sep 21 '21 21:09 AlexanderEllis

Hi all! I guess there hasn't been any movement here on adding OpenTelemetry support. For some reason I thought it could be enabled as a dynamic_ot tracer similar to https://github.com/jaegertracing/jaeger-client-cpp, but I don't think anybody built one.

Is it safe to say that there is no way to send traces to an OpenTelemetry collector at this point?

inquire avatar Nov 15 '21 15:11 inquire

@inquire I don't have the documentation at my fingertips, but OTel receivers can be configured to receive traces/metrics/logs from a variety of different sources, including OpenTracing, Jaeger, and others in addition to OTel's own OTLP. So it's not too hard to emit OpenTracing and send it to an endpoint that's actually OTel.
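
For example, a minimal collector pipeline along these lines accepts Zipkin (which Envoy's built-in zipkin tracer emits) and forwards OTLP to a backend; the backend endpoint here is just a placeholder:

receivers:
  zipkin:
    endpoint: 0.0.0.0:9411
processors:
  batch:
exporters:
  otlp:
    endpoint: my-tracing-backend:4317  # placeholder address
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [zipkin]
      processors: [batch]
      exporters: [otlp]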

kevincantu avatar Nov 15 '21 18:11 kevincantu

@kevincantu is correct. I'm trying to set up my sidecar Envoy proxy 1.20 (AWS) to report to an OTel collector sidecar using ENABLE_ENVOY_JAEGER_TRACING=true on AWS. My collector is set up to receive jaeger/zipkin/otlp (AWS defaults to 127.0.0.1:9411):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
  jaeger:
    protocols:
      grpc:
      thrift_binary:
      thrift_compact:
      thrift_http:
  zipkin:
    
exporters:
  otlp:
    endpoint: ${tempo_sd}:4317
    tls:
      insecure: true
    sending_queue:
      num_consumers: 4
      queue_size: 100
    retry_on_failure:
      enabled: true
  logging:
    loglevel: debug
    sampling_initial: 5
    sampling_thereafter: 200
processors:
  batch:
  memory_limiter:
    # 80% of maximum memory up to 2G
    limit_mib: 400
    # 25% of limit up to 2G
    spike_limit_mib: 100
    check_interval: 5s
extensions:
  zpages: {}
  memory_ballast:
    # Memory Ballast size should be max 1/3 to 1/2 of memory.
    size_mib: 165
service:
  extensions: [zpages, memory_ballast]
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin]
      processors: [memory_limiter, batch]
      exporters: [otlp, logging]

No luck so far; I'll keep trying this and will post an update with a workaround (until OpenTelemetry ships with Envoy).

mrsufgi avatar Nov 29 '21 08:11 mrsufgi

@mrsufgi one problem with the current jaeger tracer is that it doesn't support the W3C trace context so far. Therefore we enabled the OpenCensus tracer, connected to an OpenTelemetry collector that exposes the OpenCensus protocol on port 55678.

istio config:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFormat: >
      [%START_TIME%] "%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%"
      %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION%
      %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% "%REQ(X-FORWARDED-FOR)%" "%REQ(USER-AGENT)%"
      "%REQ(X-REQUEST-ID)%" "%REQ(:AUTHORITY)%" "%UPSTREAM_HOST%" "%REQ(traceparent)%" "%REQ(tracestate)%"\n
    defaultConfig:
      # https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#Tracing
      tracing:
        openCensusAgent:
          address: "dns:opentelemetry-collector.istio-system.svc.cluster.local:55678" #gRPC specific address
          context: # Specifies the set of context propagation headers used for distributed tracing. Default is ["W3C_TRACE_CONTEXT"]. If multiple values are specified, the proxy will attempt to read each header for each request and will write all headers.
            - "W3C_TRACE_CONTEXT"
    enableTracing: true
  values:
    global:
      proxy:
        tracer: openCensusAgent #required to enable the tracer config on the envoy, by default the the zipkin tracer gets used
        resources:
          requests:
            cpu: 10m
            memory: 40Mi
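
On the collector side, the matching piece is just an OpenCensus receiver listening on the same port; a minimal sketch (the downstream otlp exporter endpoint is a placeholder):

receivers:
  opencensus:
    endpoint: 0.0.0.0:55678
processors:
  batch:
exporters:
  otlp:
    endpoint: my-tracing-backend:4317  # placeholder address
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [opencensus]
      processors: [batch]
      exporters: [otlp]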

PatrikSteuer avatar Nov 29 '21 09:11 PatrikSteuer

Cool. Since I'm using AWS Envoy, which doesn't support openCensusAgent out of the box, perhaps I'll need to use vanilla envoyproxy or extend the AWS Envoy image using ENVOY_TRACING_CFG_FILE and set it up. I'll share my results.

mrsufgi avatar Nov 29 '21 15:11 mrsufgi

I managed to make it work with OpenCensus and an OpenTelemetry agent configured to receive OpenCensus and forward OpenTelemetry. This also connects the traces and passes the context around. Here are my config files.

Config working with envoy 1.20

static_resources:
  listeners:

    # This defines Envoy's externally-facing listener port
    - name: "inbound_listener"
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 5000
      filter_chains:
        - filters:
            - name: envoy.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                codec_type: auto
                stat_prefix: ingress_http
                generate_request_id: true
                tracing:
                  custom_tags:
                    - tag: "k8s_deployment_name"
                      environment:
                        name: K8S_DEPLOYMENT_NAME
                        default_value: local
                  provider:
                    name: envoy.tracers.opencensus
                    typed_config:
                      "@type": type.googleapis.com/envoy.config.trace.v3.OpenCensusConfig
                      stdout_exporter_enabled: false
                      ocagent_exporter_enabled: true
                      ocagent_address: localhost:55678
                      incoming_trace_context:
                        - b3
                        - trace_context
                        - grpc_trace_bin
                      outgoing_trace_context:
                        - b3
                        - trace_context

OTEL Agent config used with otel/opentelemetry-collector:0.24.0

receivers:
  otlp:
    protocols:
      grpc:
      http:
  opencensus: # you only need this one 
  jaeger:
    protocols:
      grpc:
      thrift_http:
  zipkin:

exporters:
  otlp:
    endpoint: "opentelemetry-collector:50051"
    insecure: true
  logging:
    loglevel: debug

processors:
  batch:

extensions:
  pprof:
    endpoint: :1777
  zpages:
    endpoint: :55679
  health_check:

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp, opencensus, jaeger, zipkin]
      processors: [batch]
      exporters: [otlp, logging]
    metrics:
      receivers: [otlp, opencensus]
      processors: [batch]
      exporters: [otlp, logging]

Envoy will ship OpenCensus spans to the collector, which will convert them to OpenTelemetry, thus achieving your goal.

inquire avatar Nov 29 '21 16:11 inquire

I didn't add my solution here yet :) I managed to work around the Envoy issues with AWS Fargate and App Mesh!

First I had to create a custom Envoy Docker image that adds the config to the image (note that there are many methods for adding configs without building a custom Docker image):

FROM public.ecr.aws/appmesh/aws-appmesh-envoy:v1.20.0.1-prod

COPY config.yaml /etc/envoy/envoy-tracing-config.yaml

My tracing config for envoy looks like this:

tracing:
  http:
    name: envoy.tracers.opencensus
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v3.OpenCensusConfig
      trace_config:
        max_number_of_attributes: 500
      stdout_exporter_enabled: true
      ocagent_exporter_enabled: true
      ocagent_address: 0.0.0.0:55678
      incoming_trace_context:
        - trace_context
        - grpc_trace_bin
        - b3
      outgoing_trace_context:
        - trace_context
        - b3

My otel-collector config is accepting opencensus/jaeger and otlp:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
  jaeger:
    protocols:
      grpc:
      thrift_binary:
      thrift_compact:
      thrift_http:
  zipkin:
  opencensus:
    endpoint: 0.0.0.0:55678

exporters:
  otlp:
    endpoint: scribe-dev-tempo.scribe-dev-app-sd:4317
    tls:
      insecure: true
    sending_queue:
      num_consumers: 4
      queue_size: 100
    retry_on_failure:
      enabled: true
  logging:
    loglevel: debug
    sampling_initial: 5
    sampling_thereafter: 200
processors:
  batch:
  memory_limiter:
    # 80% of maximum memory up to 2G
    limit_mib: 400
    # 25% of limit up to 2G
    spike_limit_mib: 100
    check_interval: 5s
extensions:
  zpages: {}
  memory_ballast:
    # Memory Ballast size should be max 1/3 to 1/2 of memory.
    size_mib: 165
service:
  extensions: [zpages, memory_ballast]
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin, opencensus]
      processors: [memory_limiter, batch]
      exporters: [otlp, logging]

(pretty sure I can remove my zipkin)

Just note that my Envoy sidecar, my otel-collector sidecar, and my own API service all run in the same service. Also, in Fargate you will need to add the ENVOY_TRACING_CFG_FILE env variable to your Envoy task definition.

In order for everything to play together I'm using the B3 headers since I'm also using Traefik (which doesn't support trace-context).

Hope it helps as a workaround for now :honey_pot:

mrsufgi avatar Dec 06 '21 09:12 mrsufgi

@gramidt the "help wanted" label was removed, but I'm not sure anybody is working on it. Could you give a progress update?

verysonglaa avatar Dec 21 '21 08:12 verysonglaa

I've been working on adding OpenTelemetry tracing and can give a quick update. I currently have a rough draft of adding a new tracer that sends OTLP traces via Envoy's async gRPC client with configurable batching (similar to how the Zipkin tracer works), and I'm planning on wrapping it up and sending a PR after the holidays in early/mid January.
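
As a rough sketch, the user-facing configuration would look something like this (the extension name and fields are provisional until the PR lands; the cluster name is just an example):

tracing:
  http:
    name: envoy.tracers.opentelemetry
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v3.OpenTelemetryConfig
      grpc_service:
        envoy_grpc:
          cluster_name: opentelemetry_collector  # cluster pointing at an OTLP/gRPC collector
        timeout: 0.250s
      service_name: front-envoy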

AlexanderEllis avatar Dec 21 '21 15:12 AlexanderEllis

Hello 👋 just wondering if there's been any update on this since the last comment 👀? Thanks!

Tenaria avatar Feb 01 '22 23:02 Tenaria

Exciting update! Hi @AlexanderEllis, what's the benefit of your proposed approach versus the existing Envoy OpenCensus → OpenTelemetry collector approach?

ydzhou avatar Feb 02 '22 23:02 ydzhou

@ydzhou Hey!

From observation, OC is missing a few features needed to be fully compatible with OTel. The most noticeable ones I've found so far are:

  • Missing instrumentation library name. This is important for differentiating instrumentations that follow the same semantic convention. For example, if I had HTTP tracing instrumented for my application as well as for Envoy, the only way to differentiate them under the OTel convention is the instrumentation library field; otherwise, in theory, the fields would look the same.
  • Missing span status, which indicates whether the span is in error, OK, or unset. This provides a simpler experience, since users can filter directly on span status to identify spans in error, rather than having to know which kinds of spans count as errors and filter on those specifically, e.g. spans with a 5xx status code.

I'm sure there might be a few other differences, but these were the two major ones that came to mind during my experience with using/seeing OC.

I'm not entirely sure about the implementation by @AlexanderEllis given I'm not working on it with them, but I assume that if it is going to implement OTel natively, then it will address these issues.

Tenaria avatar Feb 07 '22 21:02 Tenaria