
Router 2.2.1 throws this telemetry error, which is unknown

riginoommen opened this issue 6 months ago • 7 comments

2025-05-07T15:12:01.905848Z ERROR resource{service.namespace="rh-graphql",service.version="2.2.1-rc.1",service.name="rhg-router",process.executable.name="router",} tokio-runtime-worker ThreadId(10) apollo_router::plugins::telemetry::error_handler: apollo-router/src/plugins/telemetry/error_handler.rs:92 OpenTelemetry metric error occurred: Metrics exporter otlp failed with the grpc server returns error (Unknown error): , detailed error message: h2 protocol error: http2 error tonic::transport::Error(Transport, hyper::Error(Http2, Error { kind: GoAway(b"", FRAME_SIZE_ERROR, Library) }))

https://www.apollographql.com/docs/graphos/routing/errors

FRAME_SIZE_ERROR is not documented among the errors listed there.

The Router emits this telemetry log with version 2.2.1.

riginoommen avatar May 07 '25 15:05 riginoommen

You didn't specify if you saw this with Router 2.2.0, or any other 2.x version, but I don't think this is in any way specific to Router 2.2.1.

To understand what's going on here, though, we'd need to know specifically what your configuration is, particularly the OTLP metrics configuration: what is your metrics OTLP endpoint string? You can redact the hostname, but knowing the full value you have set would be useful (these live under telemetry.exporters in your config). From there, our next question will likely be what the telemetry endpoint itself is configured to be.

Overall, FRAME_SIZE_ERROR is an HTTP/2 error, and I'm fairly sure it's being returned by whatever endpoint you have configured there, not by the Router; you're just seeing it surface in the Router's logs. We don't document every HTTP/2 error, but it's most likely a protocol negotiation failure where the endpoint can't complete the handshake. Various resources come up in a search, but I think we should zoom in on the value of the endpoint and whether you need to specify the correct protocol in your configuration. OpenTelemetry has changed its conventions over the years around how ports 4317 (gRPC) and 4318 (HTTP) are used, so this may be something that once worked and then broke as the OpenTelemetry standard settled, which has happened during the lifetime of the Router.
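
For illustration, a minimal sketch of the two matched pairs, using a placeholder hostname (the conventional OTLP defaults are 4317 for gRPC and 4318 for HTTP):

telemetry:
  exporters:
    metrics:
      otlp:
        enabled: true
        # OTLP over gRPC: pair protocol grpc with the collector's 4317 port
        endpoint: http://my-collector:4317
        protocol: grpc
        # OTLP over HTTP would instead pair protocol http with port 4318:
        # endpoint: http://my-collector:4318
        # protocol: http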

abernix avatar May 08 '25 11:05 abernix

telemetry:
  exporters:
    metrics:
      otlp:
        enabled: true
        endpoint: "${env.OTEL_EXPORTER_OTLP_ENDPOINT}"
        protocol: http
      common:
        service_name: "${env.OTEL_SERVICE_NAME}"
        service_namespace: rh-graphql
        resource:
          service.name: "${env.OTEL_SERVICE_NAME}"
          service_namespace: rh-graphql
    logging:
      common:
        service_name: "${env.OTEL_SERVICE_NAME}"
        service_namespace: rh-graphql
        resource:
          service.name: "${env.OTEL_SERVICE_NAME}"
          service_namespace: rh-graphql
      stdout:
        enabled: true
        format:
          text:
            ansi_escape_codes: true
            display_current_span: true
            display_filename: true
            display_level: true
            display_line_number: true
            display_resource: true
            display_service_name: true
            display_service_namespace: true
            display_span_id: true
            display_trace_id: true
            display_span_list: true
            display_target: true
            display_thread_id: true
            display_thread_name: true
            display_timestamp: true
        tty_format:
          text:
            ansi_escape_codes: true
            display_current_span: true
            display_filename: true
            display_level: true
            display_line_number: true
            display_resource: true
            display_service_name: true
            display_service_namespace: true
            display_span_id: true
            display_trace_id: true
            display_span_list: true
            display_target: true
            display_thread_id: true
            display_thread_name: true
            display_timestamp: true

    tracing:
      common:
        service_name: "${env.OTEL_SERVICE_NAME}"
      otlp:
        enabled: true
        endpoint: "${env.OTEL_EXPORTER_OTLP_ENDPOINT}"
        protocol: http

      experimental_response_trace_id:
        enabled: true
        format: hexadecimal
        header_name: rhg-trace-id
  apollo:
    buffer_size: 10000
    client_name_header: apollographql-client-name
    client_version_header: apollographql-client-version
    endpoint: "https://usage-reporting.api.apollographql.com/api/ingress/traces"
    experimental_local_field_metrics: false
    experimental_otlp_endpoint: "https://usage-reporting.api.apollographql.com/"
    send_variable_values: none
    send_headers:
      except:
        - Authorization

  instrumentation:
    spans:
      ## Set the mode for spans to be specification compliant.
      mode: spec_compliant
      default_attribute_requirement_level: required
      supergraph:
        attributes:
          cost.result:
            alias: rhg-query-cost-result
          cost.actual:
            alias: rhg-query-cost-actual
          cost.estimated:
            alias: rhg-query-cost-estimated
          cost.delta:
            alias: rhg-query-cost-delta
          graphql.document:
            alias: rhg-graphql-query
          graphql.operation.name:
            alias: rhg-graphql-operation-name
          graphql.operation.type:
            alias: rhg-graphql-operation-type
      router:
        attributes:
          baggage: true
      subgraph:
        attributes:
          subgraph.name: true
          subgraph.graphql.document: true
          subgraph.graphql.operation.name: true
          subgraph.graphql.operation.type: true
          http.request.resend_count: true

    instruments:
      default_requirement_level: required
      router:
        http.server.active_requests: true
        http.server.request.body.size: true
        http.server.request.duration: true
        http.server.response.body.size: true
      subgraph:
        http.client.request.body.size: true
        http.client.request.duration: true
        http.client.response.body.size: true
      connector:
        http.client.request.body.size: true
        http.client.request.duration: true
        http.client.response.body.size: true
      graphql:
        field.execution: true
        list.length: true
      cache:
        apollo.router.operations.entity.cache: true
      supergraph:
        cost.actual: true
        cost.delta: true
        cost.estimated: true

    events:
      supergraph:
        COST_ACTUAL_TOO_EXPENSIVE:
          message: "cost actual is high"
          on: event_response
          level: error
          condition:
            gt:
              - cost: actual
              - 250000
          attributes:
            graphql.operation.name: true
            cost.actual: true

This is my telemetry configuration.

riginoommen avatar May 14 '25 09:05 riginoommen

With the 1.x versions, everything works fine.

riginoommen avatar May 21 '25 06:05 riginoommen

What does OTEL_EXPORTER_OTLP_ENDPOINT look like? What's the scheme (is it specified, or is it missing)? What's the port? What's the path?

Can you provide a Docker Compose setup in a repo?

abernix avatar May 22 '25 20:05 abernix

docker-compose file

version: '3.8'

services:
  # Jaeger service for OpenTelemetry tracing
  jaeger:
    image: jaegertracing/all-in-one:1.35
    container_name: jaeger
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC receiver
      - "4318:4318"    # OTLP HTTP receiver
      - "5778:5778"    # Configuration API
      - "9411:9411"    # Zipkin compatible endpoint
    networks:
      - apollo-network

  # Apollo Router service
  apollo-router:
    build:
      context: .  # Build the Apollo Router from the Dockerfile in the current directory
      dockerfile: Dockerfile
    container_name: apollo-router
    ports:
      - "4000:4000"  # Apollo Router default port
    env_file:
      - docker.env
    depends_on:
      - jaeger  # Ensure Jaeger starts before the Apollo Router
    networks:
      - apollo-network

# Define a custom network for communication between services
networks:
  apollo-network:
    driver: bridge

riginoommen avatar May 23 '25 02:05 riginoommen

OTEL_EXPORTER_OTLP_GRPC_ENDPOINT=http://localhost:4317

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

riginoommen avatar May 26 '25 21:05 riginoommen
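
One thing worth checking with the compose file above: if docker.env is what supplies these variables to the apollo-router container, then localhost inside that container points back at the Router itself, not at Jaeger. On the compose network the Jaeger service is reachable by its service name, so a sketch of docker.env under that assumption would be:

# docker.env (sketch; the jaeger hostname assumes the compose network above)
OTEL_EXPORTER_OTLP_GRPC_ENDPOINT=http://jaeger:4317
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318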

When OTEL_EXPORTER_OTLP_GRPC_ENDPOINT is set to http://localhost:4317/ (for OTLP/gRPC trace collection) and OTEL_EXPORTER_OTLP_ENDPOINT is also set to http://localhost:4318/, the errors appear. They occur even though the OTEL_EXPORTER_OTLP_ENDPOINT variable is not actively used for data export.

For us, in our OpenShift environments, OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_GRPC_ENDPOINT are set by default by IT. The Router reads both environment variables by default and throws the error.

riginoommen avatar Jun 04 '25 13:06 riginoommen
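
One way to sidestep the ambient variables is to pin each exporter to an explicit, matched endpoint and protocol in the Router configuration, rather than letting the generic OTEL_EXPORTER_OTLP_ENDPOINT value flow in. A minimal sketch, assuming OTEL_EXPORTER_OTLP_GRPC_ENDPOINT points at a collector's gRPC (4317) port:

telemetry:
  exporters:
    metrics:
      otlp:
        enabled: true
        # explicit, matched pair: gRPC endpoint with protocol grpc
        endpoint: "${env.OTEL_EXPORTER_OTLP_GRPC_ENDPOINT}"
        protocol: grpc
    tracing:
      otlp:
        enabled: true
        endpoint: "${env.OTEL_EXPORTER_OTLP_GRPC_ENDPOINT}"
        protocol: grpc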

I have the following error:

OpenTelemetry metric error occurred: Metrics exporter otlp failed with the grpc server returns error (Unknown error): , detailed error message: transport error tonic::transport::Error(Transport, hyper::Error(Io, Kind(ConnectionReset)))

Exporting works fine, though. I wonder if it is possible to configure logging to skip this error. What should I use to do that? I tried RUST_LOG=tonic::transport/ConnectionReset, but it doesn't seem to work.

jfrog-pipelie-intg avatar Jun 25 '25 15:06 jfrog-pipelie-intg
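
On the RUST_LOG question: the directive tonic::transport/ConnectionReset is not a form the tracing EnvFilter syntax understands; directives take the shape target=level and cannot match a specific error kind. Assuming the Router honors EnvFilter-style directives here, the closest approximation is to silence the module that emits the message, which the log line at the top of this issue identifies as apollo_router::plugins::telemetry::error_handler. Note that this suppresses every OpenTelemetry error logged from that target, not just ConnectionReset:

# sketch: keep info-level logging overall, but turn off the telemetry error-handler target
RUST_LOG="info,apollo_router::plugins::telemetry::error_handler=off"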