
High query planning latency in v1.35.0 under high load when OTLP tracing is enabled

Open Saurabh-Dream11 opened this issue 1 year ago • 8 comments

Hello Apollo Router team, we are in the process of upgrading our Apollo Router from v1.11.0 to v1.35.0. We have also added OTLP tracing in our router.yaml, which uses the gRPC protocol to send traces to the Datadog agent (configured with a trace_buffer value of 3000). We have observed that while the application runs smoothly at <130k req/min, once the load on a single AWS EC2 instance increases to 140k req/min, the p99 latency for query planning takes a hit and the number of errors also shoots up.

Query planning latency:

  • 130k req/min: 12.6 ms
  • 140k req/min: 7.8 s

This happens with no considerable increase in CPU usage, although we have observed that memory usage increases considerably under the higher load.

image

Error in case of 140k req/min

image memory for 140k req/min

image memory for 130k req/min

image CPU for 130k req/min

image CPU for 140k req/min

Steps to reproduce the behavior:

  1. Apollo router v1.35.0
  2. router.yaml config
telemetry:
  instrumentation:
    spans:
      default_attribute_requirement_level: required
  exporters:
    tracing:
      common:
        resource:
          "service.name": "gql-router"
      otlp:
        enabled: true
        endpoint: 0.0.0.0:4317
        protocol: grpc
        batch_processor:
          scheduled_delay: 10s
          max_concurrent_exports: 300
          max_export_batch_size: 3000
          max_export_timeout: 30s
          max_queue_size: 2000000
      propagation:
        datadog: true
    logging:
      common:
        service_name: "gql-router"
      stdout:
        enabled: true
        format:
          # text:
          #   ansi_escape_codes: false
          #   display_filename: false
          #   display_level: true
          #   display_line_number: false
          #   display_target: false
          #   display_thread_id: false
          #   display_thread_name: false
          #   display_timestamp: true
          #   display_resource: true
          #   display_span_list: false
          #   display_current_span: false
          #   display_service_name: true
          #   display_service_namespace: true
          json:
            display_filename: false
            display_level: true
            display_line_number: false
            display_target: false
            display_thread_id: false
            display_thread_name: false
            display_timestamp: true
            display_current_span: true
            display_span_list: false
            display_resource: true
      experimental_when_header:
        - name: debug
          value: 'true'
          headers: true # default: false
          body: true
  3. Hit a load of 130k req/min and then 140k req/min (numbers might vary for your application)
  4. Look at the query planning latency, memory, and CPU usage

Expected behavior: The query planning process should not show latency spikes over time.

Desktop (please complete the following information):

  • AWS EC2 instance
  • 16 GB RAM
  • 8 vCPUs

Saurabh-Dream11 avatar Feb 16 '24 08:02 Saurabh-Dream11

Hi @Saurabh-Dream11. It's possible that in the transition from 130k -> 140k req/min you have exceeded the size of the query planner (QP) cache, and that this is causing more planning to occur. The amount of additional CPU consumed might not be enough to detect in the monitoring, but it would directly impact latency. I would also expect this to result in higher memory consumption (due to the nature of query planning memory consumption).

We can test this theory by examining the following metrics:

  • apollo_router_query_planning_time
  • apollo_router_cache_hit_count
  • apollo_router_cache_miss_count

If we find that this may be the cause of the problem, we can increase the size of the query planning cache using config.
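
For reference, here is a minimal sketch of what that could look like in router.yaml, using the supergraph.query_planning.experimental_cache keys available in v1.35.0 (the limit value below is only illustrative and should be sized to the number of distinct operations you serve):

supergraph:
  query_planning:
    experimental_cache:
      in_memory:
        # Maximum number of query plans held in the in-memory cache.
        # Illustrative value; tune it to your workload.
        limit: 10000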

Let us know how you get on.

garypen avatar Feb 16 '24 08:02 garypen

Thanks @garypen for the above solution.

We are currently on router v1.35.0, so we tried using experimental_cache. Increasing the value of experimental_cache.in_memory.limit to 10000 didn't help.

Increasing the warmed_up_queries config to 100 did help, though: the service was able to sustain 140k req/min, but as we scaled to 150k req/min the issues occurred again.

We are also seeing a dip in the apollo_router_cache_hit_count value as soon as the load increases beyond 140k req/min.

image

Saurabh-Dream11 avatar Feb 19 '24 12:02 Saurabh-Dream11

Hi @Saurabh-Dream11

I wouldn't expect setting warmed_up_queries to make any difference in your situation unless the schema is changing frequently while the router is executing. Reason: warm-up helps when a router is starting or a schema is changing and queries need to be planned or re-planned.

I would have expected increasing the QP limit to have some effect though, so maybe the root cause is something different.

If you remove the OTLP tracing, does that impact the error rate or the performance of the router?

garypen avatar Feb 19 '24 13:02 garypen

Yes. If we move back to the native Datadog tracing config, we don't encounter such errors.

Saurabh-Dream11 avatar Feb 20 '24 06:02 Saurabh-Dream11

It's difficult to configure OTLP tracing parameters precisely, and it's important to get them right, particularly when working the router harder.

There is some explanation of the parameters in the opentelemetry crate, but without precise knowledge of your deployment goals it's very difficult for me to tell you exactly the correct values. Here are some experimental figures that I've found to work well for high throughput:

telemetry:
  apollo:
    batch_processor:
      # default 5s
      scheduled_delay: 10s
      # default 1
      max_concurrent_exports: 20
      # default 512
      max_export_batch_size: 2048
      max_export_timeout: 30s
      # default 2048
      max_queue_size: 8192

I can see that you've spent some time configuring your values already, so it may be that you already understand this well for your environment, in which case I think it is really a matter of tweaking things until they are working well for you.
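
As a point of comparison, here is a sketch of the same figures applied under the OTLP exporter this issue uses (key path taken from the config posted earlier in this thread; these values are a starting point to verify under load, not a recommendation for this specific deployment):

telemetry:
  exporters:
    tracing:
      otlp:
        enabled: true
        endpoint: 0.0.0.0:4317
        protocol: grpc
        batch_processor:
          # Same figures as the telemetry.apollo example above; only a starting point.
          scheduled_delay: 10s
          max_concurrent_exports: 20
          max_export_batch_size: 2048
          max_export_timeout: 30s
          max_queue_size: 8192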

garypen avatar Feb 20 '24 08:02 garypen

@garypen Configuring the batch processor doesn't have any effect on query planning time in our high-load setup. From what we see, query planning time is the only bottleneck here; it exceeds the 8s request timeout set in the traffic_shaping configuration. Also, we are hitting only 8 distinct queries in our load setup, so increasing the cache limit also causes no significant change. Correct me if I missed something.

Saurabh-Dream11 avatar Feb 21 '24 04:02 Saurabh-Dream11

@garypen When we use the native Datadog tracing config (without OTLP) on v1.35.0 in router.yaml, we experience similar QP latency behaviour, but at 165k req/min. The max CPU usage was <60%.

image

image

image

router.yaml

rhai:
  scripts: "./rhai_scripts"
  main: "header.rhai"
plugins:
  graphql_router.super_graph:
    message: "starting supergraph plugin"
supergraph:
  listen: 0.0.0.0:7777
  path: "/*"
  introspection: ${env.IS_LOWER_ENV}
sandbox:
  enabled: ${env.IS_LOWER_ENV}
homepage:
  enabled: false
health_check:
  listen: 0.0.0.0:7777
  enabled: true
include_subgraph_errors:
  all: true # Propagate errors from all subgraphs
headers:
  all:
    request:
      - propagate:
          matching: .*
telemetry:
  exporters:
    tracing:
      common:
        resource:
          "service.name": "gql-router"
      # Optional. Either a float between 0 and 1 or 'always_on' or 'always_off'
        sampler: ${env.TRACING_SAMPLING_RATE}
      # Optional. Use a parent-based sampler. This lets remote spans help make a decision on whether a span is sampled or not.
      # https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#parentbased
        parent_based_sampler: false
      # Optional limits
      # max_attributes_per_event: 10
      # max_attributes_per_link: 10
      # max_attributes_per_span: 10
      # max_events_per_span: 10
      # max_links_per_span: 10
      # Attributes particular to an exporter that have not
      # been explicitly handled in Router configuration.
      #attributes:
      #  some.config.attribute: "config value"
      propagation:
        # https://www.w3.org/TR/baggage/
        baggage: false
        # https://www.datadoghq.com/
        datadog: true
        # https://www.jaegertracing.io/ (compliant with opentracing)
        jaeger: false
        # https://www.w3.org/TR/trace-context/
        trace_context: false
        # https://zipkin.io/ (compliant with opentracing)
        zipkin: false
        # If you have your own way to generate a trace id and you want to pass it via a custom request header
        #request:
        #  header_name: my-trace-id
      datadog:
        endpoint: default
        enabled: true
        batch_processor:
          scheduled_delay: 10s
          max_concurrent_exports: 300
          max_export_batch_size: 10000
          max_export_timeout: 30s
          max_queue_size: 2000000

traffic_shaping:
  deduplicate_variables: true # Enable the variable deduplication optimization.
  router: # Rules applied to requests from clients to the router
    # global_rate_limit: # Accept a maximum of 10 requests per 5 secs. Excess requests must be rejected.
    #   capacity: 10
    #   interval: 5s # Must not be greater than 18_446_744_073_709_551_615 milliseconds and not less than 0 milliseconds
    timeout: 8s # If a request to the router takes more than 8secs then cancel the request (30 sec by default)
  all:
    deduplicate_query: true # Enable query deduplication for all subgraphs.
    timeout: 7s
    # compression: gzip # Enable gzip compression for all subgraphs.
  # subgraphs: # Rules applied to requests from the router to individual subgraphs
  #   products:
  #     deduplicate_query: false # Disable query deduplication for the products subgraph.
  #     compression: gzip # Enable gzip compression only for the products subgraph.
  #     global_rate_limit: # Accept a maximum of 10 requests per 5 secs from the router. Excess requests must be rejected.
  #       capacity: 10
  #       interval: 5s # Must not be greater than 18_446_744_073_709_551_615 milliseconds and not less than 0 milliseconds
  #     timeout: 50s # If a request to the subgraph 'products' takes more than 50secs then cancel the request (30 sec by default)
  # ...

Saurabh-Dream11 avatar Feb 21 '24 12:02 Saurabh-Dream11

Hi team, we used the following config for router.yaml (without any tracing configuration), yet we observed similar behaviour, with latency spiking at a threshold of 160k RPM. Attached are the screen-grabs for load and latency on a c5.2xlarge EC2 instance.

rhai:
  scripts: "./rhai_scripts"
  main: "header.rhai"
plugins:
  graphql_router.super_graph:
    message: "starting supergraph plugin"
supergraph:
  query_planning:
    warmed_up_queries: 1000
    experimental_cache:
      in_memory:
        limit: 100000
  listen: 0.0.0.0:7777
  path: "/*"
  introspection: ${env.IS_LOWER_ENV}
sandbox:
  enabled: ${env.IS_LOWER_ENV}
homepage:
  enabled: false
health_check:
  listen: 0.0.0.0:7777
  enabled: true
include_subgraph_errors:
  all: true # Propagate errors from all subgraphs
headers:
  all:
    request:
      - propagate:
          matching: .*
 
traffic_shaping:
  deduplicate_variables: true # Enable the variable deduplication optimization.
  router: # Rules applied to requests from clients to the router
    # global_rate_limit: # Accept a maximum of 10 requests per 5 secs. Excess requests must be rejected.
    #   capacity: 10
    #   interval: 5s # Must not be greater than 18_446_744_073_709_551_615 milliseconds and not less than 0 milliseconds
    timeout: 8s # If a request to the router takes more than 8secs then cancel the request (30 sec by default)
  all:
    deduplicate_query: true # Enable query deduplication for all subgraphs.
    timeout: 7s
    # compression: gzip # Enable gzip compression for all subgraphs.
  # subgraphs: # Rules applied to requests from the router to individual subgraphs
  #   products:
  #     deduplicate_query: false # Disable query deduplication for the products subgraph.
  #     compression: gzip # Enable gzip compression only for the products subgraph.
  #     global_rate_limit: # Accept a maximum of 10 requests per 5 secs from the router. Excess requests must be rejected.
  #       capacity: 10
  #       interval: 5s # Must not be greater than 18_446_744_073_709_551_615 milliseconds and not less than 0 milliseconds
  #     timeout: 50s # If a request to the subgraph 'products' takes more than 50secs then cancel the request (30 sec by default)

image p99 Latency for 160k RPM

image 160k RPM throughput on ALB

image Max CPU usage

@garypen

Saurabh-Dream11 avatar Feb 26 '24 10:02 Saurabh-Dream11