High query planning latency in v1.35.0 under high load when OTLP tracing is enabled
Hello Apollo Router team, we are in the process of upgrading our Apollo Router from v1.11.0 to v1.35.0. We have also added tracing via OTLP in our router.yaml, using the gRPC protocol to send traces to the Datadog agent, which is configured with a trace_buffer value of 3000. We have observed that while application metrics look healthy below 130k req/min, increasing the load to 140k req/min on a single AWS EC2 instance causes the p99 query planning latency to take a hit and the number of errors to shoot up.
Query planning numbers:
- 130k req/min: 12.6 ms
- 140k req/min: 7.8 s
This happens with no considerable increase in CPU usage, although we observed that memory usage increased considerably under the higher load.
Errors observed at 140k req/min:
Steps to reproduce the behavior:
- Apollo router v1.35.0
- router.yaml config
telemetry:
  instrumentation:
    spans:
      default_attribute_requirement_level: required
  exporters:
    tracing:
      common:
        resource:
          "service.name": "gql-router"
      otlp:
        enabled: true
        endpoint: 0.0.0.0:4317
        protocol: grpc
        batch_processor:
          scheduled_delay: 10s
          max_concurrent_exports: 300
          max_export_batch_size: 3000
          max_export_timeout: 30s
          max_queue_size: 2000000
      propagation:
        datadog: true
    logging:
      common:
        service_name: "gql-router"
      stdout:
        enabled: true
        format:
          # text:
          #   ansi_escape_codes: false
          #   display_filename: false
          #   display_level: true
          #   display_line_number: false
          #   display_target: false
          #   display_thread_id: false
          #   display_thread_name: false
          #   display_timestamp: true
          #   display_resource: true
          #   display_span_list: false
          #   display_current_span: false
          #   display_service_name: true
          #   display_service_namespace: true
          json:
            display_filename: false
            display_level: true
            display_line_number: false
            display_target: false
            display_thread_id: false
            display_thread_name: false
            display_timestamp: true
            display_current_span: true
            display_span_list: false
            display_resource: true
      experimental_when_header:
        - name: debug
          value: 'true'
          headers: true # default: false
          body: true
- Hit load of 130k req/min and then 140k req/min (Numbers might vary for your application)
- Look at query planning latency, memory and CPU usage
Expected behavior
The query planning process should not show spikes in latency over a period of time.
Desktop (please complete the following information):
- AWS EC2 instance
- 16 GB RAM
- 8 vCPUs
Hi @Saurabh-Dream11. It's possible that in the transition from 130k to 140k req/min you have exceeded the size of the query planner cache, and that is causing more planning to occur. The amount of additional CPU consumed might not be enough to detect in the monitoring, but it would directly impact latency. I would also expect this to result in higher memory consumption (due to the nature of query planning memory consumption).
We can test this theory by examining the following metrics:
- apollo_router_query_planning_time
- apollo_router_cache_hit_count
- apollo_router_cache_miss_count
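For reference, those counters can be scraped if the Prometheus metrics exporter is enabled. This is only a minimal sketch; the listen address and path below are example values, not recommendations:
telemetry:
  exporters:
    metrics:
      prometheus:
        enabled: true
        listen: 0.0.0.0:9090  # example listen address, adjust for your environment
        path: /metrics        # example path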
If we find that this is the cause of the problem, we can increase the size of the query planning cache using config.
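For example, the cache size can be raised along these lines (a minimal sketch; the limit value is just a placeholder to show where the setting lives, not a recommendation):
supergraph:
  query_planning:
    experimental_cache:
      in_memory:
        limit: 1000  # placeholder value; tune for your workload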
Let us know how you get on.
Thanks @garypen for the above solution.
We are currently on router v1.35.0, so we tried using experimental_cache.
Increasing experimental_cache.in_memory.limit to 10000 didn't help.
However, increasing the warmed_up_queries config to 100 did help: the service was able to sustain 140k req/min, but when we scaled to 150k req/min the issue occurred again.
We are also seeing a dip in the apollo_router_cache_hit_count value as soon as the load increases beyond 140k req/min.
Hi @Saurabh-Dream11
I wouldn't expect setting warmed_up_queries to make any difference in your situation unless the schema is changing frequently at the same time as the router is executing. Reason: warm-up helps when a router is starting or a schema is changing and queries need to be planned or re-planned.
I would have expected increasing the QP limit to have some effect though, so maybe the root cause is something different.
If you remove the OTLP tracing, does that impact the error rate or the performance of the router?
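A quick way to run that test is to switch the existing OTLP exporter off rather than removing the whole tracing block (a sketch mirroring the config shown above):
telemetry:
  exporters:
    tracing:
      otlp:
        enabled: false  # everything else can stay as-is for the test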
Yes. If we move back to the native Datadog tracing config, we don't encounter such errors.
It's difficult to configure the OTLP tracing parameters precisely, and it's important to get them right, particularly when working the router harder.
There is some explanation of the parameters in the opentelemetry crate, but without precise knowledge of your deployment goals it's very difficult for me to tell you exactly the correct values. Here are some experimental figures that I've found to work well for high throughput:
telemetry:
  apollo:
    batch_processor:
      # default 5s
      scheduled_delay: 10s
      # default 1
      max_concurrent_exports: 20
      # default 512
      max_export_batch_size: 2048
      max_export_timeout: 30s
      # default 2048
      max_queue_size: 8192
I can see that you've already spent some time tuning these values, so you may well understand this for your environment already; in that case it's really a matter of tweaking until things work well for you.
@garypen Configuring the batch processor doesn't have any effect on the query planning time in our high-load setup. From what we see, the query planning time is the only bottleneck here: it exceeds the 8s request timeout set in the traffic_shaping configuration.
Also, we are hitting only 8 distinct queries in our load setup, so increasing the cache limit causes no significant change. Correct me if I missed something.
@garypen When we use the native Datadog config (without OTLP) for tracing in router.yaml on v1.35.0, we experience similar behaviour in query planning latency, but at 165k req/min. Max CPU usage was below 60%.
router.yaml
rhai:
  scripts: "./rhai_scripts"
  main: "header.rhai"
plugins:
  graphql_router.super_graph:
    message: "starting supergraph plugin"
supergraph:
  listen: 0.0.0.0:7777
  path: "/*"
  introspection: ${env.IS_LOWER_ENV}
sandbox:
  enabled: ${env.IS_LOWER_ENV}
homepage:
  enabled: false
health_check:
  listen: 0.0.0.0:7777
  enabled: true
include_subgraph_errors:
  all: true # Propagate errors from all subgraphs
headers:
  all:
    request:
      - propagate:
          matching: .*
telemetry:
  exporters:
    tracing:
      common:
        resource:
          "service.name": "gql-router"
        # Optional. Either a float between 0 and 1, or 'always_on' or 'always_off'
        sampler: ${env.TRACING_SAMPLING_RATE}
        # Optional. Use a parent-based sampler. This lets remote spans help decide whether a span is sampled.
        # https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#parentbased
        parent_based_sampler: false
        # Optional limits
        # max_attributes_per_event: 10
        # max_attributes_per_link: 10
        # max_attributes_per_span: 10
        # max_events_per_span: 10
        # max_links_per_span: 10
        # Attributes particular to an exporter that have not
        # been explicitly handled in Router configuration.
        # attributes:
        #   some.config.attribute: "config value"
      propagation:
        # https://www.w3.org/TR/baggage/
        baggage: false
        # https://www.datadoghq.com/
        datadog: true
        # https://www.jaegertracing.io/ (compliant with opentracing)
        jaeger: false
        # https://www.w3.org/TR/trace-context/
        trace_context: false
        # https://zipkin.io/ (compliant with opentracing)
        zipkin: false
        # If you have your own way to generate a trace id and you want to pass it via a custom request header
        # request:
        #   header_name: my-trace-id
      datadog:
        endpoint: default
        enabled: true
        batch_processor:
          scheduled_delay: 10s
          max_concurrent_exports: 300
          max_export_batch_size: 10000
          max_export_timeout: 30s
          max_queue_size: 2000000
traffic_shaping:
  deduplicate_variables: true # Enable the variable deduplication optimization.
  router: # Rules applied to requests from clients to the router
    # global_rate_limit: # Accept a maximum of 10 requests per 5 secs. Excess requests must be rejected.
    #   capacity: 10
    #   interval: 5s # Must not be greater than 18_446_744_073_709_551_615 milliseconds and not less than 0 milliseconds
    timeout: 8s # If a request to the router takes more than 8 secs, cancel the request (30 sec by default)
  all:
    deduplicate_query: true # Enable query deduplication for all subgraphs.
    timeout: 7s
    # compression: gzip # Enable gzip compression for all subgraphs.
  # subgraphs: # Rules applied to requests from the router to individual subgraphs
  #   products:
  #     deduplicate_query: false # Disable query deduplication for the products subgraph.
  #     compression: gzip # Enable gzip compression only for the products subgraph.
  #     global_rate_limit: # Accept a maximum of 10 requests per 5 secs from the router. Excess requests must be rejected.
  #       capacity: 10
  #       interval: 5s # Must not be greater than 18_446_744_073_709_551_615 milliseconds and not less than 0 milliseconds
  #     timeout: 50s # If a request to the subgraph 'products' takes more than 50 secs, cancel the request (30 sec by default)
  # ...
Hi team, we used the following router.yaml config (without any tracing configuration), yet we observed similar behaviour: latency spikes at a threshold of 160k RPM. Attached are screen-grabs of load and latency on a c5.2xlarge EC2 instance.
rhai:
  scripts: "./rhai_scripts"
  main: "header.rhai"
plugins:
  graphql_router.super_graph:
    message: "starting supergraph plugin"
supergraph:
  query_planning:
    warmed_up_queries: 1000
    experimental_cache:
      in_memory:
        limit: 100000
  listen: 0.0.0.0:7777
  path: "/*"
  introspection: ${env.IS_LOWER_ENV}
sandbox:
  enabled: ${env.IS_LOWER_ENV}
homepage:
  enabled: false
health_check:
  listen: 0.0.0.0:7777
  enabled: true
include_subgraph_errors:
  all: true # Propagate errors from all subgraphs
headers:
  all:
    request:
      - propagate:
          matching: .*
traffic_shaping:
  deduplicate_variables: true # Enable the variable deduplication optimization.
  router: # Rules applied to requests from clients to the router
    # global_rate_limit: # Accept a maximum of 10 requests per 5 secs. Excess requests must be rejected.
    #   capacity: 10
    #   interval: 5s # Must not be greater than 18_446_744_073_709_551_615 milliseconds and not less than 0 milliseconds
    timeout: 8s # If a request to the router takes more than 8 secs, cancel the request (30 sec by default)
  all:
    deduplicate_query: true # Enable query deduplication for all subgraphs.
    timeout: 7s
    # compression: gzip # Enable gzip compression for all subgraphs.
  # subgraphs: # Rules applied to requests from the router to individual subgraphs
  #   products:
  #     deduplicate_query: false # Disable query deduplication for the products subgraph.
  #     compression: gzip # Enable gzip compression only for the products subgraph.
  #     global_rate_limit: # Accept a maximum of 10 requests per 5 secs from the router. Excess requests must be rejected.
  #       capacity: 10
  #       interval: 5s # Must not be greater than 18_446_744_073_709_551_615 milliseconds and not less than 0 milliseconds
  #     timeout: 50s # If a request to the subgraph 'products' takes more than 50 secs, cancel the request (30 sec by default)
[Screenshot: p99 latency at 160k RPM]
[Screenshot: 160k RPM throughput on ALB]
[Screenshot: max CPU usage]
@garypen