Fix cache key hashing algorithm
Fixes https://github.com/apollographql/router/issues/5160
TL;DR: This reintroduces the schema-aware query hashing algorithm, with a configuration option to enable it for query plan reuse during warmup, and some performance and reliability fixes around cache key generation.
Problem
The router implements a query hashing algorithm that takes the schema into account, so that if the parts of the schema relevant to a query do not change across a schema update (for example, a field was added to an unrelated type), the hash does not change. This is useful in two ways:
- when updating the schema, query plans must be regenerated for existing queries. We can use the `supergraph.query_planning.warmed_up_queries` option to pregenerate query plans before switching traffic to the new schema, but that can mean planning a lot of queries, heavy CPU usage, and slow schema updates. The schema-aware hashing algorithm can detect that a schema update does not affect a query, so its query plan can be reused and most of the regeneration avoided. This is the goal of the `experimental_reuse_query_plans` option.
- in entity caching, we hash the query as part of the cache key, but that is not enough: we also need to check that the relevant parts of the schema have not changed. For example, if the type of a field changes, we do not want to keep serving data with the wrong type. At the same time, we do not want to rebuild the entire cache from scratch on every schema update. With this hashing algorithm, most of the entity cache can be reused across schema updates (see the configuration sketch after this list).
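As a rough sketch, the two features above map to configuration like this; `warmed_up_queries` is taken from this PR's description, while the entity cache options are assumptions based on the router's `preview_entity_cache` settings and may not match the released configuration exactly:

```yaml
supergraph:
  query_planning:
    # Plan this many recently used queries against the new schema
    # before switching traffic to it.
    warmed_up_queries: 100

# Assumed shape of the entity caching configuration; check the router
# docs for the exact option names.
preview_entity_cache:
  enabled: true
  subgraph:
    all:
      enabled: true
      redis:
        urls: ["redis://localhost:6379"]
```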
Proposed solution
An earlier version of this algorithm shipped in a router release with significant issues that broke query plan caching, so it was promptly deactivated. This PR reintroduces the algorithm, with `experimental_reuse_query_plans` defaulting to `do_not_reuse`, making its use in the query planner cache optional. The query hash is now generated in all cases and combined with the schema hash (a hash of the schema SDL as bytes) in the cache key, but when `experimental_reuse_query_plans` is set to `reuse`, only the query hash is used. When the option is set to `measure`, the router counts the query plans that could have been reused by comparing old and new plans, in the same way we do when comparing the JS and Rust planners. A sketch of the three modes follows.
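A minimal sketch of the three modes, using the names from the paragraph above (the exact option path is an assumption and may differ in the released configuration):

```yaml
supergraph:
  query_planning:
    # do_not_reuse (default): the cache key combines the schema-aware query
    #                         hash with the schema hash
    # reuse:                  the cache key uses only the schema-aware query
    #                         hash, so plans survive unrelated schema updates
    # measure:                compare old and new query plans and count the
    #                         plans that could have been reused
    experimental_reuse_query_plans: do_not_reuse
```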
It also brings significant improvements to query planner cache handling: prefixes make the Redis keys self-describing, less data is serialized when generating the cache key, and the Rust planner's configuration is now part of the cache key.
TODO:
- [x] ~configuration option to reactivate the schema hash~ the hash is now always generated, and by default the cache key uses both the query hash and the schema hash. If the `experimental_reuse_query_plans` option is enabled, the schema hash is not used
- [x] metric to measure the number of query plans that could have been reused during warmup
- [x] measure how many of the query plans that could have been reused are actually the same
- [x] ~check with the federation version for the in memory cache~ we don't need to check it because a running router instance will only have one federation version
- [x] remove the custom Hash implementation for the caching key
- [x] use prefixes for each part of the redis cache key so they become self describing
- [x] remove JSON serialization
- [x] hash the Rust planner's config once a new version has been released (it implements Hash now)
- [ ] configuration migration for the `experimental_reuse_query_plans` option from a boolean to an enum (see the sketch after this list)
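A sketch of that migration, assuming `true` maps to `reuse` and `false` to `do_not_reuse` (the mapping is not spelled out in this PR):

```yaml
# Before: boolean
supergraph:
  query_planning:
    experimental_reuse_query_plans: true

# After: enum (assumed mapping: true -> reuse, false -> do_not_reuse)
supergraph:
  query_planning:
    experimental_reuse_query_plans: reuse
```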
Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
- [ ] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
- [x] Unit Tests
- [ ] Integration Tests
- [ ] Manual Tests
Exceptions
Note any exceptions here
Notes
[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

[^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples.

[^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.
@Geal, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.
CI performance tests
- [ ] events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
- [ ] reload - Reload test over a long period of time at a constant rate of users
- [ ] large-request - Stress test with a 1 MB request payload
- [ ] events - Stress test for events with a lot of users and deduplication ENABLED
- [x] const - Basic stress test that runs with a constant number of users
- [x] step - Basic stress test that steps up the number of users over time
- [ ] events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
- [ ] demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
- [ ] events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
- [ ] events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
- [ ] no-graphos - Basic stress test, no GraphOS.
- [ ] events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
- [ ] xlarge-request - Stress test with 10 MB request payload
- [ ] xxlarge-request - Stress test with 100 MB request payload
- [ ] step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
- [ ] step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
- [ ] demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
I'm looking at coverage: we're already at 96.94% of lines tested, and I'm adding tests for the remaining ones.
With bf734b92c we are now at 99.40% coverage (1826 lines out of 1837). That should be enough for now.
@IvanGoncharov good call on hashing the interface implementors, this is done in 555e507
Now I am wondering about this //FIXME: hash implemented interfaces? https://github.com/apollographql/router/pull/5255/files#diff-b21dc8758c509f67655b816231a0ac2c2fd978c90b0ced9ce88a52fc74324e12R285
I think we should hash the interfaces implemented by an object, but I don't see any failure case right now. Maybe with things like authorization, where some behaviour needs to match between an interface and its implementors.
✅ Docs Preview Ready
No new or changed pages found.
Now that https://github.com/apollographql/router/pull/6206 is merged into dev, 59d1e2c merges those changes back here.
Converting to draft due to internal discussion of the risks introduced by this PR. We want "query hash stable under incremental schema changes" to be a Router feature, but in its current form it is implemented outside of the query planner, with no mechanism to determine which parts of the schema the query planner actually uses.
For example, adding a new directive that influences query planning without updating this code could produce a hash collision: a schema update changes the query plan, but the hash stays the same.
Closing, as this PR has been stale for a year.