Fix cache key hashing algorithm
Fixes https://github.com/apollographql/router/issues/5160
TL;DR: This reintroduces the schema-aware query hashing algorithm, with a configuration option to enable it for query plan reuse during warmup, and some performance and reliability fixes around cache key generation.
Problem
The router implements a query hashing algorithm that takes the schema into account, so that if the parts of the schema relevant to a query do not change across a schema update (for example, a field was added to an unrelated type), the hash does not change. This is useful in two ways:
- when updating the schema, query plans must be regenerated for existing queries. We can use the `supergraph.query_planning.warmed_up_queries` option to pregenerate query plans before switching traffic to the new schema, but that can mean planning a lot of queries, heavy CPU usage, and slow schema updates. The schema-aware hashing algorithm can detect that a schema update does not affect a query, so its query plan can be reused and most of the regeneration avoided. This is the goal of the `experimental_reuse_query_plans` option.
- in entity caching, we hash the query as part of the cache key, but that is not enough: we also need to check that the relevant parts of the schema have not changed. For example, if the type of a field changes, we do not want to keep serving data with the wrong type. At the same time, we do not want to rebuild the entire cache from scratch on every schema update. With this hashing algorithm, most of the entity cache can be reused across schema updates (see the configuration sketch after this list).
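As a rough sketch, the two features above map to configuration like this; `warmed_up_queries` is taken from this PR's description, while the entity cache options are assumptions based on the router's `preview_entity_cache` settings and may not match the released configuration exactly:

```yaml
supergraph:
  query_planning:
    # Plan this many recently used queries against the new schema
    # before switching traffic to it.
    warmed_up_queries: 100

# Assumed shape of the entity caching configuration; check the router
# docs for the exact option names.
preview_entity_cache:
  enabled: true
  subgraph:
    all:
      enabled: true
      redis:
        urls: ["redis://localhost:6379"]
```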
Proposed solution
An earlier version of this algorithm shipped in a router release with significant issues that broke query plan caching, so it was promptly deactivated. This PR reintroduces the algorithm, with `experimental_reuse_query_plans` defaulting to `do_not_reuse`, making its use in the query planner cache optional. The query hash is now generated in all cases and combined with the schema hash (a hash of the schema SDL as bytes) in the cache key, but when `experimental_reuse_query_plans` is set to `reuse`, only the query hash is used. When the option is set to `measure`, the router counts the query plans that could have been reused by comparing old and new plans, in the same way we do when comparing the JS and Rust planners. A sketch of the three modes follows.
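A minimal sketch of the three modes, using the names from the paragraph above (the exact option path is an assumption and may differ in the released configuration):

```yaml
supergraph:
  query_planning:
    # do_not_reuse (default): the cache key combines the schema-aware query
    #                         hash with the schema hash
    # reuse:                  the cache key uses only the schema-aware query
    #                         hash, so plans survive unrelated schema updates
    # measure:                compare old and new query plans and count the
    #                         plans that could have been reused
    experimental_reuse_query_plans: do_not_reuse
```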
It also brings significant improvements to query planner cache handling: prefixes make the Redis keys self-describing, less data is serialized when generating the cache key, and the Rust planner's configuration is now part of the cache key.
TODO:
- [x] ~configuration option to reactivate the schema hash~ the hash is now always generated, and by default the cache key uses both the query hash and the schema hash. If the `experimental_reuse_query_plans` option is enabled, the schema hash is not used
- [x] metric to measure the number of query plans that could have been reused during warmup
- [x] measure how many of the query plans that could have been reused are actually the same
- [x] ~check with the federation version for the in memory cache~ we don't need to check it because a running router instance will only have one federation version
- [x] remove the custom Hash implementation for the caching key
- [x] use prefixes for each part of the redis cache key so they become self describing
- [x] remove JSON serialization
- [x] hash the Rust planner's config once a new version has been released (it implements Hash now)
- [ ] configuration migration for the `experimental_reuse_query_plans` option from a boolean to an enum (see the sketch after this list)
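A sketch of that migration, assuming `true` maps to `reuse` and `false` to `do_not_reuse` (the mapping is not spelled out in this PR):

```yaml
# Before: boolean
supergraph:
  query_planning:
    experimental_reuse_query_plans: true

# After: enum (assumed mapping: true -> reuse, false -> do_not_reuse)
supergraph:
  query_planning:
    experimental_reuse_query_plans: reuse
```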
Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
- [ ] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
- [x] Unit Tests
- [ ] Integration Tests
- [ ] Manual Tests
Exceptions
Note any exceptions here
Notes
[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

[^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples.

[^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.
@Geal, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.
CI performance tests
- [ ] events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
- [ ] reload - Reload test over a long period of time at a constant rate of users
- [ ] large-request - Stress test with a 1 MB request payload
- [ ] events - Stress test for events with a lot of users and deduplication ENABLED
- [x] const - Basic stress test that runs with a constant number of users
- [x] step - Basic stress test that steps up the number of users over time
- [ ] events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
- [ ] demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
- [ ] events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
- [ ] events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
- [ ] no-graphos - Basic stress test, no GraphOS.
- [ ] events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
- [ ] xlarge-request - Stress test with 10 MB request payload
- [ ] xxlarge-request - Stress test with 100 MB request payload
- [ ] step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
- [ ] step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
- [ ] demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
I'm looking at coverage: we're already at 96.94% of lines tested, and I'm adding tests for the remaining ones.
With bf734b92c we are now at 99.40% coverage (1826 lines out of 1837). That should be enough for now.
@IvanGoncharov good call on hashing the interface implementors, this is done in 555e507
Now I am wondering about this //FIXME: hash implemented interfaces? https://github.com/apollographql/router/pull/5255/files#diff-b21dc8758c509f67655b816231a0ac2c2fd978c90b0ced9ce88a52fc74324e12R285
I think we should hash the interfaces implemented by an object, but I don't see any failure case right now. Maybe with things like authorization, where some behaviour needs to match between an interface and its implementors.
✅ Docs Preview Ready
No new or changed pages found.
Now that https://github.com/apollographql/router/pull/6206 is merged into dev, 59d1e2c merges those changes back here.
Converting to draft due to internal discussion of the risks introduced by this PR. We want "query hash stable under incremental schema changes" to be a Router feature, but in its current form it is implemented outside of the query planner, with no mechanism to determine which parts of the schema the query planner actually uses.
For example, adding a new directive that influences query planning without updating this code could produce a hash collision: a schema update changes the query plan, but the hash stays the same.
Closing, as this PR has been stale for a year.