router
router copied to clipboard
transient in memory cache
This adds another level of in memory cache, to mitigate issues with unique or infrequent queries pushing frequently used queries out of the in memory cache. If a query has only been seen once, then it is stored inthe transient cache (short term, small number of entries, LRU). If that query is requested again, and is still present in the transient cache, then it is added to the long term, larger in memory cache, to the redis cache as well, and removed from the transient cache. If that query is requested again but not present in the transient cache, then we test the larger in memory cache and redis
Description here
Fixes #issue_number
Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
- [ ] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
- [ ] Unit Tests
- [ ] Integration Tests
- [ ] Manual Tests
Exceptions
Note any exceptions here
Notes
[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. [^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples. [^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.
@Geal, please consider creating a changeset entry in /.changesets/
. These instructions describe the process and tooling.
CI performance tests
- [ ] reload - Reload test over a long period of time at a constant rate of users
- [ ] events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
- [ ] events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
- [ ] large-request - Stress test with a 1 MB request payload
- [x] const - Basic stress test that runs with a constant number of users
- [ ] no-graphos - Basic stress test, no GraphOS.
- [ ] step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
- [ ] events - Stress test for events with a lot of users and deduplication ENABLED
- [ ] events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
- [ ] events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
- [ ] events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
- [ ] xxlarge-request - Stress test with 100 MB request payload
- [ ] xlarge-request - Stress test with 10 MB request payload
- [x] step - Basic stress test that steps up the number of users over time
We need to do something in this space, but I feel we can do something better when #4796 lands.
I imagine that with the new rust support we could do something like:
(query from client) -> new functionality in 4796 -> (normalised form)
Then use (normalised form) as the key to the query plan.
If the conversion to (normalised form) is expensive we could introduce a smaller cache to hold (query from client) -> (normalised form), but I imagine that shouldn't be required.
Note: Some details (extra items in the key such as metadata) are omitted for brevity, but wouldn't this basic approach work?
That's definitely something we could do. And we already have the right cache for that, in query analysis https://github.com/apollographql/router/pull/4796#discussion_r1558999301
Note that right now the normalised form is stripping out some important information like alias names and input objects, but I'm planning to add support for including those things as part of PULSR-695.
@bonnici for some context, we recently merged https://github.com/apollographql/router/pull/4883 which uses a hashing scheme for the query that keeps the same hash across schema updates, if the update does not affect the query. Unfortunately, we needed to add back the operation name to the query plan cache key (https://github.com/apollographql/router/pull/4921), because the query planner was returning the usage reporting structure as well, and the operation signature contains the operation name, so with the operation name in the cache key, we would have not reported operations properly. But now, with the usage reporting and operation signature generation happening on the rust side, we will be able to move that task entirely to the query analysis layer (ase we're doing with validation in https://github.com/apollographql/router/pull/4551), which executes much earlier. And in that layer, we could also have a transformation layer to normalize the operation name, extract hardcoded variables, etc, so that we reduce further the number of queries sent to planning. And the client side would see no difference because the responnse formatting is done according to the original query
I've converted this to a draft so we can discuss it further, but it won't be on our review backlog for now.
@abernix it was in the review queue because I expected some reviews :/
Before looking too much into the implementation of this, do we still think this is the right solution for this problem?
My comment above still seems applicable. More so, since various changes have landed that make "normalised form" a more realistic prospect.
In general, this feels like a "bigger" decision than code review and could use some more substantial design reviewing before proceeding to code review. Perhaps we could discuss this in the next router architectural working group meeting? One thing we definitely want to discuss is whether we should expose configuration of this feature.
normalization is a much larger issue than this, and not that much related. I agree that it needs further discussion though, I added it to the topics of the meeting
some interesting context: https://blog.jasony.me/system/cache/2023/06/24/fifo-lru