router transient in memory cache

This adds another level of in memory cache, to mitigate issues with unique or infrequent queries pushing frequently used queries out of the in memory cache. If a query has only been seen once, then it is stored inthe transient cache (short term, small number of entries, LRU). If that query is requested again, and is still present in the transient cache, then it is added to the long term, larger in memory cache, to the redis cache as well, and removed from the transient cache. If that query is requested again but not present in the transient cache, then we test the larger in memory cache and redis

Description here

Fixes #issue_number

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

[ ] Changes are compatible[^1]
[ ] Documentation[^2] completed
[ ] Performance impact assessed and acceptable
Tests added and passing[^3]
- [ ] Unit Tests
- [ ] Integration Tests
- [ ] Manual Tests

Exceptions

Note any exceptions here

Notes

[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. [^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples. [^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

Apr 05 '24 13:04 Geal

@Geal, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

Apr 05 '24 13:04 github-actions[bot]

CI performance tests

[ ] reload - Reload test over a long period of time at a constant rate of users
[ ] events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
[ ] events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
[ ] large-request - Stress test with a 1 MB request payload
[x] const - Basic stress test that runs with a constant number of users
[ ] no-graphos - Basic stress test, no GraphOS.
[ ] step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
[ ] events - Stress test for events with a lot of users and deduplication ENABLED
[ ] events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
[ ] events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
[ ] events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
[ ] xxlarge-request - Stress test with 100 MB request payload
[ ] xlarge-request - Stress test with 10 MB request payload
[x] step - Basic stress test that steps up the number of users over time

Apr 05 '24 13:04 router-perf[bot]

We need to do something in this space, but I feel we can do something better when #4796 lands.

I imagine that with the new rust support we could do something like:

(query from client) -> new functionality in 4796 -> (normalised form)

Then use (normalised form) as the key to the query plan.

If the conversion to (normalised form) is expensive we could introduce a smaller cache to hold (query from client) -> (normalised form), but I imagine that shouldn't be required.

Note: Some details (extra items in the key such as metadata) are omitted for brevity, but wouldn't this basic approach work?

Apr 10 '24 10:04 garypen

That's definitely something we could do. And we already have the right cache for that, in query analysis https://github.com/apollographql/router/pull/4796#discussion_r1558999301

Apr 10 '24 13:04 Geal

Note that right now the normalised form is stripping out some important information like alias names and input objects, but I'm planning to add support for including those things as part of PULSR-695.

Apr 11 '24 01:04 bonnici

@bonnici for some context, we recently merged https://github.com/apollographql/router/pull/4883 which uses a hashing scheme for the query that keeps the same hash across schema updates, if the update does not affect the query. Unfortunately, we needed to add back the operation name to the query plan cache key (https://github.com/apollographql/router/pull/4921), because the query planner was returning the usage reporting structure as well, and the operation signature contains the operation name, so with the operation name in the cache key, we would have not reported operations properly. But now, with the usage reporting and operation signature generation happening on the rust side, we will be able to move that task entirely to the query analysis layer (ase we're doing with validation in https://github.com/apollographql/router/pull/4551), which executes much earlier. And in that layer, we could also have a transformation layer to normalize the operation name, extract hardcoded variables, etc, so that we reduce further the number of queries sent to planning. And the client side would see no difference because the responnse formatting is done according to the original query

Apr 11 '24 07:04 Geal

I've converted this to a draft so we can discuss it further, but it won't be on our review backlog for now.

May 08 '24 09:05 abernix

@abernix it was in the review queue because I expected some reviews :/

May 13 '24 08:05 Geal

Before looking too much into the implementation of this, do we still think this is the right solution for this problem?

My comment above still seems applicable. More so, since various changes have landed that make "normalised form" a more realistic prospect.

In general, this feels like a "bigger" decision than code review and could use some more substantial design reviewing before proceeding to code review. Perhaps we could discuss this in the next router architectural working group meeting? One thing we definitely want to discuss is whether we should expose configuration of this feature.

May 13 '24 08:05 garypen

normalization is a much larger issue than this, and not that much related. I agree that it needs further discussion though, I added it to the topics of the meeting

May 13 '24 10:05 Geal

some interesting context: https://blog.jasony.me/system/cache/2023/06/24/fifo-lru

May 30 '24 14:05 Geal

router router copied to clipboard

transient in memory cache

router
router copied to clipboard