router icon indicating copy to clipboard operation
router copied to clipboard

Enough functionality to implement adaptive load shedding

Open garypen opened this issue 1 year ago • 2 comments

An investigation into backpressure issues in the router.

Most of the changes are in various plugins to implement backpressure. However, those fixes are not enough to provide useful functionality...

The current implementation of the router create a new pipeline for each connection. This has the unfortunate impact of discarding state which is required for various load impacting layers to work correctly.

This exploration modifies the router to hold a single master pipeline which is clone'd for each connection. This allows the various tower connection limiting layers to work correctly.

I've got a version which works with standard tower layers, commented out here, but I've also got a potentially more interesting version which uses a load shedded based on Little's Law, which is what is active in this code.

Notes:

Modifying the pipeline to be cloneable has generally worked fine, but it has caused issues for the Limit layer. This layer looks "generally problematic" since it appears to make a number of assumptions about what request rejection actually means. I've done some minimal modification to try and make it work win a cloned pipeline, but tests are still failing and I'm not sure it does what it should do.

I also noticed that when implementing backpressure, various mock tests needed to be modified since test rejection happened earlier in the pipeline and a map_result() somewhere isn't triggered. That needs some investigation, but I think it's a small problem to address.

I modified the bridge query planner pool to prevent excessive queueing in this layer. Since I now want to control this by load_shedding before this service is reached, I only want enough channels to support the number of planners.

Summary:

This PR provides a router which will operate with approximately the same performance of the base router, but which controls memory and rejects excess load to prevent "over-commit" by the router. This is a very desirable property.

More testing is required, but this is looking promising so far.

Description here

Fixes #issue_number


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • [ ] Changes are compatible[^1]
  • [ ] Documentation[^2] completed
  • [ ] Performance impact assessed and acceptable
  • Tests added and passing[^3]
    • [ ] Unit Tests
    • [ ] Integration Tests
    • [ ] Manual Tests

Exceptions

Note any exceptions here

Notes

[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. [^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples. [^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

garypen avatar Oct 14 '24 09:10 garypen

✅ Docs Preview Ready

No new or changed pages found.

svc-apollo-docs avatar Oct 14 '24 09:10 svc-apollo-docs

@garypen, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

github-actions[bot] avatar Oct 14 '24 09:10 github-actions[bot]

Closing since most of this work was re-purposed and merged into 2.0.0.

garypen avatar Feb 11 '25 13:02 garypen