fhir-works-on-aws-deployment
[Feature Request] Mitigate Head of Line Block in Subscriptions Sends
Is your feature request related to a problem? Please describe.
We're investigating a client report of high latency on subscription sends. We have narrowed a large share of the overall latency down to good endpoints waiting on sends to bad endpoints in the subscriptionsRestHook lambda. For example, if 10 bad endpoints precede a good endpoint in the SQS records array, the send to the good endpoint waits at least the 5s timeout for the bad sends to finish.
We have raised our retry count and visibility timeout for failed sends to make delivery more reliable, which exacerbates the head-of-line blocking we see. However, the problem also exists in the out-of-the-box FWoA implementation.
Describe the solution you'd like
Leverage the ApproximateReceiveCount SQS message attribute to sort the SQS records in ascending order of the number of receives per distinct endpoint. This puts endpoints with a low number of retries at the front of the send array and, conversely, bad endpoints at the end.
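As a rough illustration of the sort described above, here is a minimal sketch. It is not the code from the PR. It assumes each message body is JSON with an `endpoint` field (hypothetical), and it scores an endpoint by the highest ApproximateReceiveCount seen for it in the batch, which is one possible reading of "receives per distinct endpoint". The `SqsRecordLike` type and both function names are made up for this example.

```typescript
// Minimal local shape of an SQS record as delivered to Lambda. Only the
// fields used here are declared; ApproximateReceiveCount arrives as a string.
interface SqsRecordLike {
  messageId: string;
  body: string;
  attributes: { ApproximateReceiveCount: string };
}

// Hypothetical helper: assumes the body is JSON carrying an `endpoint` field.
function endpointOf(record: SqsRecordLike): string {
  const parsed = JSON.parse(record.body) as { endpoint: string };
  return parsed.endpoint;
}

function sortByEndpointReceiveCount(records: SqsRecordLike[]): SqsRecordLike[] {
  // Assumption: an endpoint's score is the highest receive count in the batch.
  const worstByEndpoint = new Map<string, number>();
  for (const record of records) {
    const endpoint = endpointOf(record);
    const count = Number(record.attributes.ApproximateReceiveCount);
    worstByEndpoint.set(endpoint, Math.max(worstByEndpoint.get(endpoint) ?? 0, count));
  }
  const score = (r: SqsRecordLike): number => worstByEndpoint.get(endpointOf(r)) ?? 0;
  // Sort a copy, ascending by score; the sort is stable on current Node.js
  // versions, so records with equal scores keep their original order.
  return [...records].sort((a, b) => score(a) - score(b));
}
```

Scoring per endpoint rather than per message means a first-time notification to an endpoint that is already failing is also pushed to the back of the batch.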
Describe alternatives you've considered
- Increasing the concurrent sends from 10 to all 50
- Keeping a histogram of send statistics per endpoint and using it to prioritize endpoints and/or drive a circuit breaker.
- Implementing a smarter backoff than the linear backoff SQS retries provide.
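To make the last alternative concrete, one option is an exponential delay derived from the receive count. The sketch below only computes the delay; `backoffSeconds` and its default base are hypothetical, and applying the value (for example by changing the message's visibility timeout) is left out. The 43200-second cap matches the 12-hour maximum SQS allows for a visibility timeout.

```typescript
// Exponential backoff sketch: the delay doubles with each receive, up to a cap.
// baseSeconds is an illustrative default, not a value from FWoA.
function backoffSeconds(receiveCount: number, baseSeconds = 30, capSeconds = 43200): number {
  // Treat anything below 1 as the first attempt.
  const attempt = Math.max(1, Math.floor(receiveCount));
  return Math.min(capSeconds, baseSeconds * Math.pow(2, attempt - 1));
}
```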
Additional context
Happy to provide a PR here.
Hi Mike,
Thanks for sharing the details of your investigation into subscription latency! With the sorting strategy, is it possible that bad endpoints never get processed again after failures?
Thanks, Yanyu
This PR doesn't change the behavior of sends to bad endpoints other than most likely ordering them behind good endpoints within a single batch. However, with enough bad subscription notifications in the queue, the currently configured 5s timeout on sends, 10 concurrent sends, and a 10s timeout on the lambda, it's possible that bad endpoints will now be skipped, where previously good endpoints were. For example, if there are 30 bad subscription notifications in the receive batch, the current configuration would process the first 20, 10 at a time, and then the lambda would time out without processing the last 10. The only thing this logic does is most likely move good subscription notifications ahead of the 30 bad ones so we can clear the pipes more efficiently each batch.
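The arithmetic in that example can be checked with a small model. This is a simplified sketch, not the lambda's actual scheduling code: it assumes sends go out in waves of `concurrency`, that a wave finishes when its slowest send finishes, and that no new wave starts once the lambda timeout has elapsed. The function name and the 100ms duration for a good send are made up.

```typescript
// Returns, for each notification, the time (ms) at which its send starts,
// or null if the lambda times out before its wave begins.
function sendStartTimesMs(
  durationsMs: number[],
  concurrency: number,
  lambdaTimeoutMs: number,
): (number | null)[] {
  const starts: (number | null)[] = durationsMs.map((): number | null => null);
  let elapsedMs = 0;
  for (let i = 0; i < durationsMs.length && elapsedMs < lambdaTimeoutMs; i += concurrency) {
    const wave = durationsMs.slice(i, i + concurrency);
    for (let j = 0; j < wave.length; j += 1) {
      starts[i + j] = elapsedMs;
    }
    // The wave occupies the lambda until its slowest send completes or times out.
    elapsedMs += Math.max(...wave);
  }
  return starts;
}

// 30 bad notifications, each hitting the 5s send timeout.
const bad = Array.from({ length: 30 }, () => 5000);
const attempted = sendStartTimesMs(bad, 10, 10000).filter((s) => s !== null).length;
console.log(attempted); // 20: two waves of 10 fit before the 10s lambda timeout

// One good send (assumed 100ms) placed last is never started; placed first it starts at 0.
console.log(sendStartTimesMs([...bad, 100], 10, 10000)[30]); // null
console.log(sendStartTimesMs([100, ...bad], 10, 10000)[0]); // 0
```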
Not sure why this PR isn't linking btw, https://github.com/awslabs/fhir-works-on-aws-deployment/pull/677
FHIR Works on AWS has been moved to maintenance mode. While in maintenance, we will not add any new features to this solution. All security issues should be reported directly to AWS Security at [[email protected]](mailto:[email protected]). If you are new to this solution, we advise you to explore using [HealthLake](https://aws.amazon.com/healthlake), which is our managed service for building FHIR based transactional and analytics applications. You can get started by contacting your AWS Account team. If you are an existing customer of FHIR Works on AWS, and have additional questions or need immediate help, please reach out to [email protected] or contact your AWS Account team.