fhir-works-on-aws-deployment icon indicating copy to clipboard operation
fhir-works-on-aws-deployment copied to clipboard

[Misc] definition of SubscriptionEndToEndLatency metric

Open Zambonilli opened this issue 2 years ago • 4 comments

What's on your mind? We're investigating a complaint about the overall latency of subscriptions by our clients. In investigating the latency, we looked at the SubscriptionEndToEndLatency custom metric being reported by FWoA to cloudwatch. We see that this metric is the fhir resource's lastUpdated snapshot value to the current time right before the send. Which is good, however our private forks value for this metric is skewed really high and became unusable because we have increased our retries and visibility timeouts to give more reliability and the failing sends out weigh the successful sends.

So, is this metric meant to be the milliseconds from write to 1st send attempt or milliseconds from write to successful send? In addition, is there a reason that tenantId and/or endpoint are not included in the dimensions to be able to see the latency broken down a bit?

Versions (please complete the following information): v2.5.1-smart

Zambonilli avatar Aug 17 '22 15:08 Zambonilli

Hi Mike,

Thanks for reaching out!

The metric is meant to be the milliseconds from write to the 1st send attempt, as the system has no control over the external HTTP endpoint latency. Are the metrics behaving in other unexpected ways?

Theoretically the latency should not be affected by tenantid or endpoint, so it was not considered when designing this metric. It shouldn't be difficult to include both in the metric, do you see the the need to include them?

Thanks, Yanyu

Bingjiling avatar Aug 19 '22 16:08 Bingjiling

Thanks, makes sense that this would be from write to 1st send attempt. Also, makes sense why tenantId and subscription are not dimensions since there's no differing infrastructure or code on those values.

We're seeing that this metric is being updated on subsequent receives of the same subscription notification and artificially ramping this metric. For example, we've updated our private fork to have 5m visibility timeout and max receives out long enough for 72h. So, a bad subscription notification updates this metric each 5m and skews the metric upwards.

It should be easy enough to send this metric only after checking the approximate receive count to see if it's the 1st receive. I can get a PR ready here.

Zambonilli avatar Aug 19 '22 17:08 Zambonilli

Actually, I'm going to wait until the result of https://github.com/awslabs/fhir-works-on-aws-deployment/issues/676 before starting a PR for this one because there is control flow changes in the https://github.com/awslabs/fhir-works-on-aws-deployment/issues/676 that are working with SQSEvent as opposed to SubscriptionNotification

Zambonilli avatar Aug 19 '22 18:08 Zambonilli

Thanks for the detailed explanation! And looking forward to you PR addressing this issue.

Bingjiling avatar Aug 22 '22 14:08 Bingjiling

FHIR Works on AWS has been moved to maintenance mode. While in maintenance, we will not add any new features to this solution. All security issues should be reported directly to AWS Security at [[email protected]] (mailto:[email protected]). If you are new to this solution, we advise you to explore using [HealthLake] (https://aws.amazon.com/healthlake), which is our managed service for building FHIR based transactional and analytics applications. You can get started by contacting your AWS Account team. If you are an existing customer of FHIR Works on AWS, and have additional questions or need immediate help, please reach out to [email protected] or contact your AWS Account team.

nisankep avatar Apr 03 '23 22:04 nisankep