Collect timeout metrics for call attempts
Describe the feature
I propose that the SDK collects timeout metrics for call attempts.
Current docs for available metrics, as reference: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics-list.html
Use Case
The SDK allows users to tweak timeout values. Users can utilize the existing latency metrics to inform their decision about appropriate timeout values in the SDK, but it would be very helpful to have dedicated timeout metrics to see the direct effect of any changes.
Proposed Solution
No response
Other Information
I tried using an ExecutionInterceptor, which offers hooks into different points in the request/response lifecycle, with the intention of creating the timeout metric myself.
It seemed promising but unfortunately it couldn't make it work:
- response hooks don't run because the timeout is supposed to trigger before the response is available
- exception hooks only run when the request as a whole fails, giving you the final failure that is thrown, but doesn't run for each underlying call attempt, and also doesn't keep track of all failures that caused each retry
Acknowledgements
- [ ] I may be able to implement this feature request
- [ ] This feature might incur a breaking change
AWS Java SDK version used
v2
JDK version used
Corretto 11
Operating System and version
Amazon Linux 2
@ebautistabar thank you for reaching out, we've added this to our backlog.
@ebautistabar the team have some follow-up questions:
- the metric you want to publish is the timeout value? It was not very clear which metric you're talking about.
- does it work if you use ExecutionInterceptor to publish the value with every request, instead of only publishing when the timeout is triggered or when an exception is thrown?
@debora-ito
Apologies, I didn't explain myself clearly. I do not want the timeout value.
I will explain by following the same format as the tables in https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics-list.html
Metrics collected for each request attempt
- Metric name: Timeout
- Description: True if this API call attempt timed out; false if not
- Type: Boolean
- Collected by default?: Yes
Now let's go over an example.
Let's say I make a call to DynamoDB which results in three call attempts:
- The first call attempt fails with a 500
- The second call attempt fails with a timeout
- The third call attempt succeeds
The new Timeout metric would have the following values for each call attempt:
- In the first call attempt, Timeout=false
- In the second call attempt, Timeout=true
- In the third call attempt, Timeout=false
This new metric will allow customers to easily graph the amount of timeouts in their dashboards.
Got it now, thanks for clarifying!
In that case we don't have plans to add that specific metric to the default set of SDK client-side metrics. We should make it possible for you to publish the metric for yourself using ExecutionInterceptors, though, as it looks like it's not possible right now. We have a feature request in the backlog that might fulfill your use case - https://github.com/aws/aws-sdk-java-v2/issues/929.
@debora-ito
Thanks for your response.
Just to check if I understand correctly: a new method will be added to the ExecutionInterceptor interface with a signature similar to this
void onExecutionAttemptFailure(Context.FailedExecutionAttempt context,
ExecutionAttributes executionAttributes)
- The method will be invoked for each request attempt failure
-
contextwill contain the exception for the failure of the request attempt -
executionAttributeswill contain some kind of Metrics object into which clients will be able to add their own metrics
Is my interpretation correct? If so, I'm aligned with the approach, and I'm fine with tracking progress on #929
I think that adding a new hook for request attempt failures is a good idea regardless, brings more value and has broader appeal than just adding a new metric.
Follow up questions:
- do you document somewhere what types of exception are thrown when there are timeouts?
- #929 was opened in 2018 and there doesn't seem to be any progress; is there any ETA for when the task will be picked up? Anything I can do to help prioritize it or get it done faster?
Thanks in advance.