aws-sdk-java-v2 icon indicating copy to clipboard operation
aws-sdk-java-v2 copied to clipboard

Collect timeout metrics for call attempts

Open ebautistabar opened this issue 3 years ago • 5 comments

Describe the feature

I propose that the SDK collects timeout metrics for call attempts.

Current docs for available metrics, as reference: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics-list.html

Use Case

The SDK allows users to tweak timeout values. Users can utilize the existing latency metrics to inform their decision about appropriate timeout values in the SDK, but it would be very helpful to have dedicated timeout metrics to see the direct effect of any changes.

Proposed Solution

No response

Other Information

I tried using an ExecutionInterceptor, which offers hooks into different points in the request/response lifecycle, with the intention of creating the timeout metric myself.

It seemed promising but unfortunately it couldn't make it work:

  • response hooks don't run because the timeout is supposed to trigger before the response is available
  • exception hooks only run when the request as a whole fails, giving you the final failure that is thrown, but doesn't run for each underlying call attempt, and also doesn't keep track of all failures that caused each retry

Acknowledgements

  • [ ] I may be able to implement this feature request
  • [ ] This feature might incur a breaking change

AWS Java SDK version used

v2

JDK version used

Corretto 11

Operating System and version

Amazon Linux 2

ebautistabar avatar Jun 09 '22 09:06 ebautistabar

@ebautistabar thank you for reaching out, we've added this to our backlog.

debora-ito avatar Jun 13 '22 19:06 debora-ito

@ebautistabar the team have some follow-up questions:

  • the metric you want to publish is the timeout value? It was not very clear which metric you're talking about.
  • does it work if you use ExecutionInterceptor to publish the value with every request, instead of only publishing when the timeout is triggered or when an exception is thrown?

debora-ito avatar Jun 15 '22 00:06 debora-ito

@debora-ito

Apologies, I didn't explain myself clearly. I do not want the timeout value.

I will explain by following the same format as the tables in https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics-list.html

Metrics collected for each request attempt

  • Metric name: Timeout
  • Description: True if this API call attempt timed out; false if not
  • Type: Boolean
  • Collected by default?: Yes

Now let's go over an example.

Let's say I make a call to DynamoDB which results in three call attempts:

  • The first call attempt fails with a 500
  • The second call attempt fails with a timeout
  • The third call attempt succeeds

The new Timeout metric would have the following values for each call attempt:

  • In the first call attempt, Timeout=false
  • In the second call attempt, Timeout=true
  • In the third call attempt, Timeout=false

This new metric will allow customers to easily graph the amount of timeouts in their dashboards.

ebautistabar avatar Jun 15 '22 07:06 ebautistabar

Got it now, thanks for clarifying!

In that case we don't have plans to add that specific metric to the default set of SDK client-side metrics. We should make it possible for you to publish the metric for yourself using ExecutionInterceptors, though, as it looks like it's not possible right now. We have a feature request in the backlog that might fulfill your use case - https://github.com/aws/aws-sdk-java-v2/issues/929.

debora-ito avatar Jun 17 '22 22:06 debora-ito

@debora-ito

Thanks for your response.

Just to check if I understand correctly: a new method will be added to the ExecutionInterceptor interface with a signature similar to this

void onExecutionAttemptFailure(Context.FailedExecutionAttempt context,
                               ExecutionAttributes executionAttributes)
  • The method will be invoked for each request attempt failure
  • context will contain the exception for the failure of the request attempt
  • executionAttributes will contain some kind of Metrics object into which clients will be able to add their own metrics

Is my interpretation correct? If so, I'm aligned with the approach, and I'm fine with tracking progress on #929

I think that adding a new hook for request attempt failures is a good idea regardless, brings more value and has broader appeal than just adding a new metric.

Follow up questions:

  • do you document somewhere what types of exception are thrown when there are timeouts?
  • #929 was opened in 2018 and there doesn't seem to be any progress; is there any ETA for when the task will be picked up? Anything I can do to help prioritize it or get it done faster?

Thanks in advance.

ebautistabar avatar Jun 18 '22 13:06 ebautistabar