aws-sdk-java-v2

ApiCallTimeoutException fails to include exception of last failure; loses root cause and so breaks recovery logic of applications.

Open steveloughran opened this issue 1 year ago • 11 comments

Describe the bug

We see this when the S3 client is trying to use the S3Express CreateSession call and is configured with so many retries (10) that the overall call times out before the retry limit is reached. Rather than including the underlying exception that triggered the retries, a simpler "call timed out" exception is raised, with only a suppressed exception of "java.lang.RuntimeException: Task failed."

I believe this is a regression from the v1 SDK.

Expected Behavior

ApiCallTimeoutException should include, as its cause, the exception that triggered the retries internally.

Current Behavior

See HADOOP-19000 for this surfacing when connecting to S3Express buckets.

Stack trace on timeouts contains no root cause information.

Caused by: software.amazon.awssdk.core.exception.ApiCallTimeoutException: Client execution did not complete before the specified timeout configuration: 10000 millis
        at software.amazon.awssdk.core.exception.ApiCallTimeoutException$BuilderImpl.build(ApiCallTimeoutException.java:97)
        at software.amazon.awssdk.core.exception.ApiCallTimeoutException.create(ApiCallTimeoutException.java:38)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.generateApiCallTimeoutException(ApiCallTimeoutTrackingStage.java:151)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.handleInterruptedException(ApiCallTimeoutTrackingStage.java:139)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.translatePipelineException(ApiCallTimeoutTrackingStage.java:107)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:62)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
        at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
        at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
        at software.amazon.awssdk.services.s3.DefaultS3Client.createSession(DefaultS3Client.java:1589)
        at software.amazon.awssdk.services.s3.S3Client.createSession(S3Client.java:2505)
        at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.getCredentials(S3ExpressIdentityCache.java:88)
        at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.lambda$getCachedCredentials$0(S3ExpressIdentityCache.java:73)
        at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.refreshResult(CachedS3ExpressCredentials.java:91)
        at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.lambda$new$0(CachedS3ExpressCredentials.java:70)
        at software.amazon.awssdk.utils.cache.CachedSupplier.lambda$jitteredPrefetchValueSupplier$8(CachedSupplier.java:300)
        at software.amazon.awssdk.utils.cache.NonBlocking.fetch(NonBlocking.java:151)
        at software.amazon.awssdk.utils.cache.CachedSupplier.refreshCache(CachedSupplier.java:208)
        at software.amazon.awssdk.utils.cache.CachedSupplier.get(CachedSupplier.java:135)
        at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.get(CachedS3ExpressCredentials.java:85)
        at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.get(S3ExpressIdentityCache.java:61)
        at software.amazon.awssdk.services.s3.internal.s3express.DefaultS3ExpressIdentityProvider.lambda$resolveIdentity$0(DefaultS3ExpressIdentityProvider.java:56)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
        at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
        at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
        at software.amazon.awssdk.services.s3.internal.s3express.DefaultS3ExpressIdentityProvider.resolveIdentity(DefaultS3ExpressIdentityProvider.java:49)
        at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.trySelectAuthScheme(S3AuthSchemeInterceptor.java:142)
        at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.selectAuthScheme(S3AuthSchemeInterceptor.java:81)
        at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.beforeExecution(S3AuthSchemeInterceptor.java:61)
        at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.lambda$beforeExecution$1(ExecutionInterceptorChain.java:62)
        at java.util.ArrayList.forEach(ArrayList.java:1259)
        at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.beforeExecution(ExecutionInterceptorChain.java:62)
        at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.runInitialInterceptors(AwsExecutionContextBuilder.java:239)
        at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.invokeInterceptorsAndCreateExecutionContext(AwsExecutionContextBuilder.java:130)
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.invokeInterceptorsAndCreateExecutionContext(AwsSyncClientHandler.java:67)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:76)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
        at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
        at software.amazon.awssdk.services.s3.DefaultS3Client.listObjectsV2(DefaultS3Client.java:7323)
        at software.amazon.awssdk.services.s3.DelegatingS3Client.lambda$listObjectsV2$63(DelegatingS3Client.java:5856)
        at software.amazon.awssdk.services.s3.internal.crossregion.S3CrossRegionSyncClient.invokeOperation(S3CrossRegionSyncClient.java:73)
        at software.amazon.awssdk.services.s3.DelegatingS3Client.listObjectsV2(DelegatingS3Client.java:5856)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listObjects$13(S3AFileSystem.java:2963)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
        ... 8 more
        Suppressed: java.lang.RuntimeException: Task failed.
                at software.amazon.awssdk.utils.CompletableFutureUtils.joinLikeSync(CompletableFutureUtils.java:254)
                at software.amazon.awssdk.auth.signer.AwsSignerExecutionAttribute.awsCredentialsReadMapping(AwsSignerExecutionAttribute.java:201)
                at software.amazon.awssdk.core.interceptor.ExecutionAttribute$DerivationValueStorage.get(ExecutionAttribute.java:260)
                at software.amazon.awssdk.core.interceptor.ExecutionAttributes.getAttribute(ExecutionAttributes.java:53)
                at software.amazon.awssdk.core.interceptor.ExecutionAttributes.getOptionalAttribute(ExecutionAttributes.java:68)
                at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.invokeInterceptorsAndCreateExecutionContext(AwsExecutionContextBuilder.java:144)
                ... 22 more

Reproduction Steps

  • configure more retries than can complete within the API call timeout
  • try an operation which fails with a retryable exception (connectivity etc.); see the sketch after this list
  • wait for the failure
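
A minimal reproduction sketch under those assumptions: more retry attempts are allowed than can complete within the overall API call timeout, and the endpoint, bucket name and timeout values are made up for illustration.

import java.net.URI;
import java.time.Duration;
import software.amazon.awssdk.core.retry.RetryMode;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class TimeoutReproSketch {
    public static void main(String[] args) {
        // A client whose overall API call timeout (10s) is shorter than what 10 retried
        // connection failures take, so the timeout always fires before the retries run out.
        S3Client s3 = S3Client.builder()
                .region(Region.US_EAST_1)
                .endpointOverride(URI.create("https://unreachable.example.invalid")) // hypothetical dead endpoint
                .overrideConfiguration(c -> c
                        .apiCallTimeout(Duration.ofSeconds(10))
                        .retryPolicy(RetryPolicy.builder(RetryMode.STANDARD)
                                                .numRetries(10)
                                                .build()))
                .build();

        // Every attempt fails with a retryable connectivity error; the call ends with an
        // ApiCallTimeoutException whose connectivity root cause is not attached.
        s3.listObjectsV2(b -> b.bucket("example-bucket"));
    }
}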

Possible Solution

We really need that underlying exception for our own decision-making about what to do next. I fear we are going to have to change the S3A retry policies so that the first failure of any S3 operation gets special handling, on the basis that it is a configuration problem that retries will not recover from. That will add the overhead of a needless S3 call. Would it be possible to do something like save the innermost exception and add it as the root cause when throwing an ApiCallTimeoutException?

Additional Information/Context

No response

AWS Java SDK version used

2.21.33

JDK version used

openjdk version "1.8.0_362"
OpenJDK Runtime Environment (Zulu 8.68.0.21-CA-macos-aarch64) (build 1.8.0_362-b09)
OpenJDK 64-Bit Server VM (Zulu 8.68.0.21-CA-macos-aarch64) (build 25.362-b09, mixed mode)

Operating System and version

macOS 13.4.1

steveloughran avatar Dec 04 '23 13:12 steveloughran

@steveloughran acknowledged. We'll think of a way to surface the underlying exception.

debora-ito avatar Jan 09 '24 02:01 debora-ito

@steveloughran after the change released in https://github.com/aws/aws-sdk-java-v2/issues/4779, do you still see this issue?

The idea is that since the timeouts are propagated now, this should be mitigated.

debora-ito avatar Jan 24 '24 19:01 debora-ito

I'd still like the reason for internal retry failures to be passed up. For example:

  • the client retries on all IOEs
  • the endpoint is generating UnknownHostException
  • which is retried
  • the timeout exception will only say "timeout"; the inner cause is lost

We deal with that in our code by having a very limited set of retries. If it weren't for the transfer manager I'd set the SDK retry count to 0 and handle it all ourselves (see the sketch below).
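
For reference, a sketch of what "SDK retry count 0" looks like with the public override configuration; the application-side retry loop that would wrap the calls is assumed and not shown.

import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.services.s3.S3Client;

public class NoSdkRetriesSketch {
    public static void main(String[] args) {
        // RetryPolicy.none() disables the SDK's internal retries, so the first failure surfaces
        // directly to the caller with its real cause intact; the application then decides
        // whether and how to retry.
        S3Client s3 = S3Client.builder()
                              .overrideConfiguration(c -> c.retryPolicy(RetryPolicy.none()))
                              .build();
    }
}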

steveloughran avatar Jan 25 '24 22:01 steveloughran

This could be really helpful for us. We hit the 60 second timeout (even for tiny files in S3), and still have the same issue with a 120 second timeout.

We don't know the root cause, and it seems like #4057 - random failures that cause all future S3 interactions to fail.

I can't share a reproducible snippet unfortunately.

james-s-w-clark avatar Apr 08 '24 16:04 james-s-w-clark

@IdiosApps if the HTTP connection is stale/dead and gets returned into the pool, you can get recurrent failures. Check your TTLs, and if your read() gets errors, abort() the connection (see the sketch below).
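
A rough sketch of both suggestions, assuming the Apache HTTP client is in use; the TTL value, bucket and key names are illustrative only.

import java.io.IOException;
import java.time.Duration;
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class StaleConnectionSketch {
    public static void main(String[] args) throws IOException {
        // Bound how long a pooled connection may live, so stale/dead connections age out.
        S3Client s3 = S3Client.builder()
                              .httpClientBuilder(ApacheHttpClient.builder()
                                                                 .connectionTimeToLive(Duration.ofMinutes(1)))
                              .build();

        ResponseInputStream<GetObjectResponse> in =
                s3.getObject(b -> b.bucket("example-bucket").key("example-key"));
        try {
            in.read(new byte[8192]);
        } catch (IOException e) {
            // On a read error, abort() rather than close(): the broken connection is discarded
            // instead of being handed back to the pool for reuse.
            in.abort();
            throw e;
        }
        in.close();
    }
}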

steveloughran avatar Apr 11 '24 19:04 steveloughran

Not to say this is the root cause of the problem (there's probably a bug in the code), but have you checked the throughput parameter in the client? It might be slowing things down by default.

pilardi avatar Apr 11 '24 19:04 pilardi

Our problems weren't with the throughput manager; we saw it for unknown host exceptions, e.g. trying to use FIPS in a region without it.

steveloughran avatar Apr 11 '24 19:04 steveloughran

@debora-ito we are still hitting this sometimes: if the number of retries * duration of an attempt > timeout, then a timeout exception is thrown with no information about the underlying failure.

I can see from the (deep) SDK stack traces and our attempts to debug things that the new execution pipeline is a fairly sophisticated piece of wrapped and decorated functional invocations. Such a design is, to those of us who know our functional programming, something whose elegance is to be appreciated. However, it is not handling failure reporting very well. The exception raised in the last attempt to talk to a service needs to be saved somewhere, so that when the timeout is reached and ApiCallTimeoutException is raised, the underlying cause of the problem is actually communicated.

I have no immediate plans to submit PRs to the SDK. I will note that the solution consistent with the current design would be to pass an error-reporter function down to the executor (a sketch follows the list below):

  1. Add a function to RequestExecutionContext which is invoked whenever an exception is caught and retried.
  2. ApiCallAttemptTimeoutTrackingStage provides an implementation of this which saves the reported exception.
  3. ApiCallTimeoutTrackingStage.generateApiCallTimeoutException uses that saved exception as the inner cause when it generates the ApiCallTimeoutException.
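
To illustrate the shape of that proposal only (this is not a patch: the stage names above come from the stack trace, while the recorder below and its wiring into the pipeline are hypothetical), the core idea is "record the last attempt's failure, attach it as the cause of the timeout exception":

import java.util.concurrent.atomic.AtomicReference;
import software.amazon.awssdk.core.exception.ApiCallTimeoutException;

// Hypothetical recorder that steps 1 and 2 above would wire into the request pipeline.
public class LastFailureRecorderSketch {
    private final AtomicReference<Throwable> lastAttemptFailure = new AtomicReference<>();

    // Steps 1 and 2: each failed, retried attempt reports its exception here.
    public void recordAttemptFailure(Throwable t) {
        lastAttemptFailure.set(t);
    }

    // Step 3: when the overall API call timeout fires, build the timeout exception with the
    // recorded failure attached as its cause, so callers no longer lose the root cause.
    public ApiCallTimeoutException buildTimeoutException(long timeoutMillis) {
        return ApiCallTimeoutException.builder()
                .message("Client execution did not complete before the specified timeout configuration: "
                         + timeoutMillis + " millis")
                .cause(lastAttemptFailure.get())
                .build();
    }
}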

steveloughran avatar May 31 '24 13:05 steveloughran

similar stack trace

Caused by: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@169ebccb[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@1646e74a[Wrapped task = software.amazon.awssdk.core.internal.http.timers.SyncTimeoutTask@12632a0f]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@164acb27[Shutting down, pool size = 2, active threads = 0, queued tasks = 2, completed tasks = 0]
        at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
        at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
        at software.amazon.awssdk.core.internal.http.timers.TimerUtils.timeSyncTaskIfNeeded(TimerUtils.java:85)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:67)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
        at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
        at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
        at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
        at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
        at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
        at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
        at software.amazon.awssdk.services.s3.DefaultS3Client.headObject(DefaultS3Client.java:6315)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$10(S3AFileSystem.java:2935)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:431)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2923)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2903)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:4037)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3965)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getFileStatus$26(S3AFileSystem.java:3942)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:543)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:524)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:445)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2748)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2767)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3940)
        at org.apache.hadoop.hive.ql.io.HdfsUtils.getFileId(HdfsUtils.java:69)

mukund-thakur avatar Jun 04 '24 17:06 mukund-thakur

ApiCallTimeoutException does not have a cause other than the configured timeout being hit, so no cause for this exception is expected. That said, this seems like a reasonable feature request - it sounds useful for debugging purposes to know what exceptions are happening "under the hood".

I'm not sure of the right place to surface those exceptions. Maybe as "suppressed exceptions" on the timeout exception? Or maybe we just expose them on the interceptors for capture?

Possible workarounds until we can figure out the right design for this:

  1. Underlying exceptions are logged under the software.amazon.awssdk.request logger.
  2. Instead of using the API call timeout configuration with a high number of retries, you could rely on the default retry policy and allow it to fail naturally after a fixed number of retries. That failure includes the underlying exceptions.
  3. The underlying exceptions are passed to the retry strategy. You could implement a retry condition that captures the exceptions:
import software.amazon.awssdk.core.retry.RetryMode;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.core.retry.conditions.OrRetryCondition;
import software.amazon.awssdk.core.retry.conditions.RetryCondition;
import software.amazon.awssdk.services.s3.S3Client;

// A condition that never votes to retry itself but prints every exception handed to the
// retry strategy, making the per-attempt failures visible.
RetryCondition loggingCondition = retryPolicyContext -> {
    retryPolicyContext.exception().printStackTrace();
    return false;
};

// OR the logging condition into the standard retry condition, leaving retry behavior unchanged.
RetryPolicy loggingRetryPolicy =
    RetryPolicy.builder(RetryMode.STANDARD)
               .applyMutation(b -> b.retryCondition(OrRetryCondition.create(loggingCondition, b.retryCondition())))
               .build();

S3Client s3 = S3Client.builder()
                      .overrideConfiguration(c -> c.retryPolicy(loggingRetryPolicy))
                      .build();

millems avatar Jun 11 '24 22:06 millems

I think I'd like to disable all retries and wrap it in our own retry logic, but that's not viable as:

  • the S3 transfer manager doesn't let us wrap its invocations
  • we'd end up having to save the underlying cause in a thread-context variable, then when we catch an ApiCallTimeoutException later, extract and raise the saved cause instead (a sketch of that workaround follows this list)
  • the transfer manager does this across separate threads, so we don't gain much
  • so we'd have to completely replace it, and take on maintenance costs forever. This is something we've discussed in the past; it's just been hard to justify the effort.
  • and we also have to handle the S3Express create-session logic, which has its own timeouts and the like, and which we haven't explored enough to see whether it is resilient.
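
For completeness, a sketch of that thread-context workaround, with the caveat already noted: it only helps when the failing attempt and the caller share a thread, which the transfer manager does not guarantee. The class and field names here are hypothetical, and the recording condition would have to be OR'd into the client's retry policy as in the earlier snippet.

import software.amazon.awssdk.core.exception.ApiCallTimeoutException;
import software.amazon.awssdk.core.retry.conditions.RetryCondition;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Response;

public class ThreadLocalCauseSketch {
    // Last per-attempt failure seen on this thread (application-side holder, not an SDK feature).
    private static final ThreadLocal<Throwable> LAST_FAILURE = new ThreadLocal<>();

    // Retry condition that records each exception and never votes to retry by itself;
    // combine it with the default condition via OrRetryCondition so retry behavior is unchanged.
    static final RetryCondition RECORDING_CONDITION = ctx -> {
        LAST_FAILURE.set(ctx.exception());
        return false;
    };

    static ListObjectsV2Response listWithRealCause(S3Client s3, String bucket) {
        try {
            return s3.listObjectsV2(b -> b.bucket(bucket));
        } catch (ApiCallTimeoutException e) {
            Throwable recorded = LAST_FAILURE.get();
            if (recorded != null) {
                e.addSuppressed(recorded);   // surface the real failure alongside the timeout
            }
            throw e;
        } finally {
            LAST_FAILURE.remove();
        }
    }
}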

I really don't want to completely rip out a core part of the S3 portion of the SDK which aims to help developers. But if it is the only way to actually debug what is going wrong underneath, we are going to have to, aren't we?

It would be a lot better for all users if the raising of timeouts picked up whatever underlying cause was triggering retries.

steveloughran avatar Jun 12 '24 11:06 steveloughran