aws-sdk-java-v2
Add an option to disable retries on UnknownHostException
Describe the bug
See #4738
The S3A connector does not want to retry on UnknownHostException, on the following basis: the host is unknown. While this may seem a transient problem from the AWS perspective, it generally isn't.
- Third-party stores with custom endpoints usually hit this when the endpoint URL is misconfigured. This is a configuration problem which retries will not recover from.
- Region-based derivation of endpoints is brittle to the region being misconfigured. Again, retries achieve nothing here.
- Attempting to use an S3Express bucket with a region which doesn't support it will trigger retries of the CreateSession API. This will not be resolved until AWS rolls out that storage class to that region.
Expected Behavior
Attempts to connect to an endpoint which trigger UnknownHostException should fail fast and let our application logic decide what to do (for us: fail).
Current Behavior
- The API call attempt is retried.
- If the retries outlast the API call timeout, the call fails with the inner cause lost.
- S3AFS currently treats ApiCallTimeoutException as recoverable; we may need to revisit this.
Reproduction Steps
See HADOOP-19000 for this surfacing when connecting to S3Express buckets.
Possible Solution
An option on the client builder to let us stop the SDK treating UnknownHostException as something it should retry on.
Additional Information/Context
No response
AWS Java SDK version used
2.21.33
JDK version used
openjdk version "1.8.0_362" OpenJDK Runtime Environment (Zulu 8.68.0.21-CA-macos-aarch64) (build 1.8.0_362-b09) OpenJDK 64-Bit Server VM (Zulu 8.68.0.21-CA-macos-aarch64) (build 25.362-b09, mixed mode
Operating System and version
macos 13.4.1
stack trace of where the exception is being raised
java.net.UnknownHostException: software.amazon.awssdk.core.exception.SdkClientException: Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.: stevel2--usw2-az1--x-s3.s3express-usw2-az1.us-east-2.amazonaws.com
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
at org.apache.hadoop.fs.s3a.impl.ErrorTranslation.wrapWithInnerIOE(ErrorTranslation.java:132)
at org.apache.hadoop.fs.s3a.impl.ErrorTranslation.maybeExtractIOException(ErrorTranslation.java:105)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:199)
at org.apache.hadoop.fs.s3a.Invoker.onceInTheFuture(Invoker.java:190)
at org.apache.hadoop.fs.s3a.Listing$ObjectListingIterator.next(Listing.java:652)
at org.apache.hadoop.fs.s3a.Listing$FileStatusListingIterator.requestNextBatch(Listing.java:431)
at org.apache.hadoop.fs.s3a.Listing$FileStatusListingIterator.<init>(Listing.java:373)
at org.apache.hadoop.fs.s3a.Listing.createFileStatusListingIterator(Listing.java:144)
at org.apache.hadoop.fs.s3a.Listing.getFileStatusesAssumingNonEmptyDir(Listing.java:265)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerListStatus(S3AFileSystem.java:3639)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listStatus$23(S3AFileSystem.java:3616)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listStatus$24(S3AFileSystem.java:3615)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2707)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2726)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:3614)
at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:276)
at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:456)
at org.apache.hadoop.fs.shell.Ls.processPathArgument(Ls.java:242)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:301)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:285)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:121)
at org.apache.hadoop.fs.shell.Command.run(Command.java:192)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:327)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:97)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:390)
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
at software.amazon.awssdk.awscore.interceptor.HelpfulUnknownHostExceptionInterceptor.modifyException(HelpfulUnknownHostExceptionInterceptor.java:59)
at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.modifyException(ExecutionInterceptorChain.java:202)
at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.runModifyException(ExceptionReportingUtils.java:54)
at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.ExceptionReportingUtils.reportFailureToInterceptors(ExceptionReportingUtils.java:38)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:39)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
at software.amazon.awssdk.services.s3.DefaultS3Client.createSession(DefaultS3Client.java:1589)
at software.amazon.awssdk.services.s3.S3Client.createSession(S3Client.java:2505)
at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.getCredentials(S3ExpressIdentityCache.java:88)
at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.lambda$getCachedCredentials$0(S3ExpressIdentityCache.java:73)
at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.refreshResult(CachedS3ExpressCredentials.java:91)
at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.lambda$new$0(CachedS3ExpressCredentials.java:70)
at software.amazon.awssdk.utils.cache.CachedSupplier.lambda$jitteredPrefetchValueSupplier$8(CachedSupplier.java:300)
at software.amazon.awssdk.utils.cache.NonBlocking.fetch(NonBlocking.java:151)
at software.amazon.awssdk.utils.cache.CachedSupplier.refreshCache(CachedSupplier.java:208)
at software.amazon.awssdk.utils.cache.CachedSupplier.get(CachedSupplier.java:135)
at software.amazon.awssdk.services.s3.internal.s3express.CachedS3ExpressCredentials.get(CachedS3ExpressCredentials.java:85)
at software.amazon.awssdk.services.s3.internal.s3express.S3ExpressIdentityCache.get(S3ExpressIdentityCache.java:61)
at software.amazon.awssdk.services.s3.internal.s3express.DefaultS3ExpressIdentityProvider.lambda$resolveIdentity$0(DefaultS3ExpressIdentityProvider.java:56)
at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:684)
at java.base/java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:662)
at java.base/java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:2168)
at software.amazon.awssdk.services.s3.internal.s3express.DefaultS3ExpressIdentityProvider.resolveIdentity(DefaultS3ExpressIdentityProvider.java:49)
at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.trySelectAuthScheme(S3AuthSchemeInterceptor.java:142)
at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.selectAuthScheme(S3AuthSchemeInterceptor.java:81)
at software.amazon.awssdk.services.s3.auth.scheme.internal.S3AuthSchemeInterceptor.beforeExecution(S3AuthSchemeInterceptor.java:61)
at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.lambda$beforeExecution$1(ExecutionInterceptorChain.java:62)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at software.amazon.awssdk.core.interceptor.ExecutionInterceptorChain.beforeExecution(ExecutionInterceptorChain.java:62)
at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.runInitialInterceptors(AwsExecutionContextBuilder.java:239)
at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.invokeInterceptorsAndCreateExecutionContext(AwsExecutionContextBuilder.java:130)
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.invokeInterceptorsAndCreateExecutionContext(AwsSyncClientHandler.java:67)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:76)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
at software.amazon.awssdk.services.s3.DefaultS3Client.listObjectsV2(DefaultS3Client.java:7323)
at software.amazon.awssdk.services.s3.DelegatingS3Client.lambda$listObjectsV2$63(DelegatingS3Client.java:5856)
at software.amazon.awssdk.services.s3.internal.crossregion.S3CrossRegionSyncClient.invokeOperation(S3CrossRegionSyncClient.java:73)
at software.amazon.awssdk.services.s3.DelegatingS3Client.listObjectsV2(DelegatingS3Client.java:5856)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listObjects$13(S3AFileSystem.java:2963)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:431)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listObjects(S3AFileSystem.java:2954)
at org.apache.hadoop.fs.s3a.S3AFileSystem$ListingOperationCallbacksImpl.lambda$listObjectsAsync$0(S3AFileSystem.java:2573)
at org.apache.hadoop.fs.s3a.impl.CallableSupplier.get(CallableSupplier.java:88)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Suppressed: java.lang.RuntimeException: Task failed.
at software.amazon.awssdk.utils.CompletableFutureUtils.joinLikeSync(CompletableFutureUtils.java:254)
at software.amazon.awssdk.auth.signer.AwsSignerExecutionAttribute.awsCredentialsReadMapping(AwsSignerExecutionAttribute.java:201)
at software.amazon.awssdk.core.interceptor.ExecutionAttribute$DerivationValueStorage.get(ExecutionAttribute.java:260)
at software.amazon.awssdk.core.interceptor.ExecutionAttributes.getAttribute(ExecutionAttributes.java:53)
at software.amazon.awssdk.core.interceptor.ExecutionAttributes.getOptionalAttribute(ExecutionAttributes.java:68)
at software.amazon.awssdk.awscore.internal.AwsExecutionContextBuilder.invokeInterceptorsAndCreateExecutionContext(AwsExecutionContextBuilder.java:144)
... 22 more
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: stevel2--usw2-az1--x-s3.s3express-usw2-az1.us-east-2.amazonaws.com
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)
at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:223)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:83)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
... 56 more
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: stevel2--usw2-az1--x-s3.s3express-usw2-az1.us-east-2.amazonaws.com: nodename nor servname provided, or not known
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Unable to execute HTTP request: stevel2--usw2-az1--x-s3.s3express-usw2-az1.us-east-2.amazonaws.com
Caused by: java.net.UnknownHostException: stevel2--usw2-az1--x-s3.s3express-usw2-az1.us-east-2.amazonaws.com
at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:801)
at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1385)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
at software.amazon.awssdk.thirdparty.org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
at software.amazon.awssdk.thirdparty.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
at software.amazon.awssdk.thirdparty.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
at software.amazon.awssdk.http.apache.internal.conn.ClientConnectionManagerFactory$DelegatingHttpClientConnectionManager.connect(ClientConnectionManagerFactory.java:86)
at software.amazon.awssdk.thirdparty.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at software.amazon.awssdk.thirdparty.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at software.amazon.awssdk.thirdparty.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at software.amazon.awssdk.thirdparty.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at software.amazon.awssdk.thirdparty.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at software.amazon.awssdk.thirdparty.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at software.amazon.awssdk.http.apache.internal.impl.ApacheSdkHttpClient.execute(ApacheSdkHttpClient.java:72)
at software.amazon.awssdk.http.apache.ApacheHttpClient.execute(ApacheHttpClient.java:254)
at software.amazon.awssdk.http.apache.ApacheHttpClient.access$500(ApacheHttpClient.java:104)
at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:231)
at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:228)
at software.amazon.awssdk.core.internal.util.MetricUtils.measureDurationUnsafe(MetricUtils.java:99)
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.executeHttpRequest(MakeHttpRequestStage.java:79)
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:57)
at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:40)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:72)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
... 68 more
@steveloughran
Possible Solution client builder to allow us to disable treating unknownHostException as something SDK should retry on.
We do support custom RetryConditions in the ClientOverrideConfiguration retryPolicy. Would this be sufficient to get you unblocked? You can use the SDK default retry condition as a base and remove the exceptions you don't want to retry.
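A custom retry condition along these lines needs a way to recognise the failure. In the stack trace above the UnknownHostException is not the top-level exception but sits in the cause chain of an SdkClientException, so the check has to walk the causes. The following is a minimal sketch using only the JDK; the class and method names (`UnknownHostCheck`, `hasUnknownHostCause`) are hypothetical and not part of the SDK.

```java
import java.io.IOException;
import java.net.UnknownHostException;

/** Hypothetical helper for illustration; not part of the AWS SDK or S3A. */
final class UnknownHostCheck {
    private UnknownHostCheck() {}

    /** True if t, or any exception in its cause chain, is an UnknownHostException. */
    static boolean hasUnknownHostCause(Throwable t) {
        int depth = 0;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (Throwable c = t; c != null && depth < 32; c = c.getCause(), depth++) {
            if (c instanceof UnknownHostException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Mimics the nesting seen in the trace: a wrapper around the DNS failure.
        Throwable wrapped = new RuntimeException("Unable to execute HTTP request",
                new UnknownHostException("bucket.example.invalid"));
        System.out.println(hasUnknownHostCause(wrapped));
        System.out.println(hasUnknownHostCause(new IOException("connection reset")));
    }
}
```

One way to use it, assuming the `RetryCondition` and `RetryPolicyContext` types of the SDK version in use: a condition that returns false when `hasUnknownHostCause(context.exception())` holds and otherwise defers to the SDK default condition. This only affects clients the application builds itself, so it would need checking whether it reaches calls the SDK issues internally.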
I think we are going to have to explore this more. I've noticed during debugging work that the SDK has a policy of "retry on all IOEs", which is why this comes in.
I'm probably going to see about turning off all retries in the AWS SDK and doing it above, where we have more control over policy, logging, etc. This is straightforward for direct S3A to S3 calls, but not for the indirect ones (S3Express CreateSession and the transfer manager). It might be time to do the oft-contemplated replacement of the transfer manager with something internal.
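The idea of retrying above the SDK, with the policy under application control, might be sketched roughly as follows. This is an illustration only, not the S3A Invoker: `FailFastRetry`, its parameters, and the fixed delay are all assumptions. It retries IOExceptions but rethrows immediately when an UnknownHostException appears anywhere in the cause chain, so the original cause is preserved.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

/** Hypothetical application-level retry wrapper; names are illustrative only. */
final class FailFastRetry {
    private FailFastRetry() {}

    /** Runs op up to maxAttempts times, sleeping delayMillis between attempts. */
    static <T> T call(Callable<T> op, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                // Unknown host: treated as a configuration error, so fail fast.
                if (hasUnknownHost(e)) {
                    throw e;
                }
                // Only IOExceptions are considered transient in this sketch.
                if (!(e instanceof IOException)) {
                    throw e;
                }
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }

    /** Walks a bounded number of causes looking for an UnknownHostException. */
    private static boolean hasUnknownHost(Throwable t) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < 32; c = c.getCause(), depth++) {
            if (c instanceof UnknownHostException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // An operation that succeeds immediately, to show the call shape.
        String result = call(() -> "listing complete", 3, 100L);
        System.out.println(result);
    }
}
```

As noted above, a wrapper like this only covers calls the application makes directly; requests issued inside the SDK, such as CreateSession for S3Express or transfer manager operations, would not pass through it.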