trino
Add configurable retry policy for S3 client
Description
Stress testing and benchmarking of the S3 file system revealed errors in the Hive connector such as the following:
```
Caused by: java.io.IOException: Failed to list location: s3://benchmark-sep-hive-us-east-2-tpcds-sf1000-01/sf1000/catalog_returns/cr_returned_date_sk=2451790
at io.trino.filesystem.s3.S3FileSystem.listFiles(S3FileSystem.java:195)
at io.trino.filesystem.manager.SwitchingFileSystem.listFiles(SwitchingFileSystem.java:110)
at io.trino.filesystem.tracing.TracingFileSystem.lambda$listFiles$4(TracingFileSystem.java:109)
at io.trino.filesystem.tracing.Tracing.withTracing(Tracing.java:47)
at io.trino.filesystem.tracing.TracingFileSystem.listFiles(TracingFileSystem.java:109)
at io.trino.filesystem.ForwardingTrinoFileSystem.listFiles(ForwardingTrinoFileSystem.java:89)
at io.trino.plugin.hive.fs.CachingDirectoryLister.listFilesRecursively(CachingDirectoryLister.java:96)
at io.trino.plugin.hive.fs.TransactionScopeCachingDirectoryLister.createListingRemoteIterator(TransactionScopeCachingDirectoryLister.java:97)
at io.trino.plugin.hive.fs.TransactionScopeCachingDirectoryLister.lambda$listInternal$0(TransactionScopeCachingDirectoryLister.java:78)
at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4955)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2328)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2187)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2081)
at com.google.common.cache.LocalCache.get(LocalCache.java:4036)
at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4950)
at io.trino.cache.EvictableCache.get(EvictableCache.java:112)
at io.trino.plugin.hive.fs.TransactionScopeCachingDirectoryLister.listInternal(TransactionScopeCachingDirectoryLister.java:78)
at io.trino.plugin.hive.fs.TransactionScopeCachingDirectoryLister.listFilesRecursively(TransactionScopeCachingDirectoryLister.java:70)
at io.trino.plugin.hive.fs.HiveFileIterator$FileStatusIterator.<init>(HiveFileIterator.java:140)
... 11 more
Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 3TWYHXY49Z761P7Y, Extended Request ID: nDKplvh5sDhgsEEJAVCPHmiDsF0vUlIQMKbvRfjs1sFK+4WGtsZYtDHn6ed1mCmHx/9VjgWKRoI=)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43)
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:93)
at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:279)
at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50)
```
This change gives the native file system S3 client a configurable retry policy, because the default retry mechanism does not appear to be sufficient under heavy load. Following the AWS team's recommendations, I experimented with the maximum error retry count; raising it above the default of 3 fixed the issue. AWS support also suggests setting the retry mode to STANDARD for some workloads. According to https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/, the default "Equal Jitter" strategy is the loser among the jittered backoff approaches, so making the retry mode configurable may help some workloads. This change therefore exposes the retry mode as a configurable setting as well.
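To illustrate the two knobs discussed above (the maximum error retry count and the backoff/jitter strategy), here is a minimal, self-contained Java sketch. It does not use the AWS SDK or Trino's actual configuration classes: `RetrySketch`, `RetrySettings`, `JitterMode`, `ThrottledException`, and `Sleeper` are hypothetical names, and the 503 response is simulated. The delay formulas follow the "Full Jitter" and "Equal Jitter" definitions from the AWS blog post linked above.

```java
import java.time.Duration;
import java.util.Random;
import java.util.concurrent.Callable;

// Hypothetical sketch: none of these names are Trino or AWS SDK APIs.
final class RetrySketch {
    // Two backoff strategies from the AWS "Exponential Backoff and Jitter" post.
    enum JitterMode { FULL, EQUAL }

    // Hypothetical settings mirroring the two knobs this change wants to expose.
    record RetrySettings(int maxErrorRetries, JitterMode jitterMode, Duration baseDelay, Duration maxBackoff) {
        RetrySettings {
            if (maxErrorRetries < 0) {
                throw new IllegalArgumentException("maxErrorRetries must be >= 0");
            }
        }
    }

    // Stands in for an S3 503 "Please reduce your request rate." response.
    static final class ThrottledException extends RuntimeException {
        ThrottledException(String message) { super(message); }
    }

    // Abstracts sleeping so the demo can run without real delays.
    interface Sleeper { void sleep(long millis) throws InterruptedException; }

    // Delay before retry number `attempt` (0-based), where cap = min(maxBackoff, base * 2^attempt).
    static long backoffMillis(RetrySettings s, int attempt, Random random) {
        long base = s.baseDelay().toMillis();
        // Clamp the shift so a large attempt count cannot overflow the long for realistic base delays.
        long cap = Math.min(s.maxBackoff().toMillis(), base << Math.min(attempt, 20));
        return switch (s.jitterMode()) {
            // Full Jitter: sleep = random_between(0, cap)
            case FULL -> (long) (random.nextDouble() * cap);
            // Equal Jitter: sleep = cap / 2 + random_between(0, cap / 2)
            case EQUAL -> cap / 2 + (long) (random.nextDouble() * (cap / 2));
        };
    }

    // Calls `call`, retrying throttled failures up to maxErrorRetries times with jittered backoff.
    static <T> T callWithRetries(Callable<T> call, RetrySettings s, Sleeper sleeper, Random random) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (ThrottledException e) {
                if (attempt >= s.maxErrorRetries()) {
                    throw e; // retries exhausted: surface the 503 to the caller
                }
                sleeper.sleep(backoffMillis(s, attempt, random));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        RetrySettings settings = new RetrySettings(10, JitterMode.FULL, Duration.ofMillis(50), Duration.ofSeconds(20));
        int[] calls = {0};
        // Simulated listing call that is throttled five times before succeeding.
        String result = callWithRetries(() -> {
            if (++calls[0] <= 5) {
                throw new ThrottledException("503 Please reduce your request rate.");
            }
            return "listing ok";
        }, settings, millis -> {}, new Random(42)); // no-op sleeper; a real client would use Thread::sleep
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Running `main` prints `listing ok after 6 attempts`. The simulated listing only succeeds because `maxErrorRetries` is above 3; with a limit of 3, the fourth consecutive 503 is rethrown to the caller. In the actual client these settings would presumably be passed to the SDK's retry configuration rather than implemented as a hand-written loop.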
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:
```markdown
# Section
* Fix some things. ({issue}`issuenumber`)
```
cc @charlesjmorgan @findinpath