aws-sdk-java-v2 icon indicating copy to clipboard operation
aws-sdk-java-v2 copied to clipboard

S3 timing out on CI (tests using LocalStack)

Open daniel-frak opened this issue 10 months ago • 9 comments

Describe the bug

AWS SDK 2.21.12+ causes our Gitlab CI runner (an AWS t3.xlarge instance) to time out on S3 operations. While the tests usually take 6-10 min, here they take up to 43 minutes before failing the build.

Importantly, everything works perfectly well on a local machine (the entire build takes about 3.5min). The issues only occur on CI. Furthermore, the issue does not occur on version 2.20.63. On 2.20.63, all S3 operations take milliseconds, as expected (even on CI).

Expected Behavior

S3 operations should not time out during CI builds.

Current Behavior

S3 operations time out during CI builds.

software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Read timed out
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:223)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:83)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
	at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
	at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
	at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
	at software.amazon.awssdk.services.s3.DefaultS3Client.createBucket(DefaultS3Client.java:1156)
	at software.amazon.awssdk.services.s3.S3Client.createBucket(S3Client.java:1708)
	at [...]
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Read timed out
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Unable to execute HTTP request: Read timed out
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Unable to execute HTTP request: Read timed out
	Suppressed: java.lang.NullPointerException: Cannot invoke "java.util.Set.forEach(java.util.function.Consumer)" because the return value of "org.junit.jupiter.api.extension.ExtensionContext$Store.remove(Object, java.lang.Class)" is null
		at org.mockito.junit.jupiter.MockitoExtension.afterEach(MockitoExtension.java:190)

Reproduction Steps

We are running the tests using Maven and creating a LocalStack instance using TestContainers (1.19.7). The relevant code is:

s3Client = S3Client.builder()
  .endpointOverride(LOCAL_STACK_CONTAINER.getEndpointOverride(LocalStackContainer.Service.S3))
  .credentialsProvider(StaticCredentialsProvider.create(AwsBasicCredentials.create(
    LOCAL_STACK_CONTAINER.getAccessKey(), LOCAL_STACK_CONTAINER.getSecretKey())))
  .region(Region.of(LOCAL_STACK_CONTAINER.getRegion()))
  .build();

// ...

// It times out here (@BeforeEach):
s3Client.createBucket(request -> request.bucket(bucket));

// ...And here (@AfterEach):
s3Client.deleteBucket(request -> request.bucket(bucket));

Below is a simplified version of our .gitlab-ci.yml:

variables:
  MAVEN_CLI_OPTS: >-
    -gs ./ci_settings.xml 
    -Dmaven.repo.local=${CI_PROJECT_DIR}/.m2/repository 
    -Djavax.net.ssl.trustStore=./cacerts
    --batch-mode
    --errors

image: [...]/maven:3.9.4-amazoncorretto-21

code_quality_analysis:
  before_script:
    - >
      export DOCKER_AUTH_CONFIG=[...]
script:
    - > # Run tests
      mvn clean verify $MAVEN_CLI_OPTS
      -T 2C
# ...

I don't think our ci_settings.xml proxies are to blame, but just in case here's a simplified version of those:

<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd">
  
  <proxies>
    <proxy>
      <id>[...]-http</id>
      <active>true</active>
      <protocol>http</protocol>
      <port>8080</port>
      <host>[...]</host>
      <nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
    </proxy>
    <proxy>
      <id>[...]-https</id>
      <active>true</active>
      <protocol>https</protocol>
      <port>8080</port>
      <host>[...]</host>
      <nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
    </proxy>
  </proxies>
</settings>

Possible Solution

it seems the issue was introduced in AWS SDK somewhere between 2.20.63 and 2.21.12. I would assume some code introduced between those version is the reason for this issue.

Additional Information/Context

I noticed the bug when upgrading spring-cloud-aws-dependencies from 3.0.1 to 3.0.3+. spring-cloud-aws-dependencies:3.0.1 depends on AWS SDK 2.20.63, while spring-cloud-aws-dependencies:3.0.3 depends on 2.21.12.

I noticed the bug appears on spring-cloud-aws-dependencies:3.0.3, but disappears when I manually downgrade AWS SDK to 2.21.12. I tried upgrading AWS SDK to 2.25.17, but the problem persists. Therefore, it seems the issue was introduced in AWS SDK somewhere between 2.20.63 and 2.21.12.

The runner instance shows only mild memory and CPU consumption, so it doesn't seem like lack of resources is the problem here.

AWS Java SDK version used

2.21.12

JDK version used

maven:3.9.4-amazoncorretto-21 (docker image)

Operating System and version

Alpine Linux 3.16

daniel-frak avatar Mar 26 '24 13:03 daniel-frak