spring-cloud-aws icon indicating copy to clipboard operation
spring-cloud-aws copied to clipboard

S3 timing out on CI since 3.0.3 (tests using LocalStack)

Open daniel-frak opened this issue 10 months ago • 4 comments

Type: Bug

Component: S3

Describe the bug Running any version of spring-cloud-aws-dependencies higher than 3.0.2 (tested with 3.0.3 - 3.1.1) causes our Gitlab CI runner (an AWS t3.xlarge instance) to time out on S3 operations:

software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Read timed out
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:223)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:83)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
	at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
	at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
	at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
	at software.amazon.awssdk.services.s3.DefaultS3Client.createBucket(DefaultS3Client.java:1156)
	at software.amazon.awssdk.services.s3.S3Client.createBucket(S3Client.java:1708)
	at [...]
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Read timed out
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Unable to execute HTTP request: Read timed out
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Unable to execute HTTP request: Read timed out
	Suppressed: java.lang.NullPointerException: Cannot invoke "java.util.Set.forEach(java.util.function.Consumer)" because the return value of "org.junit.jupiter.api.extension.ExtensionContext$Store.remove(Object, java.lang.Class)" is null
		at org.mockito.junit.jupiter.MockitoExtension.afterEach(MockitoExtension.java:190)

While the tests usually take 6-10 min, here they take up to 43 minutes before failing the build.

We are running the tests using Maven and creating a LocalStack instance using TestContainers (1.19.7). The relevant code is:

s3Client = S3Client.builder()
  .endpointOverride(LOCAL_STACK_CONTAINER.getEndpointOverride(LocalStackContainer.Service.S3))
  .credentialsProvider(StaticCredentialsProvider.create(AwsBasicCredentials.create(
    LOCAL_STACK_CONTAINER.getAccessKey(), LOCAL_STACK_CONTAINER.getSecretKey())))
  .region(Region.of(LOCAL_STACK_CONTAINER.getRegion()))
  .build();

// ...

// It times out here (@BeforeEach):
s3Client.createBucket(request -> request.bucket(bucket));

// ...And here (@AfterEach):
s3Client.deleteBucket(request -> request.bucket(bucket));

Importantly, everything works perfectly well on a local machine (the entire build takes about 3.5min). The issues only occur on CI. Furthermore, the issue does not occur on version 3.0.2. On 3.0.2, all S3 operations take milliseconds, as expected (even on CI).

I thought that maybe it's the fault of a new version of AWS SDK, so I tried downgrading AWS SDK to the version from 3.0.2:

<dependencyManagement>
	<dependencies>
		<dependency>
			<groupId>software.amazon.awssdk</groupId>
			<artifactId>bom</artifactId>
			<version>2.20.63</version>
			<type>pom</type>
			<scope>import</scope>
		</dependency>
		<dependency>
			<groupId>software.amazon.awssdk</groupId>
			<artifactId>s3-transfer-manager</artifactId>
			<version>2.20.63</version>
			<optional>true</optional>
		</dependency>
	</dependencies>
</dependencyManagement>

I found a semi-similar problem on LocalStack's Github, so I tried running LocalStack with the suggested environment variables:

DockerImageName imageName = DockerImageName.parse("localstack/localstack:latest")
        .asCompatibleSubstituteFor("localstack/localstack");
return (LocalStackContainerWrapper) new LocalStackContainer(imageName)
        // https://github.com/localstack/localstack/issues/10340#issuecomment-2008323237
        .withEnv("GATEWAY_SERVER", "twisted")
        // https://github.com/localstack/localstack/issues/10340#issuecomment-1979282202
        .withEnv("PROVIDER_OVERRIDE_S3", "legacy_v2");

~None of the above fixed the problem. I tried looking through changes between 3.0.2 and 3.0.3 but I'm unable to find anything that would stand out as a likely culprit.~ (EDIT: It turns out AWS SDK is to blame after all. See comment below)

The bucket the tests are struggling to create and then remove is called test-bucket, so I don't think any name-related problems are to blame.

At the same time, I really need to update this dependency (I need the FIFO queue name bugfix introduced in 3.1.1) and I'm at a loss what to do.

The runner instance shows only mild memory and CPU consumption, so it doesn't seem like lack of resources is the problem here. Below is a simplified version of our .gitlab-ci.yml:

variables:
  MAVEN_CLI_OPTS: >-
    -gs ./ci_settings.xml 
    -Dmaven.repo.local=${CI_PROJECT_DIR}/.m2/repository 
    -Djavax.net.ssl.trustStore=./cacerts
    --batch-mode
    --errors

image: [...]/maven:3.9.4-amazoncorretto-21

code_quality_analysis:
  before_script:
    - >
      export DOCKER_AUTH_CONFIG=[...]
script:
    - > # Run tests
      mvn clean verify $MAVEN_CLI_OPTS
      -T 2C
# ...

I don't think our ci_settings.xml proxies are to blame, but just in case here's a simplified version of those:

<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd">
  
  <proxies>
    <proxy>
      <id>[...]-http</id>
      <active>true</active>
      <protocol>http</protocol>
      <port>8080</port>
      <host>[...]</host>
      <nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
    </proxy>
    <proxy>
      <id>[...]-https</id>
      <active>true</active>
      <protocol>https</protocol>
      <port>8080</port>
      <host>[...]</host>
      <nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
    </proxy>
  </proxies>
</settings>

In summary, ~some change~ (EDIT: the AWS SDK update) in 3.0.3 is causing our CI to time out during S3 operations ~and we can't figure out what it is~.

Sample Unfortunately, so far I was unable to reproduce this problem outside of Gitlab CI.

daniel-frak avatar Mar 26 '24 10:03 daniel-frak