
Call to AWSApplicationAutoScalingClient.putScalingPolicy hangs indefinitely

Open lordpengwin opened this issue 1 year ago • 1 comments

Upcoming End-of-Support

  • [X] I acknowledge the upcoming end-of-support for AWS SDK for Java v1 was announced, and migration to AWS SDK for Java v2 is recommended.

Describe the bug

When deploying a service to ECS using the Java SDK, I've seen many instances where a call to AWSApplicationAutoScalingClient.putScalingPolicy() never returns. This happens roughly 10% of the time I make this call. I've even set withSdkRequestTimeout() on the PutScalingPolicyRequest and it still hangs. Note: the policy does get applied to the service, but the call never returns.

Is this a known problem? Is there a way that I can debug or work around it?

Expected Behavior

The SDK call should return or timeout.

Current Behavior

Hangs forever

Reproduction Steps

This is my code:

autoScalingClient.putScalingPolicy(new PutScalingPolicyRequest()
    .withResourceId(resourceID)
    .withServiceNamespace(ServiceNamespace.Ecs)
    .withPolicyName(String.format(APPLICATION_SCALING_POLICY, deployedServiceName))
    .withScalableDimension(ScalableDimension.EcsServiceDesiredCount)
    .withPolicyType(PolicyType.TargetTrackingScaling)
    .withTargetTrackingScalingPolicyConfiguration(new TargetTrackingScalingPolicyConfiguration()
        .withPredefinedMetricSpecification(new PredefinedMetricSpecification()
            .withPredefinedMetricType(autoScaleConfig.getScaleUpMetric())
            .withResourceLabel(loadBalancerArn.substring(loadBalancerArn.indexOf("app/")) + "/"
                + targetGroupARN.substring(targetGroupARN.indexOf("targetgroup/"))))
        .withTargetValue(autoScaleConfig.getScaleUpThreshold())
        .withScaleOutCooldown(autoScaleConfig.getScaleUpCooldown())
        .withScaleInCooldown(autoScaleConfig.getScaleDownCooldown()))
    .withSdkRequestTimeout(30000));
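
For reference, timeouts can also be configured at the client level rather than per request. A minimal sketch, assuming the standard client builder; the timeout values below are illustrative, not the ones used in this report:

import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.applicationautoscaling.AWSApplicationAutoScaling;
import com.amazonaws.services.applicationautoscaling.AWSApplicationAutoScalingClientBuilder;

public class AutoScalingClientFactory {

    // Sketch only: illustrative values, not the configuration used in this issue.
    public static AWSApplicationAutoScaling buildClient() {
        ClientConfiguration config = new ClientConfiguration()
                .withConnectionTimeout(10_000)       // ms to establish a connection
                .withSocketTimeout(30_000)           // ms of socket inactivity before giving up
                .withRequestTimeout(30_000)          // ms for a single HTTP attempt
                .withClientExecutionTimeout(60_000); // ms for the whole call, including retries

        return AWSApplicationAutoScalingClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}

The per-request withSdkRequestTimeout is meant to override the client-level request timeout for that one call, so the client-level settings mainly act as a broader safety net across all calls.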

Possible Solution

No response

Additional Information/Context

I've also seen this when doing the same call against SageMaker.

AWS Java SDK version used

1.12.435

JDK version used

17.0.6

Operating System and version

container ubi9-minimal:latest

lordpengwin avatar Aug 28 '24 15:08 lordpengwin

It's unusual for the SDK client to hang indefinitely; I would expect the request to time out at some point. Since you see it across different clients, I wonder if the issue is related to ECS.

Have you tried to reproduce it in a different environment, outside a container? Are you setting any custom ClientConfiguration when creating the autoScalingClient? Can you generate the verbose wire logs? Instructions here. Make sure to redact any sensitive information like access keys.
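
In case it helps, a minimal sketch of turning the wire logs on programmatically. This assumes Log4j 1.x is the backend that the SDK's commons-logging picks up; with a different backend, the equivalent is setting the org.apache.http.wire and com.amazonaws.request loggers to DEBUG in its configuration:

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class WireLogging {
    // Sketch only: assumes Log4j 1.x is on the classpath as the logging backend.
    public static void enable() {
        BasicConfigurator.configure(); // console appender on the root logger
        Logger.getLogger("org.apache.http.wire").setLevel(Level.DEBUG);  // raw request/response traffic
        Logger.getLogger("com.amazonaws.request").setLevel(Level.DEBUG); // request IDs, status, latency
    }
}

These logs include full request and response payloads, so they grow quickly; redact credentials and account details before attaching them.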

debora-ito avatar Aug 28 '24 22:08 debora-ito

I've seen this with both ECS and SageMaker, though in either case I'm making the call to an Auto Scaling Group. I believe that I've seen this both from the container and from an Amazon Linux development machine. I'm not setting a custom ClientConfiguration on the autoScalingClient. I've also had this happen in multiple AWS accounts. I will try to run some experiments today to see if I can recreate the problem consistently; it has happened randomly in the past. If I can, I will try to enable the wire logs as described above, and I will also try to get a Java thread dump.
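
For capturing what the process is doing while it hangs, a minimal sketch of an in-process dump using ThreadMXBean (ThreadInfo.toString() truncates deep stacks, so running jstack <pid> from outside the JVM is usually the more complete option):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {
    // Sketch only: prints every live thread with its state and (possibly truncated) stack trace.
    public static void dumpThreads() {
        ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mxBean.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}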

lordpengwin avatar Aug 29 '24 11:08 lordpengwin

So I might have been wrong here. I managed to get my application to hang again, and it does not appear to be stuck where I thought it was; it appears that it is simply not exiting. A thread dump shows this thread still running:

"s3-transfer-manager-worker-1" #40 prio=5 os_prio=0 cpu=135446.59ms elapsed=8370.90s allocated=4078M defined_classes=95 tid=0x00007f327202d0d0 nid=0x70 waiting on condition [0x00007f323a4fe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park([email protected]/Native Method) - parking to wait for <0x0000000715b2df10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:341) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block([email protected]/AbstractQueuedSynchronizer.java:506) at java.util.concurrent.ForkJoinPool.unmanagedBlock([email protected]/ForkJoinPool.java:3463) at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3434) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:1623) at java.util.concurrent.LinkedBlockingQueue.take([email protected]/LinkedBlockingQueue.java:435) at java.util.concurrent.ThreadPoolExecutor.getTask([email protected]/ThreadPoolExecutor.java:1062) at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1122) at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635) at java.lang.Thread.run([email protected]/Thread.java:833)

I suspect that the problem is that an S3 TransferManager is not being shut down correctly, which leaves its worker threads alive and keeps the JVM from exiting.
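
That would be consistent with the thread name above: the SDK v1 TransferManager owns a pool of "s3-transfer-manager-worker-*" threads, and if the manager is never shut down they can keep the process alive after the rest of the application finishes. A minimal sketch of the cleanup; the surrounding code is illustrative, not the application's actual code:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

public class TransferManagerExample {
    // Sketch only: shows the shutdown call, not the original application's logic.
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        TransferManager transferManager = TransferManagerBuilder.standard()
                .withS3Client(s3)
                .build();
        try {
            // ... uploads/downloads ...
        } finally {
            // Stops the "s3-transfer-manager-worker-*" threads; pass true to also
            // shut down the wrapped AmazonS3 client.
            transferManager.shutdownNow(false);
        }
    }
}

Passing true to shutdownNow() also shuts down the underlying AmazonS3 client; pass false if that client is shared elsewhere.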

lordpengwin avatar Aug 29 '24 17:08 lordpengwin

This issue is now closed.

Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.

github-actions[bot] avatar Aug 29 '24 17:08 github-actions[bot]

I'm pretty sure that this was my problem. Thanks for the help.

lordpengwin avatar Aug 29 '24 17:08 lordpengwin