
Call to AWSApplicationAutoScalingClient.putScalingPolicy hangs indefinitely

Open lordpengwin opened this issue 1 year ago • 1 comments

Upcoming End-of-Support

  • [X] I acknowledge the upcoming end-of-support for AWS SDK for Java v1 was announced, and migration to AWS SDK for Java v2 is recommended.

Describe the bug

When deploying a service to ECS using the Java SDK, I've seen many instances where a call to AWSApplicationAutoScalingClient.putScalingPolicy() never returns. This happens roughly 10% of the time I make this call. I've even set withSdkRequestTimeout() on the PutScalingPolicyRequest and it still hangs. Note: the policy does get applied to the service, but the call never returns.

Is this a known problem? Is there a way that I can debug or work around it?

Expected Behavior

The SDK call should return or timeout.

Current Behavior

Hangs forever

Reproduction Steps

This is my code:

autoScalingClient.putScalingPolicy(new PutScalingPolicyRequest()
    .withResourceId(resourceID)
    .withServiceNamespace(ServiceNamespace.Ecs)
    .withPolicyName(String.format(APPLICATION_SCALING_POLICY, deployedServiceName))
    .withScalableDimension(ScalableDimension.EcsServiceDesiredCount)
    .withPolicyType(PolicyType.TargetTrackingScaling)
    .withTargetTrackingScalingPolicyConfiguration(new TargetTrackingScalingPolicyConfiguration()
        .withPredefinedMetricSpecification(new PredefinedMetricSpecification()
            .withPredefinedMetricType(autoScaleConfig.getScaleUpMetric())
            .withResourceLabel(loadBalancerArn.substring(loadBalancerArn.indexOf("app/")) + "/"
                + targetGroupARN.substring(targetGroupARN.indexOf("targetgroup/"))))
        .withTargetValue(autoScaleConfig.getScaleUpThreshold())
        .withScaleOutCooldown(autoScaleConfig.getScaleUpCooldown())
        .withScaleInCooldown(autoScaleConfig.getScaleDownCooldown()))
    .withSdkRequestTimeout(30000));
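
For reference, timeouts can also be configured at the client level rather than per request. A minimal sketch, assuming the standard client builder; the timeout values below are illustrative, not the ones used in this report:

import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.applicationautoscaling.AWSApplicationAutoScaling;
import com.amazonaws.services.applicationautoscaling.AWSApplicationAutoScalingClientBuilder;

public class AutoScalingClientFactory {

    // Sketch only: illustrative values, not the configuration used in this issue.
    public static AWSApplicationAutoScaling buildClient() {
        ClientConfiguration config = new ClientConfiguration()
                .withConnectionTimeout(10_000)       // ms to establish a connection
                .withSocketTimeout(30_000)           // ms of socket inactivity before giving up
                .withRequestTimeout(30_000)          // ms for a single HTTP attempt
                .withClientExecutionTimeout(60_000); // ms for the whole call, including retries

        return AWSApplicationAutoScalingClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}

The per-request withSdkRequestTimeout is meant to override the client-level request timeout for that one call, so the client-level settings mainly act as a broader safety net across all calls.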

Possible Solution

No response

Additional Information/Context

I've also seen this when doing the same call against SageMaker.

AWS Java SDK version used

1.12.435

JDK version used

17.0.6

Operating System and version

container ubi9-minimal:latest

lordpengwin avatar Aug 28 '24 15:08 lordpengwin

It's unusual for the SDK client to hang indefinitely; I would expect the request to time out at some point. Since you see it across different clients, I wonder if the issue is related to ECS.

Have you tried to reproduce it in a different environment, outside a container? Are you setting any custom ClientConfiguration when creating the autoScalingClient? Can you generate the verbose wire logs? Instructions here. Make sure to redact any sensitive information like access keys.
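
In case it helps, a minimal sketch of turning the wire logs on programmatically. This assumes Log4j 1.x is the backend that the SDK's commons-logging picks up; with a different backend, the equivalent is setting the org.apache.http.wire and com.amazonaws.request loggers to DEBUG in its configuration:

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class WireLogging {
    // Sketch only: assumes Log4j 1.x is on the classpath as the logging backend.
    public static void enable() {
        BasicConfigurator.configure(); // console appender on the root logger
        Logger.getLogger("org.apache.http.wire").setLevel(Level.DEBUG);  // raw request/response traffic
        Logger.getLogger("com.amazonaws.request").setLevel(Level.DEBUG); // request IDs, status, latency
    }
}

These logs include full request and response payloads, so they grow quickly; redact credentials and account details before attaching them.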

debora-ito avatar Aug 28 '24 22:08 debora-ito

I've seen this with both ECS and SageMaker, though in either case I'm making the call to an Auto Scaling Group. I believe that I've seen this both from the container and from an Amazon Linux development machine. I'm not setting a custom ClientConfiguration on the autoScalingClient. I've also had this happen in multiple AWS accounts. I will try to run some experiments today to see if I can recreate the problem consistently; it has happened randomly in the past. If I can, I will try to enable the wire logs as described above, and I will also try to get a Java thread dump.
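
For capturing what the process is doing while it hangs, a minimal sketch of an in-process dump using ThreadMXBean (ThreadInfo.toString() truncates deep stacks, so running jstack <pid> from outside the JVM is usually the more complete option):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {
    // Sketch only: prints every live thread with its state and (possibly truncated) stack trace.
    public static void dumpThreads() {
        ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mxBean.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}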

lordpengwin avatar Aug 29 '24 11:08 lordpengwin

So I might have been wrong here. I managed to get my application to hang again, and it does not appear to be stuck where I thought it was; it appears that it is simply not exiting. A thread dump shows this thread still running:

"s3-transfer-manager-worker-1" #40 prio=5 os_prio=0 cpu=135446.59ms elapsed=8370.90s allocated=4078M defined_classes=95 tid=0x00007f327202d0d0 nid=0x70 waiting on condition [0x00007f323a4fe000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park([email protected]/Native Method) - parking to wait for <0x0000000715b2df10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:341) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block([email protected]/AbstractQueuedSynchronizer.java:506) at java.util.concurrent.ForkJoinPool.unmanagedBlock([email protected]/ForkJoinPool.java:3463) at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3434) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:1623) at java.util.concurrent.LinkedBlockingQueue.take([email protected]/LinkedBlockingQueue.java:435) at java.util.concurrent.ThreadPoolExecutor.getTask([email protected]/ThreadPoolExecutor.java:1062) at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1122) at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635) at java.lang.Thread.run([email protected]/Thread.java:833)

I suspect that the problem is that an S3 TransferManager is not being shut down correctly, which leaves its worker threads alive and keeps the JVM from exiting.
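
That would be consistent with the thread name above: the SDK v1 TransferManager owns a pool of "s3-transfer-manager-worker-*" threads, and if the manager is never shut down they can keep the process alive after the rest of the application finishes. A minimal sketch of the cleanup; the surrounding code is illustrative, not the application's actual code:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

public class TransferManagerExample {
    // Sketch only: shows the shutdown call, not the original application's logic.
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        TransferManager transferManager = TransferManagerBuilder.standard()
                .withS3Client(s3)
                .build();
        try {
            // ... uploads/downloads ...
        } finally {
            // Stops the "s3-transfer-manager-worker-*" threads; pass true to also
            // shut down the wrapped AmazonS3 client.
            transferManager.shutdownNow(false);
        }
    }
}

Passing true to shutdownNow() also shuts down the underlying AmazonS3 client; pass false if that client is shared elsewhere.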

lordpengwin avatar Aug 29 '24 17:08 lordpengwin

This issue is now closed.

Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.

github-actions[bot] avatar Aug 29 '24 17:08 github-actions[bot]

I'm pretty sure that this was my problem. Thanks for the help.

lordpengwin avatar Aug 29 '24 17:08 lordpengwin