aws-sdk-java

java.lang.ArithmeticException: / by zero in TokenBucket

Open mehakmeet opened this issue 3 years ago • 8 comments

Describe the bug

While reading from S3, Spark executors are being killed and whole stages are being canceled.

Stacktrace snippet:

java.lang.ArithmeticException: / by zero
at com.amazonaws.internal.TokenBucket.calculateTimeWindow(TokenBucket.java:321)
at com.amazonaws.internal.TokenBucket.updateClientSendingRate(TokenBucket.java:302)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleSuccessResponse(AmazonHttpClient.java:1487)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1381)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
....

This was first seen after upgrading from AWS Java SDK bundle 1.11.901 to 1.12.132.

Expected Behavior

The read should succeed and no ArithmeticException should be raised.

Current Behavior

The read fails with an ArithmeticException inside the AWS SDK, which causes the whole job to fail.

Reproduction Steps

Not reproducible; it is only being seen in one of our environments after upgrading.

Possible Solution

Is there any way to turn this throttling off as a workaround?

Additional Information/Context

No response

AWS Java SDK version used

AWS Java SDK bundle 1.12.132

JDK version used

1.8.0_322

Operating System and version

x86_64 GNU/Linux

mehakmeet avatar Aug 23 '22 08:08 mehakmeet

CC: @ahmarsuhail

mehakmeet avatar Aug 23 '22 08:08 mehakmeet

Note: the deployment seeing this is running in EC2, talking to a local S3 bucket. Responses will be fast, if that is a factor.

steveloughran avatar Aug 23 '22 09:08 steveloughran

Also, I don't see anywhere in my local JDK where the string "/ by zero" is generated; in the JDK it's always "Division By Zero". Where I can find it is in Guava, in com.google.common.math.IntMath#divide, which is used internally in some of the Guava classes.

steveloughran avatar Aug 23 '22 10:08 steveloughran

This is TokenBucket.calculateTimeWindow:

synchronized void calculateTimeWindow() {
        timeWindow = Math.pow((lastMaxRate * (1 - BETA)) / SCALE_CONSTANT, 1.0 / 3);
}

where SCALE_CONSTANT = 0.4;

Maybe the JDK is messing with the calculation?

I don't think there's a configuration to disable the token bucket calculation.
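For anyone who wants to check the arithmetic in isolation, here is a minimal standalone sketch of the same expression. It is not the SDK class: the class and method names are made up for illustration, SCALE_CONSTANT is the value quoted above, and BETA = 0.7 is an assumed value that should be checked against the SDK source. On a JVM with standard IEEE 754 behaviour, none of these inputs should throw; unusual inputs produce NaN or Infinity instead.

```java
// Standalone sketch of the quoted expression; NOT the SDK's TokenBucket class.
final class TimeWindowSketch {
    // Value quoted in this thread.
    static final double SCALE_CONSTANT = 0.4;
    // Assumption for illustration only; verify against the SDK source.
    static final double BETA = 0.7;

    // Same formula as the quoted calculateTimeWindow, written as a pure function.
    static double timeWindow(double lastMaxRate) {
        return Math.pow((lastMaxRate * (1 - BETA)) / SCALE_CONSTANT, 1.0 / 3);
    }

    public static void main(String[] args) {
        double[] inputs = {0.0, 10.0, -1.0, Double.NaN, Double.POSITIVE_INFINITY};
        for (double rate : inputs) {
            // Only double arithmetic is involved, so no ArithmeticException is expected.
            System.out.println(rate + " -> " + timeWindow(rate));
        }
    }
}
```

If this runs cleanly in the affected environment's JVM but the SDK call still fails, that would point away from the formula itself.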

debora-ito avatar Aug 23 '22 18:08 debora-ito

this is weirder than i can imagine. if it was c/c++ i'd say "stack is toast"

  1. floating-point division by zero returns Infinity or NaN; it does not throw an exception
  2. new ArithmeticException("/ by zero") only occurs in the Guava source, not in the JDK, which uses the string "Division by Zero"

steveloughran avatar Aug 24 '22 09:08 steveloughran

Update: "/ by zero" is the error message from the JVM itself, but there are no integer division instructions in the bytecode of the class.
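A small sketch of the distinction, assuming a HotSpot-based JVM (the exact message text comes from the JVM and could differ elsewhere). The class and method names are made up for illustration. Integer division by zero throws ArithmeticException, while double division by zero returns Infinity or NaN without throwing.

```java
// Illustrative only: contrasts integer and floating-point division by zero.
final class DivideByZeroDemo {
    // Returns the exception message for an int division, or a note if nothing was thrown.
    static String integerDivisionMessage(int numerator, int denominator) {
        try {
            int result = numerator / denominator;
            return "no exception, result " + result;
        } catch (ArithmeticException e) {
            return e.getMessage();
        }
    }

    // Plain double division; follows IEEE 754 and does not throw.
    static double doubleDivision(double numerator, double denominator) {
        return numerator / denominator;
    }

    public static void main(String[] args) {
        System.out.println("int 1 / 0       : " + integerDivisionMessage(1, 0));
        System.out.println("double 1.0 / 0.0: " + doubleDivision(1.0, 0.0));
        System.out.println("double 0.0 / 0.0: " + doubleDivision(0.0, 0.0));
    }
}
```

Since the quoted method only uses double arithmetic, the integer path above is the only one of the two that should normally be able to produce this exception.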

steveloughran avatar Sep 01 '22 14:09 steveloughran

@mehakmeet Hi Mehakmeet, could you please list the differences in environment configuration (Hadoop/Spark JAR versions, JDK versions) between the environments where this issue is seen and those where it is not?

joviegas avatar Sep 02 '22 18:09 joviegas

@mehakmeet @steveloughran any new information?

debora-ito avatar Sep 20 '22 23:09 debora-ito

It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

github-actions[bot] avatar Sep 26 '22 00:09 github-actions[bot]

Our current suspicion is that some native code loaded earlier has done bad things, especially if it is on the LD_LIBRARY_PATH. There's a risk that something switched the FPU into blowing up on the / by 0.0 in your code. In that case there are bigger problems than just this class, which is simply the first place where the problem is surfacing.
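One way to compare a failing environment with a working one is to dump the JVM details and native library search paths from inside the same process. This is only a diagnostic sketch with a made-up class name; it reads standard system properties and the LD_LIBRARY_PATH environment variable, and it cannot by itself show which native libraries were actually loaded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Diagnostic sketch: collects JVM and native-library-path settings for comparison.
final class EnvReport {
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        String[] properties = {
            "java.version", "java.vm.name", "java.vm.version",
            "os.name", "os.arch", "java.library.path"
        };
        for (String key : properties) {
            // String.valueOf turns a missing property into the text "null".
            report.put(key, String.valueOf(System.getProperty(key)));
        }
        report.put("env.LD_LIBRARY_PATH", String.valueOf(System.getenv("LD_LIBRARY_PATH")));
        return report;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : collect().entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Running this on an executor in each environment and diffing the output would show whether the library paths differ.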

steveloughran avatar Sep 26 '22 18:09 steveloughran

Our internal investigation indicates that the cause is not in the AWS SDK, so I am closing this issue.

mehakmeet avatar Oct 17 '22 05:10 mehakmeet

COMMENT VISIBILITY WARNING

Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

github-actions[bot] avatar Oct 17 '22 05:10 github-actions[bot]