AWSLogs#putLogEvents call never ends when the max batch size is exceeded
Describe the bug
We have several projects that depend on com.amazonaws:aws-java-sdk-logs and push our logs to CloudWatch. We've been using it in production for a while and everything has worked fine so far.
Recently, some of our log entries became so big that some batches of log events processed with AWSLogs#putLogEvents started exceeding the maximum size allowed (1,048,576 bytes according to the Javadoc). Every time such a batch was processed by one of our apps, the AWSLogs#putLogEvents call seemed never to return, causing log entries to accumulate in the app's memory until the application eventually crashed because the JVM heap was exhausted.
We can work around that issue by lowering the number of log events in a batch, effectively reducing the batch size. We can also limit the memory used by the in-memory log events queue, preventing a crash in case of accumulation. However, there's still a risk that the max batch size will be exceeded again in the future.
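For illustration, a rough sketch of that kind of size-aware batching (the helper is hypothetical, not our actual code; the 26-byte per-event overhead is the figure given in the CloudWatch Logs documentation, and a single event larger than the limit would still need separate handling):

```java
import com.amazonaws.services.logs.model.InputLogEvent;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public final class LogEventBatcher {

    // Per the CloudWatch Logs documentation, a batch may not exceed 1,048,576 bytes,
    // counted as the UTF-8 size of each message plus 26 bytes of per-event overhead.
    private static final int MAX_BATCH_BYTES = 1_048_576;
    private static final int PER_EVENT_OVERHEAD = 26;

    /** Splits events into batches that each stay under the maximum batch size. */
    public static List<List<InputLogEvent>> split(List<InputLogEvent> events) {
        List<List<InputLogEvent>> batches = new ArrayList<>();
        List<InputLogEvent> current = new ArrayList<>();
        int currentBytes = 0;

        for (InputLogEvent event : events) {
            int eventBytes = event.getMessage().getBytes(StandardCharsets.UTF_8).length
                    + PER_EVENT_OVERHEAD;
            if (!current.isEmpty() && currentBytes + eventBytes > MAX_BATCH_BYTES) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(event);
            currentBytes += eventBytes;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```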
I tried changing a few AWSLogsClientBuilder configuration values (timeouts and friends), but nothing seemed to have any impact on the blocked call.
Expected Behavior
When the max batch size is exceeded, AWSLogs#putLogEvents should complete, either with a status indicating that something went wrong or with an exception.
Current Behavior
AWSLogs#putLogEvents never completes.
Reproduction Steps
Call AWSLogs#putLogEvents with a request that exceeds the maximum batch size.
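A minimal sketch of such a call (log group and stream names are placeholders, and the message sizes are just illustrative):

```java
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;
import com.amazonaws.services.logs.model.InputLogEvent;
import com.amazonaws.services.logs.model.PutLogEventsRequest;

import java.util.ArrayList;
import java.util.List;

public class PutLogEventsReproducer {
    public static void main(String[] args) {
        AWSLogs logs = AWSLogsClientBuilder.defaultClient();

        // Build a batch whose total payload exceeds the documented 1,048,576-byte limit.
        // Each message is ~100 KB, so ~15 events push the batch past the limit.
        List<InputLogEvent> events = new ArrayList<>();
        String bigMessage = "x".repeat(100_000);
        for (int i = 0; i < 15; i++) {
            events.add(new InputLogEvent()
                    .withTimestamp(System.currentTimeMillis())
                    .withMessage(bigMessage));
        }

        PutLogEventsRequest request = new PutLogEventsRequest()
                .withLogGroupName("my-log-group")   // placeholder names
                .withLogStreamName("my-log-stream")
                .withLogEvents(events);

        // On the SDK versions listed below, this call appeared to hang indefinitely
        // instead of failing fast.
        logs.putLogEvents(request);
    }
}
```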
Possible Solution
No response
Additional Information/Context
No response
AWS Java SDK version used
1.12.257 (production) and 1.12.262 (local reproducer)
JDK version used
openjdk version "11.0.15" 2022-04-19 LTS (local reproducer)
Operating System and version
Multiple operating systems
@gwenneg thank you for reaching out.
My suggestion in this case would be to tune the timeouts so the operation would fail and not hang for long. You said you tried to do that but it didn't work, I'd investigate why.
The SDK would not make input size validations if that's the ask.
Thank you for your answer @debora-ito.
> My suggestion in this case would be to tune the timeouts so the operation would fail and not hang for long. You said you tried to do that but it didn't work, I'd investigate why.
>
> The SDK would not make input size validations if that's the ask.
I tried all the with*Timeout methods from com.amazonaws.ClientConfiguration, but I'll double-check to make sure I didn't miss anything. However, even if that worked, it would still look like a workaround for the underlying issue to me: the server should reject invalid inputs and return the error to the HTTP client used by the SDK. The SDK should then forward that error somehow (with a status, an exception or something else) to our code. We need to be able to differentiate invalid inputs from real timeouts.
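For reference, this is roughly the kind of configuration I tried (the timeout values are illustrative):

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;

public class TimeoutConfigExample {
    public static AWSLogs buildClient() {
        // Every timeout knob exposed by ClientConfiguration, with deliberately
        // small values; the putLogEvents call still did not return.
        ClientConfiguration config = new ClientConfiguration()
                .withConnectionTimeout(5_000)        // ms
                .withSocketTimeout(10_000)           // ms
                .withRequestTimeout(15_000)          // ms
                .withClientExecutionTimeout(20_000); // ms

        return AWSLogsClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}
```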
> the server should reject invalid inputs and return the error to the HTTP client used by the SDK. The SDK should then forward that error somehow (with a status, an exception or something else) to our code.
That's a valid point, but this would be a request for the CloudWatch Logs service team; I'll make sure to pass it along to them.
Let us know of your findings around the timeouts.
After further investigation, it turns out that our code was both:
- using a `ScheduledExecutorService` to push the log entries to CloudWatch, and that executor service wasn't protected against exceptions (it stopped working as soon as one exception was thrown)
- swallowing exceptions returned by CloudWatch, which made us unaware of issues that had actually been there for a long time
A simple try/catch block was all we needed to fix the bug... 🤦
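For anyone hitting the same symptom, the shape of the fix looks roughly like this (class and method names are illustrative, not our actual code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LogPusher {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // With scheduleAtFixedRate, an uncaught exception suppresses all subsequent
        // executions of the task, which is how our pushing silently stopped.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                pushPendingEventsToCloudWatch();
            } catch (Exception e) {
                // Log and keep going so one failed putLogEvents call doesn't kill
                // the scheduled task and let the in-memory queue grow forever.
                System.err.println("Failed to push log events: " + e);
            }
        }, 0, 5, TimeUnit.SECONDS);
    }

    private void pushPendingEventsToCloudWatch() {
        // drain the in-memory queue and call AWSLogs#putLogEvents here
    }
}
```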
Sorry for the trouble and for the delay before I could update this issue. Thanks for your help @debora-ito!
COMMENT VISIBILITY WARNING
Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue, feel free to do so.