AWSLogs#putLogEvents call never ends when the max batch size is exceeded
Describe the bug
We have several projects that depend on com.amazonaws:aws-java-sdk-logs and push our logs to CloudWatch. We've been using it in production for a while and everything has worked fine so far.
Recently, some of our log entries became so big that some batches of log events processed with AWSLogs#putLogEvents started exceeding the maximum size allowed (1,048,576 bytes according to the Javadoc). Every time such a batch was processed by one of our apps, the AWSLogs#putLogEvents call seemed never to return, causing log entries to accumulate in the app's memory until the application eventually crashed because the JVM heap was exhausted.
We can work around that issue by lowering the number of log events in a batch, effectively reducing the batch size. We can also limit the memory used by the in-memory log events queue, preventing a crash in case of accumulation. However, there's still a risk that the max batch size will be exceeded again in the future.
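For illustration, a rough sketch of that kind of size-aware batching (the helper is hypothetical, not our actual code; the 26-byte per-event overhead is the figure given in the CloudWatch Logs documentation, and a single event larger than the limit would still need separate handling):

```java
import com.amazonaws.services.logs.model.InputLogEvent;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public final class LogEventBatcher {

    // Per the CloudWatch Logs documentation, a batch may not exceed 1,048,576 bytes,
    // counted as the UTF-8 size of each message plus 26 bytes of per-event overhead.
    private static final int MAX_BATCH_BYTES = 1_048_576;
    private static final int PER_EVENT_OVERHEAD = 26;

    /** Splits events into batches that each stay under the maximum batch size. */
    public static List<List<InputLogEvent>> split(List<InputLogEvent> events) {
        List<List<InputLogEvent>> batches = new ArrayList<>();
        List<InputLogEvent> current = new ArrayList<>();
        int currentBytes = 0;

        for (InputLogEvent event : events) {
            int eventBytes = event.getMessage().getBytes(StandardCharsets.UTF_8).length
                    + PER_EVENT_OVERHEAD;
            if (!current.isEmpty() && currentBytes + eventBytes > MAX_BATCH_BYTES) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(event);
            currentBytes += eventBytes;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```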
I tried changing a few AWSLogsClientBuilder configuration values (timeouts and friends), but nothing seemed to have any impact on the blocked call.
Expected Behavior
When the max batch size is exceeded, AWSLogs#putLogEvents should complete, either with a status indicating that something went wrong or with an exception.
Current Behavior
AWSLogs#putLogEvents never completes.
Reproduction Steps
Call AWSLogs#putLogEvents with a request that exceeds the maximum batch size.
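A minimal sketch of such a call (log group and stream names are placeholders, and the message sizes are just illustrative):

```java
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;
import com.amazonaws.services.logs.model.InputLogEvent;
import com.amazonaws.services.logs.model.PutLogEventsRequest;

import java.util.ArrayList;
import java.util.List;

public class PutLogEventsReproducer {
    public static void main(String[] args) {
        AWSLogs logs = AWSLogsClientBuilder.defaultClient();

        // Build a batch whose total payload exceeds the documented 1,048,576-byte limit.
        // Each message is ~100 KB, so ~15 events push the batch past the limit.
        List<InputLogEvent> events = new ArrayList<>();
        String bigMessage = "x".repeat(100_000);
        for (int i = 0; i < 15; i++) {
            events.add(new InputLogEvent()
                    .withTimestamp(System.currentTimeMillis())
                    .withMessage(bigMessage));
        }

        PutLogEventsRequest request = new PutLogEventsRequest()
                .withLogGroupName("my-log-group")   // placeholder names
                .withLogStreamName("my-log-stream")
                .withLogEvents(events);

        // On the SDK versions listed below, this call appeared to hang indefinitely
        // instead of failing fast.
        logs.putLogEvents(request);
    }
}
```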
Possible Solution
No response
Additional Information/Context
No response
AWS Java SDK version used
1.12.257 (production) and 1.12.262 (local reproducer)
JDK version used
openjdk version "11.0.15" 2022-04-19 LTS (local reproducer)
Operating System and version
Multiple operating systems
@gwenneg thank you for reaching out.
My suggestion in this case would be to tune the timeouts so the operation would fail and not hang for long. You said you tried to do that but it didn't work, I'd investigate why.
The SDK would not make input size validations if that's the ask.
Thank you for your answer @debora-ito.
> My suggestion in this case would be to tune the timeouts so the operation would fail and not hang for long. You said you tried to do that but it didn't work, I'd investigate why.
>
> The SDK would not make input size validations if that's the ask.
I tried all the with*Timeout methods from com.amazonaws.ClientConfiguration, but I'll double-check to make sure I didn't miss anything. However, even if that worked, it would still look like a workaround for the underlying issue to me: the server should reject invalid inputs and return the error to the HTTP client used by the SDK. The SDK should then forward that error somehow (with a status, an exception or something else) to our code. We need to be able to differentiate invalid inputs from real timeouts.
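For reference, this is roughly the kind of configuration I tried (the timeout values are illustrative):

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;

public class TimeoutConfigExample {
    public static AWSLogs buildClient() {
        // Every timeout knob exposed by ClientConfiguration, with deliberately
        // small values; the putLogEvents call still did not return.
        ClientConfiguration config = new ClientConfiguration()
                .withConnectionTimeout(5_000)        // ms
                .withSocketTimeout(10_000)           // ms
                .withRequestTimeout(15_000)          // ms
                .withClientExecutionTimeout(20_000); // ms

        return AWSLogsClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}
```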
> the server should reject invalid inputs and return the error to the HTTP client used by the SDK. The SDK should then forward that error somehow (with a status, an exception or something else) to our code.
That's a valid point, but this would be a request for the CloudWatch Logs service team; I'll make sure to pass it along to them.
Let us know of your findings around the timeouts.
After further investigation, it turns out that our code was both:
- using a `ScheduledExecutorService` to push the log entries to CloudWatch, and that executor service wasn't protected against exceptions (it stopped working as soon as one exception was thrown)
- swallowing exceptions returned by CloudWatch, which made us unaware of issues that had actually been there for a long time
A simple try/catch block was all we needed to fix the bug... 🤦
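For anyone hitting the same symptom, the shape of the fix looks roughly like this (class and method names are illustrative, not our actual code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LogPusher {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // With scheduleAtFixedRate, an uncaught exception suppresses all subsequent
        // executions of the task, which is how our pushing silently stopped.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                pushPendingEventsToCloudWatch();
            } catch (Exception e) {
                // Log and keep going so one failed putLogEvents call doesn't kill
                // the scheduled task and let the in-memory queue grow forever.
                System.err.println("Failed to push log events: " + e);
            }
        }, 0, 5, TimeUnit.SECONDS);
    }

    private void pushPendingEventsToCloudWatch() {
        // drain the in-memory queue and call AWSLogs#putLogEvents here
    }
}
```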
Sorry for the trouble and for the delay before I could update this issue. Thanks for your help @debora-ito!
COMMENT VISIBILITY WARNING
Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue, feel free to do so.