amazon-kinesis-producer
NullPointerException in getOldestRecordTimeInMillis()
Seeing an NPE in getOldestRecordTimeInMillis under certain circumstances. Interestingly, we didn't see it during load testing, only after deploying to production, so it may be related to lighter traffic.
It only seems to happen inside the callbacks; calls made outside the callback work fine.
getOldestRecordTimeInMillis is a no-arg method in the AWS code, so it's not a user error
Stack trace:
java.lang.NullPointerException: Cannot invoke "com.amazonaws.services.kinesis.producer.KinesisProducer$SettableFutureTracker.getTimestamp()" because the return value of "java.util.concurrent.PriorityBlockingQueue.peek()" is null
	at com.amazonaws.services.kinesis.producer.KinesisProducer.getOldestRecordTimeInMillis(KinesisProducer.java:678) ~[amazon-kinesis-producer-0.14.13.jar!/:?]
	at [ELIDED_USER_CLASS_BASICALLY_COPIED_FROM_AWS_DOCS]$1.onSuccess(KinesisPublisher.java:335) ~[classes!/:?]
	at [ELIDED_USER_CLASS_BASICALLY_COPIED_FROM_AWS_DOCS]$1.onSuccess(KinesisPublisher.java:312) ~[classes!/:?]
	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1138) ~[guava-31.1-jre.jar!/:?]
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) ~[guava-31.1-jre.jar!/:?]
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270) ~[guava-31.1-jre.jar!/:?]
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038) ~[guava-31.1-jre.jar!/:?]
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:783) ~[guava-31.1-jre.jar!/:?]
	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49) ~[guava-31.1-jre.jar!/:?]
	at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.onPutRecordResult(KinesisProducer.java:237) ~[amazon-kinesis-producer-0.14.13.jar!/:?]
	at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.access$100(KinesisProducer.java:164) ~[amazon-kinesis-producer-0.14.13.jar!/:?]
	at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler$1.run(KinesisProducer.java:171) ~[amazon-kinesis-producer-0.14.13.jar!/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]
Looks like a pretty bog-standard TOCTOU bug at this line:
https://github.com/awslabs/amazon-kinesis-producer/blob/master/java/amazon-kinesis-producer/src/main/java/com/amazonaws/services/kinesis/producer/KinesisProducer.java#L678
Another thread emptied the tracker heap after the isEmpty() check passed but before peek() was called, so peek() returned null and the chained getTimestamp() call threw.
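To make the race concrete, here's a minimal sketch (the Tracker record is a stand-in for the real SettableFutureTracker, which is a private inner class of KinesisProducer). It simulates the losing interleaving deterministically by draining the queue between the check and the peek, exactly as the completion-callback thread does in production:

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class ToctouDemo {
    // Stand-in for KinesisProducer$SettableFutureTracker (assumption).
    record Tracker(Instant timestamp) {}

    public static void main(String[] args) {
        PriorityBlockingQueue<Tracker> heap =
            new PriorityBlockingQueue<>(11, Comparator.comparing(Tracker::timestamp));
        heap.add(new Tracker(Instant.now()));

        // Thread A: the isEmpty() check passes...
        boolean checkPassed = !heap.isEmpty();

        // ...but thread B (here simulated inline) drains the queue
        // between the check and the peek.
        heap.poll();

        // Thread A then calls peek(), which now returns null; chaining
        // .getTimestamp() onto it is the NPE from the stack trace.
        Tracker oldest = heap.peek();
        System.out.println("check passed: " + checkPassed + ", peek: " + oldest);
    }
}
```

isEmpty() and peek() are each individually thread-safe on PriorityBlockingQueue, but the check-then-act sequence as a whole is not atomic, which is the defining shape of a TOCTOU bug.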
Yeah, so the fix could be something like:
Old, racy code:

@Override
public long getOldestRecordTimeInMillis() {
    if (oldestFutureTrackerHeap.isEmpty()) {
        return 0;
    }
    return Instant.now().toEpochMilli() - oldestFutureTrackerHeap.peek().getTimestamp().toEpochMilli();
}
Proper code:

@Override
public long getOldestRecordTimeInMillis() {
    var oldestFuture = oldestFutureTrackerHeap.peek();
    if (oldestFuture == null) {
        return 0;
    }
    return Instant.now().toEpochMilli() - oldestFuture.getTimestamp().toEpochMilli();
}
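A self-contained sketch of the fix, with the tracker again stubbed as a simple record since the real class is private to KinesisProducer. The single peek() call either returns a tracker or null; there is no longer a window in which another thread can empty the queue between a check and a dereference:

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class OldestRecordAge {
    // Stand-in for KinesisProducer$SettableFutureTracker (assumption).
    record Tracker(Instant timestamp) {
        Instant getTimestamp() { return timestamp; }
    }

    final PriorityBlockingQueue<Tracker> oldestFutureTrackerHeap =
        new PriorityBlockingQueue<>(11, Comparator.comparing(Tracker::getTimestamp));

    // Null-safe version: one atomic peek() instead of isEmpty() + peek().
    public long getOldestRecordTimeInMillis() {
        var oldestFuture = oldestFutureTrackerHeap.peek();
        if (oldestFuture == null) {
            return 0;
        }
        return Instant.now().toEpochMilli() - oldestFuture.getTimestamp().toEpochMilli();
    }

    public static void main(String[] args) {
        var age = new OldestRecordAge();
        // Empty heap: returns 0 instead of throwing.
        System.out.println("empty: " + age.getOldestRecordTimeInMillis());
        // Non-empty heap: returns the age of the oldest tracker.
        age.oldestFutureTrackerHeap.add(new Tracker(Instant.now().minusMillis(500)));
        System.out.println("aged: " + (age.getOldestRecordTimeInMillis() >= 500));
    }
}
```

Note the result can still be slightly stale (the heap may change right after peek() returns), but that's inherent to any point-in-time metric; the fix only removes the crash.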