amazon-kinesis-producer icon indicating copy to clipboard operation
amazon-kinesis-producer copied to clipboard

NullPointerException in getOldestRecordTimeInMillis()

Open StuAtGit opened this issue 2 years ago • 1 comments

Seeing an NPE in getOldestRecordTimeInMillis in certain circumstances. Interestingly enough, we didn't see it in load testing, just after we deployed to production, so it might be related to having lighter traffic?

Seems to only be happening in the callbacks - calls outside of the callback seem to be working fine.

getOldestRecordTimeInMillis is a no-arg method in the AWS code, so it's not a user error

Stack trace:

java.lang.NullPointerException: Cannot invoke "com.amazonaws.services.kinesis.producer.KinesisProducer$SettableFutureTracker.getTimestamp()" because the return value of "java.util.concurrent.PriorityBlockingQueue.peek()" is null at com.amazonaws.services.kinesis.producer.KinesisProducer.getOldestRecordTimeInMillis(KinesisProducer.java:678) ~[amazon-kinesis-producer-0.14.13.jar!/:?] at [ELIDED_USER_CLASS_BASICALLY_COPIED_FROM_AWS_DOCS]$1.onSuccess(KinesisPublisher.java:335) ~[classes!/:?] at [ELIDED_USER_CLASS_BASICALLY_COPIED_FROM_AWS_DOCS]$1.onSuccess(KinesisPublisher.java:312) ~[classes!/:?] at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1138) ~[guava-31.1-jre.jar!/:?] at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) ~[guava-31.1-jre.jar!/:?] at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270) ~[guava-31.1-jre.jar!/:?] at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038) ~[guava-31.1-jre.jar!/:?] at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:783) ~[guava-31.1-jre.jar!/:?] at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:49) ~[guava-31.1-jre.jar!/:?] at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.onPutRecordResult(KinesisProducer.java:237) ~[amazon-kinesis-producer-0.14.13.jar!/:?] at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.access$100(KinesisProducer.java:164) ~[amazon-kinesis-producer-0.14.13.jar!/:?] at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler$1.run(KinesisProducer.java:171) ~[amazon-kinesis-producer-0.14.13.jar!/:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?] at java.lang.Thread.run(Thread.java:833) ~[?:?]

Looks like a pretty bog-standard TOCTOU bug at this line:

https://github.com/awslabs/amazon-kinesis-producer/blob/master/java/amazon-kinesis-producer/src/main/java/com/amazonaws/services/kinesis/producer/KinesisProducer.java#L678

The tracker heap emptied after the check, and then peek was called on the queue.

StuAtGit avatar Apr 10 '23 18:04 StuAtGit

Yeah, so the fix could be something like:

OLD, race-condition code:

    @Override
    public long getOldestRecordTimeInMillis() {
        if (oldestFutureTrackerHeap.isEmpty()) {
            return 0;
        }
        return Instant.now().toEpochMilli() - oldestFutureTrackerHeap.peek().getTimestamp().toEpochMilli();
    }

Proper code:

    @Override
   public long getOldestRecordTimeInMillis() {
       var oldestFuture = oldestFutureTrackerHeap.peek();
       if (oldestFuture == null) {
           return 0;
       }
       return Instant.now().toEpochMilli() - oldestFuture.getTimestamp().toEpochMilli();
   }

StuAtGit avatar Apr 10 '23 18:04 StuAtGit