flink-connector-kafka icon indicating copy to clipboard operation
flink-connector-kafka copied to clipboard

[FLINK-33201][Connectors/Kafka] Fix memory leak in CachingTopicSelector

Open qbx2 opened this issue 2 years ago • 4 comments

What is the purpose of the change

In the CachingTopicSelector, a memory leak may occur when the internal logic fails to check the cache size due to a race condition. (https://github.com/apache/flink-connector-kafka/blob/d89a082180232bb79e3c764228c4e7dbb9eb6b8b/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/sink/KafkaRecordSerializationSchemaBuilder.java#L287-L289) This PR fixes the memory leak by modifying the logic to be more resilient to failure.

Brief change log

Fix memory leak in CachingTopicSelector that can be triggered by race condition.

Verifying this change

caching-topic-selector-heap-dump-analysis

  • By analyzing a Java heap dump, I identified a memory leak in the CachingTopicSelector. As in the screenshot, cache has 47,769 elements. If the internal logic were functioning correctly, the number of elements should be less than or equal to CACHE_RESET_SIZE (which is 5).
  • Since writing unit tests for this type of bug is challenging, I instead applied a hotfix to our production workload. Before applying the hotfix, the memory leak is observed in the workload in 7 days. After applying the patch, the issue is no longer observed.

qbx2 avatar Oct 06 '23 02:10 qbx2

Thanks for opening this pull request! Please check out our contributing guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)

boring-cyborg[bot] avatar Oct 06 '23 02:10 boring-cyborg[bot]

@qbx2 Can you please rebase your PR?

MartijnVisser avatar Jan 18 '24 15:01 MartijnVisser

@MartijnVisser Sure, I just rebased it.

qbx2 avatar Jan 18 '24 15:01 qbx2

@MartijnVisser Could you please review?

qbx2 avatar Feb 13 '24 02:02 qbx2

Hey All- We're experiencing some memory issues and wondering if this could be the cause. Will this PR be reviewed soon? @MartijnVisser @tzulitai

cnmckee13 avatar Aug 14 '24 16:08 cnmckee13

@AHeise Can you take a look?

MartijnVisser avatar Aug 19 '24 07:08 MartijnVisser

Oh I see the tests can't be run because it's based on an old branch. I'm rebasing myself.

AHeise avatar Aug 22 '24 10:08 AHeise