kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-17862: [buffer pool] corruption during buffer reuse from the pool

Open gongxuanzhang opened this issue 9 months ago • 13 comments

issue see https://issues.apache.org/jira/browse/KAFKA-17862

🔍 Problem Summary

When an expired batch is still part of an in-flight request, we prematurely release the ByteBuffer back to the BufferPool. This leads to two critical issues:

  1. Expiration does not prevent the in-flight request from being sent.
  2. The expired batch’s ByteBuffer is deallocate to the pool too early. It may be re-allocated for another producer batch while still being referenced by the in-flight request, potentially causing data corruption.

We can tolerate Issue 1, but Issue 2 is critical — we cannot allow it to happen.

Therefore, we remove the expiration handling of ProducerBatch before send, and instead defer the ByteBuffer deallocation to the response handling logic.

gongxuanzhang avatar Apr 16 '25 09:04 gongxuanzhang

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar Apr 24 '25 03:04 github-actions[bot]

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar Apr 29 '25 03:04 github-actions[bot]

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar May 15 '25 03:05 github-actions[bot]

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot] avatar Aug 13 '25 03:08 github-actions[bot]

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar Aug 15 '25 03:08 github-actions[bot]

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot] avatar Nov 13 '25 03:11 github-actions[bot]

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar Nov 15 '25 03:11 github-actions[bot]

Have you seen KAFKA-19012? I'm wondering if there's any relationship between it and KAFKA-17862 🤔

Yes, they encountered the same bug.

chia7712 avatar Dec 04 '25 06:12 chia7712

@junrao

https://github.com/apache/kafka/pull/19489/files#diff-51f6f7d29fde42e1e2ed2ade53b007201a34bcf9efd14f0e6e3fe8fe5114d892R59

Do you think this test is okay? I feel like I don’t have a great way to test it.

gongxuanzhang avatar Dec 08 '25 06:12 gongxuanzhang

There's a weird error that's failing both the JDK 17 and JDK 25 builds during the JUnit XML parsing step:

. . .
  Parsing file: build/junit-xml/clients/25-noflaky-nonew/TEST-org.apache.kafka.common.record.ControlRecordTypeTest.xml, module: clients, job: 25-noflaky-nonew
  Parsing file: build/junit-xml/clients/25-noflaky-nonew/TEST-org.apache.kafka.common.requests.TxnOffsetCommitRequestTest.xml, module: clients, job: 25-noflaky-nonew
  Parsing file: build/junit-xml/clients/25-noflaky-nonew/TEST-org.apache.kafka.common.record.LegacyRecordTest.xml, module: clients, job: 25-noflaky-nonew
  Parsing file: build/junit-xml/clients/25-noflaky-nonew/TEST-org.apache.kafka.common.security.oauthbearer.ClaimValidationUtilsTest.xml, module: clients, job: 25-noflaky-nonew
  Parsing file: build/junit-xml/clients/25-noflaky-nonew/TEST-org.apache.kafka.clients.consumer.internals.ApplicationEventHandlerTest.xml, module: clients, job: 25-noflaky-nonew
  Traceback (most recent call last):
    File "/home/runner/work/kafka/kafka/.github/scripts/junit.py", line 278, in <module>
      for suite in parse_report(workspace_path, report, fp):
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/runner/work/kafka/kafka/.github/scripts/junit.py", line 149, in parse_report
      for (event, elem) in xml.etree.ElementTree.iterparse(fp, events=["start", "end"]):
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/xml/etree/ElementTree.py", line 1242, in iterator
      root = pullparser._close_and_return_root()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/xml/etree/ElementTree.py", line 1290, in _close_and_return_root
      root = self._parser.close()
             ^^^^^^^^^^^^^^^^^^^^
  xml.etree.ElementTree.ParseError: no element found: line 1, column 0

I'll look around and see if I can get more detail.

kirktrue avatar Dec 09 '25 00:12 kirktrue

@gongxuanzhang could you merge trunk to trigger the CI again?

chia7712 avatar Dec 09 '25 03:12 chia7712

https://github.com/apache/kafka/pull/19489/files#diff-51f6f7d29fde42e1e2ed2ade53b007201a34bcf9efd14f0e6e3fe8fe5114d892R59

Do you think this test is okay? I feel like I don’t have a great way to test it.

How about https://github.com/apache/kafka/pull/19489#discussion_r2593505677 ?

Also, could you address https://github.com/apache/kafka/pull/19489#discussion_r2593484432?

junrao avatar Dec 09 '25 05:12 junrao

@junrao
I’ve updated the code to trim invalid bytes and log their content. One question I have: if the trimmed byte array is very large, should we consider limiting the length of the logged data? Would it make sense to introduce a configuration option for this to avoid excessively long logs

gongxuanzhang avatar Dec 10 '25 06:12 gongxuanzhang

Not finished yet, please do not review. Some tests were affected

gongxuanzhang avatar Dec 12 '25 10:12 gongxuanzhang

@gongxuanzhang—thank you very much for finding a precision fix for the buffer corruption issue!

In terms of the Kafka 4.2.0 release cycle, the code freeze deadline has passed and we're focusing on blockers. KAFKA-17862 is marked as a blocker for 4.2.0, so let @junrao, @chia7712, or me know if you need some help resolving the issue.

Thanks for the work so far 🙏🙏🙏🙏

kirktrue avatar Dec 16 '25 17:12 kirktrue

@junrao @chia7712 This PR affects two existing tests, testDropCommitOnBatchExpiry and testDuplicateSequenceAfterProducerReset in TransactionManagerTest. Both of them rely on in-flight batches, and I’m not sure how to construct an expired batch under the new behavior. Could you help advise how these tests should be adapted?

gongxuanzhang avatar Dec 18 '25 02:12 gongxuanzhang

This PR affects two existing tests, testDropCommitOnBatchExpiry and testDuplicateSequenceAfterProducerReset in TransactionManagerTest. Both of them rely on in-flight batches, and I’m not sure how to construct an expired batch under the new behavior. Could you help advise how these tests should be adapted?

@artemlivshits @jolshan : Any recommendations on fixing the tests in TransactionManagerTest due to the changes in this PR?

junrao avatar Dec 18 '25 23:12 junrao

@gongxuanzhang testDropCommitOnBatchExpiry should rely on the NetworkClient's timeout handler rather than the Sender. Therefore, we should call time.sleep(10000) after sender.runOnce() to ensure the NetworkClient expires the request. Note that the request is created inside sender.runOnce(), so its creation time is distinct from the batch's creation time.

For example:

        ProducerTestUtils.runUntil(sender, () -> {
            time.sleep(10000); // ensure the MockClient can see the "updated" now
            return responseFuture.isDone();
        });

chia7712 avatar Dec 19 '25 09:12 chia7712

@chia7712 @junrao @kirktrue Thanks a lot for your review. The PR has now passed CI. PTAL

gongxuanzhang avatar Dec 19 '25 14:12 gongxuanzhang