KAFKA-17862: [buffer pool] corruption during buffer reuse from the pool
Issue: https://issues.apache.org/jira/browse/KAFKA-17862
🔍 Problem Summary
When an expired batch is still part of an in-flight request, we prematurely release its ByteBuffer back to the BufferPool. This causes two issues:
- Expiration does not prevent the in-flight request from being sent.
- The expired batch's ByteBuffer is deallocated back to the pool too early. It may be re-allocated to another producer batch while still being referenced by the in-flight request, potentially causing data corruption.
We can tolerate Issue 1, but Issue 2 is critical and must not be allowed to happen.
Therefore, this PR removes the pre-send expiration handling of ProducerBatch and instead defers the ByteBuffer deallocation to the response-handling logic.
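To make the intended buffer lifecycle concrete, here is a minimal, hypothetical sketch. The names (`InFlightBatch`, `BufferPool.deallocate`, `onResponse`) are illustrative stand-ins, not Kafka's actual API: expiry only marks the batch, and the pooled buffer is returned exactly once, from the response-handling path.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the pool's deallocation hook.
interface BufferPool {
    void deallocate(byte[] buffer);
}

class InFlightBatch {
    private final byte[] buffer; // stands in for the pooled ByteBuffer
    private final AtomicBoolean released = new AtomicBoolean(false);
    private volatile boolean expired = false;

    InFlightBatch(byte[] buffer) {
        this.buffer = buffer;
    }

    // Expiration must NOT release the buffer: an in-flight request may
    // still be reading from it.
    void expire() {
        expired = true;
    }

    boolean isExpired() {
        return expired;
    }

    // The single release point: invoked when the response (or disconnect)
    // for the in-flight request is handled. compareAndSet guards against
    // double deallocation.
    boolean onResponse(BufferPool pool) {
        if (released.compareAndSet(false, true)) {
            pool.deallocate(buffer);
            return true;
        }
        return false;
    }
}
```

The key property is that expiry and release are decoupled: no matter how many times the batch is expired or the response path is retried, the buffer reaches the pool exactly once, and never while a request may still reference it.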
A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.
If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).
If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
Have you seen KAFKA-19012? I'm wondering if there's any relationship between it and KAFKA-17862 🤔
Yes, they encountered the same bug.
@junrao
https://github.com/apache/kafka/pull/19489/files#diff-51f6f7d29fde42e1e2ed2ade53b007201a34bcf9efd14f0e6e3fe8fe5114d892R59
Do you think this test is okay? I feel like I don’t have a great way to test it.
There's a weird error that's failing both the JDK 17 and JDK 25 builds during the JUnit XML parsing step:
```
...
Parsing file: build/junit-xml/clients/25-noflaky-nonew/TEST-org.apache.kafka.common.record.ControlRecordTypeTest.xml, module: clients, job: 25-noflaky-nonew
Parsing file: build/junit-xml/clients/25-noflaky-nonew/TEST-org.apache.kafka.common.requests.TxnOffsetCommitRequestTest.xml, module: clients, job: 25-noflaky-nonew
Parsing file: build/junit-xml/clients/25-noflaky-nonew/TEST-org.apache.kafka.common.record.LegacyRecordTest.xml, module: clients, job: 25-noflaky-nonew
Parsing file: build/junit-xml/clients/25-noflaky-nonew/TEST-org.apache.kafka.common.security.oauthbearer.ClaimValidationUtilsTest.xml, module: clients, job: 25-noflaky-nonew
Parsing file: build/junit-xml/clients/25-noflaky-nonew/TEST-org.apache.kafka.clients.consumer.internals.ApplicationEventHandlerTest.xml, module: clients, job: 25-noflaky-nonew
Traceback (most recent call last):
  File "/home/runner/work/kafka/kafka/.github/scripts/junit.py", line 278, in <module>
    for suite in parse_report(workspace_path, report, fp):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/kafka/kafka/.github/scripts/junit.py", line 149, in parse_report
    for (event, elem) in xml.etree.ElementTree.iterparse(fp, events=["start", "end"]):
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/xml/etree/ElementTree.py", line 1242, in iterator
    root = pullparser._close_and_return_root()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/xml/etree/ElementTree.py", line 1290, in _close_and_return_root
    root = self._parser.close()
           ^^^^^^^^^^^^^^^^^^^^
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
```
I'll look around and see if I can get more detail.
@gongxuanzhang could you merge trunk to trigger the CI again?
> https://github.com/apache/kafka/pull/19489/files#diff-51f6f7d29fde42e1e2ed2ade53b007201a34bcf9efd14f0e6e3fe8fe5114d892R59
> Do you think this test is okay? I feel like I don’t have a great way to test it.
How about https://github.com/apache/kafka/pull/19489#discussion_r2593505677 ?
Also, could you address https://github.com/apache/kafka/pull/19489#discussion_r2593484432?
@junrao
I’ve updated the code to trim invalid bytes and log their content. One question I have: if the trimmed byte array is very large, should we consider limiting the length of the logged data? Would it make sense to introduce a configuration option for this to avoid excessively long logs?
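For illustration, a hedged sketch of the kind of cap being asked about. The helper name, hex encoding, and truncation format are all hypothetical, not anything already in the PR:

```java
class LogTruncation {
    // Hypothetical helper: hex-encode at most maxLogBytes of the trimmed
    // bytes, noting how many were omitted, so log lines stay bounded even
    // for very large invalid-byte arrays.
    static String summarize(byte[] invalidBytes, int maxLogBytes) {
        int shown = Math.min(invalidBytes.length, maxLogBytes);
        StringBuilder sb = new StringBuilder(shown * 2);
        for (int i = 0; i < shown; i++) {
            sb.append(String.format("%02x", invalidBytes[i] & 0xff));
        }
        if (invalidBytes.length > shown) {
            sb.append("... (").append(invalidBytes.length - shown).append(" more bytes)");
        }
        return sb.toString();
    }
}
```

The cap could then be a constant or, if configurability is really wanted, a producer config, though a fixed sensible bound avoids adding yet another knob.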
Not finished yet; please do not review. Some tests were affected.
@gongxuanzhang, thank you very much for finding a precise fix for the buffer corruption issue!
In terms of the Kafka 4.2.0 release cycle, the code freeze deadline has passed and we're focusing on blockers. KAFKA-17862 is marked as a blocker for 4.2.0, so let @junrao, @chia7712, or me know if you need some help resolving the issue.
Thanks for the work so far 🙏🙏🙏🙏
@junrao @chia7712 This PR affects two existing tests, testDropCommitOnBatchExpiry and testDuplicateSequenceAfterProducerReset in TransactionManagerTest. Both of them rely on in-flight batches, and I’m not sure how to construct an expired batch under the new behavior. Could you help advise how these tests should be adapted?
> This PR affects two existing tests, testDropCommitOnBatchExpiry and testDuplicateSequenceAfterProducerReset in TransactionManagerTest. Both of them rely on in-flight batches, and I’m not sure how to construct an expired batch under the new behavior. Could you help advise how these tests should be adapted?
@artemlivshits @jolshan : Any recommendations on fixing the tests in TransactionManagerTest due to the changes in this PR?
@gongxuanzhang testDropCommitOnBatchExpiry should rely on the NetworkClient's timeout handler rather than the Sender. Therefore, we should call time.sleep(10000) after sender.runOnce() to ensure the NetworkClient expires the request. Note that the request is created inside sender.runOnce(), so its creation time is distinct from the batch's creation time.
For example:

```java
ProducerTestUtils.runUntil(sender, () -> {
    time.sleep(10000); // ensure the MockClient can see the "updated" now
    return responseFuture.isDone();
});
```
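For context, the suggestion fits the usual mock-clock pattern: advance time on each iteration so timeout-based handling (such as the NetworkClient's request expiry) can fire before the condition is rechecked. Here is a self-contained toy version; `MockTime` and `runUntil` below are simplified stand-ins, not the real test utilities:

```java
import java.util.function.BooleanSupplier;

// Simplified stand-in for the test mock clock: sleep() just advances a
// counter, so "time" only moves when the test says so.
class MockTime {
    private long nowMs = 0;

    void sleep(long ms) {
        nowMs += ms;
    }

    long milliseconds() {
        return nowMs;
    }
}

class ProducerTestUtilsSketch {
    // Poll the condition, advancing the mock clock between checks so that
    // timeout handlers observe the "updated" now; give up after a bounded
    // number of iterations.
    static boolean runUntil(MockTime time, BooleanSupplier condition, long stepMs, int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            time.sleep(stepMs);
        }
        return condition.getAsBoolean();
    }
}
```

The point of advancing the clock inside the loop, rather than once up front, is that the request under test is only created inside sender.runOnce(), so its creation time differs from the batch's creation time and the timeout must elapse relative to the request.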
@chia7712 @junrao @kirktrue Thanks a lot for your review. The PR has now passed CI. PTAL