KAFKA-13559: Fix issue where responses intermittently take 300+ ms, even when the server is idle.
Request processing is delayed by 300 ms when both of the following conditions hold:
- Client-server communication uses an SSL socket
- More than one request is in the same network packet
The 300 ms delay occurs when the socket has no more data to read but the SslTransportLayer buffer still holds unprocessed data. The sequence of events that leads to this situation is, at a high level:
- Step 1: The client sends more than one request in the same network packet.
- Step 2: The server processes the 1st request. While doing this, SslTransportLayer reads all of the bytes (containing multiple requests) from the socket and stores them in its buffer.
- Step 3: The server sends the response for the 1st request.
- Step 4: The server processes the 2nd request. This request is taken from the SslTransportLayer buffer instead of the socket. Because the socket itself has nothing left to read, "select(timeout)" blocks for 300 ms. This is where the delay occurs.
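The stall in Step 4 can be reproduced with plain `java.nio`, without SSL or Kafka. The sketch below is only an analogy: a `Pipe` stands in for the socket, a local `ByteBuffer` stands in for the SslTransportLayer buffer, and the class name `BufferedStallDemo` and its one-byte "requests" are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical demo class; not part of Kafka.
public class BufferedStallDemo {

    /**
     * Returns {readyOnFirstSelect, bytesLeftInBuffer, readyOnSecondSelect, elapsedMsOfSecondSelect}.
     */
    public static long[] run(long timeoutMs) {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();
            try (Pipe.SinkChannel client = pipe.sink();
                 Pipe.SourceChannel server = pipe.source()) {
                server.configureBlocking(false);
                server.register(selector, SelectionKey.OP_READ);

                // Step 1: two one-byte "requests" arrive in a single write.
                client.write(ByteBuffer.wrap(new byte[] {1, 2}));

                // Step 2: the channel is readable; the reader drains everything
                // into its own buffer, including the second request.
                int firstReady = selector.select(timeoutMs);
                selector.selectedKeys().clear();
                ByteBuffer buffer = ByteBuffer.allocate(16);
                while (buffer.position() < 2) {
                    server.read(buffer);
                }
                buffer.flip();
                buffer.get(); // consume request 1 only

                // Step 4: request 2 sits in the buffer, but the channel has no
                // unread bytes, so select() waits for the timeout and reports 0 keys.
                long start = System.nanoTime();
                int secondReady = selector.select(timeoutMs);
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;

                return new long[] {firstReady, buffer.remaining(), secondReady, elapsedMs};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = run(300);
        System.out.println("first select ready keys:  " + r[0]);
        System.out.println("bytes still buffered:     " + r[1]);
        System.out.println("second select ready keys: " + r[2]);
        System.out.println("second select took (ms):  " + r[3]);
    }
}
```

The second `select` reports no ready keys even though one request is still waiting in the buffer, which is the same shape as the delay described above.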
From the producer side, this happens when you produce records continuously in a tight loop and then suddenly stop for more than 300 ms.
To fix this, Selector sets "madeReadProgressLastPoll" to "true" after unmuting the channel if there is data in the buffer.
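A simplified model of the idea, not the actual Selector code: it assumes the poll loop skips blocking (uses a timeout of 0) when the flag is set and buffered data exists. Only the name `madeReadProgressLastPoll` comes from this PR; the class `PollTimeoutSketch`, the `dataInBuffers` field, and both method names are hypothetical.

```java
// Hypothetical, simplified model of the fix; not Kafka's Selector.
public class PollTimeoutSketch {
    // Flag named in this PR; assumed to gate whether poll may block.
    private boolean madeReadProgressLastPoll = false;
    // Assumed bookkeeping: true when some channel holds buffered, unprocessed bytes.
    private boolean dataInBuffers = false;

    /** Called when a channel is unmuted after its previous request was handled. */
    public void unmute(boolean channelHasBufferedData) {
        if (channelHasBufferedData) {
            dataInBuffers = true;
            // The fix: treat buffered data as read progress so the next poll does not block.
            madeReadProgressLastPoll = true;
        }
    }

    /** Timeout the next select() should use. */
    public long effectiveTimeout(long requestedTimeoutMs) {
        return (madeReadProgressLastPoll && dataInBuffers) ? 0L : requestedTimeoutMs;
    }

    public static void main(String[] args) {
        PollTimeoutSketch idle = new PollTimeoutSketch();
        idle.unmute(false);
        System.out.println("no buffered data: " + idle.effectiveTimeout(300));

        PollTimeoutSketch buffered = new PollTimeoutSketch();
        buffered.unmute(true);
        System.out.println("buffered data:    " + buffered.effectiveTimeout(300));
    }
}
```

With buffered data present, the next poll uses a zero timeout and the second request is processed immediately instead of waiting for the full 300 ms.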
Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
@badaiaqrandista Thanks for the update, looks good. But looks like there is a timing issue in the test since it has failed for the JDK8 PR build, can you take a look? Remember seeing it yesterday before the changes as well, so maybe a timing issue with the test itself.
@rajinisivaram With help from @splett2, the test is not failing anymore. Can you please have a look again when you're available? Thank you!!
Test failures not related.