KAFKA-15859: Make RemoteListOffsets call an async operation
This is the part-2 of the KIP-1075
To find the offset for a given timestamp, ListOffsets API is used by the client. When the topic is enabled with remote storage, then we have to fetch the remote indexes such as offset-index and time-index to serve the query. Also, the ListOffsets request can contain the query for multiple topics/partitions.
The time taken to read the indexes from remote storage is non-deterministic and the query is handled by the request-handler threads. If there are multiple LIST_OFFSETS queries and most of the request-handler threads are busy in reading the data from remote storage, then the other high-priority requests such as FETCH and PRODUCE might starve and be queued. This can lead to higher latency in producing/consuming messages.
In this patch, we have introduced a delayed operation for remote list-offsets call. If the timestamp need to be searched in the remote-storage, then the request-handler threads will pass-on the request to the remote-log-reader threads. And, the request gets handled in asynchronous fashion.
Covered the patch with unit and integration tests.
Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
@chia7712 @showuon @satishd
Call for review. PTAL.
Test failures are unrelated.
@kamalcph , I'd like to make sure this PR can be reviewed before KIP-1075 get approved. Is that right?
@kamalcph , I'd like to make sure this PR can be reviewed before KIP-1075 get approved. Is that right?
yes, this PR can be reviewed. There are no public API changes made in this PR. To define the timeout for delayed remote list offsets operation, reused the server request timeout since the tiered storage is not production ready. If it is not acceptable, then we may have to wait for the KIP-1075 approval.
Well, I'll review KIP-1075 first then.
@showuon @clolov
Is it possible to give this PR an early review to keep it in good shape? Thanks!
@kamalcph please fix conflicts, thanks :)
@showuon @chia7712 @satishd @clolov
The diff is ready for review. PTAL. Thanks!
@showuon Addressed your review comments. PTAL. Thanks!
Gentle ping for review.
@kamalcph Could you please fix the conflicts?
JDK11 test failures are unrelated to this PR, the tests were timed out.
Found 2 flaky test failures:
FLAKY ⚠️ MetricsDuringTopicCreationDeletionTest > "testMetricsDuringTopicCreateDelete(String).quorum=zk"
FLAKY ⚠️ OffsetsApiIntegrationTest > testAlterSinkConnectorOffsetsDifferentKafkaClusterTargeted()
Read env GRADLE_EXIT_CODE: 124
Read env THREAD_DUMP_URL: https://github.com/apache/kafka/actions/runs/10854248301/artifacts/1932678337
Gradle command timed out. These are partial results!
25602 tests cases run in 6h14m4s. 23379 PASSED ✅, 0 FAILED ❌, 2 FLAKY ⚠️ , 18 SKIPPED 🙈, and 0 errors.
Failing this step because the tests timed out. Thread dumps were taken and archived here: https://github.com/apache/kafka/actions/runs/10854248301/artifacts/1932678337
Error: Process completed with exit code 1.
@kamalcph thanks for checking the failed tests. They pass on my local. will merge this PR
./gradlew cleanTest :tools:test --tests MetadataQuorumCommandTest.testDescribeQuorumReplicationSuccessful :core:test --tests ListOffsetsIntegrationTest.testThreeNonCompressedRecordsInSeparateBatch --tests ConsumerBounceTest.testConsumptionWithBrokerFailures
@kamalcph any updates? or just trigger QA again?
I rebased the branch against trunk to retrigger the tests again.
I rebased the branch against trunk to retrigger the tests again.
Got it
Thank you all for the reviews!
@kamalcph, the test introduced in this PR is flaky on trunk https://ge.apache.org/scans/tests?search.names=CI%20workflow&search.rootProjectNames=kafka&search.tags=github%2Ctrunk&search.tasks=test&search.timeZoneId=America%2FNew_York&search.values=CI&tests.container=kafka.log.remote.RemoteLogOffsetReaderTest&tests.sortField=FLAKY
Can you take a look? I have filed KAFKA-17559 in the mean time. Thanks!
@mumrah Sorry for that flaky. I will take a look!
Opened #17214 to fix the flaky test. PTAL.
Thanks @junrao for the review comments! I will follow-up on them.