librdkafka icon indicating copy to clipboard operation
librdkafka copied to clipboard

`rd_kafka_query_watermark_offsets` continues beyond timeout expiry

Open bjosv opened this issue 1 year ago • 1 comments

Description

When rd_kafka_query_watermark_offsets is called with a timeout below 1000ms it will most likely not timeout within that time.

rd_kafka_query_watermark_offsets sends one request to get the low watermark, and one request to get the high watermark. These requests are then put in a queue, which is checked every second to see if any request has timed out. If a response arrives before this check, indifferent to the request timeout, then rd_kafka_query_watermark_offsets returns the received offsets.

#4460 contained a fix for the timeout, but was replaced with #4225 which gives the inaccurate timeout handling.

Proposed correction

Use a timeout when calling rd_kafka_q_serve. This is how its solved in other APIs like rd_kafka_list_groups and rd_kafka_offsets_for_times, and also used in #4460.

How to reproduce

Lower the timeout used in the testcase to a value under 1s, like 100ms.

diff --git a/tests/0031-get_offsets.c b/tests/0031-get_offsets.c
index 569e377d..8c86f238 100644
--- a/tests/0031-get_offsets.c
+++ b/tests/0031-get_offsets.c
@@ -48,7 +48,7 @@ void test_query_watermark_offsets_timeout(void) {
         rd_kafka_t *rk;
         rd_kafka_conf_t *conf;
         const char *bootstraps;
-        const int timeout_ms = 1000;
+        const int timeout_ms = 100;
 
         TEST_SKIP_MOCK_CLUSTER();

Run testcase:

> ./interactive_broker_version.py 3.4.0
> TESTS=0031 make run_seq
...
[0031_get_offsets_mock       /  0.002s] Created    kafka instance 0031_get_offsets_mock#producer-4
[0031_get_offsets_mock       /  0.166s] TEST FAILURE
### Test "0031_get_offsets_mock (test_query_watermark_offsets_timeout:55)" failed at 0031-get_offsets.c:71:test_query_watermark_offsets_timeout() at Fri Oct 18 09:25:59 2024: ###
Test assertion failed: "err == RD_KAFKA_RESP_ERR__TIMED_OUT": Querying watermark offsets should fail with _TIMED_OUT when RTT > timeout, instead got NO_ERROR

which gives a NO_ERROR, i.e. no timeout occurred in rd_kafka_query_watermark_offsets. Note that the testcase took 0.166s. The issue has also been verified by "printf-debugging".

Optionally: update the mock to respond after 800ms, instead of (int)(timeout_ms * 1.2)), the call will still not timeout after 100ms.

[0031_get_offsets_mock       /  0.003s] Created    kafka instance 0031_get_offsets_mock#producer-4
[0031_get_offsets_mock       /  0.847s] TEST FAILURE
### Test "0031_get_offsets_mock (test_query_watermark_offsets_timeout:55)" failed at 0031-get_offsets.c:71:test_query_watermark_offsets_timeout() at Fri Oct 18 10:16:46 2024: ###
Test assertion failed: "err == RD_KAFKA_RESP_ERR__TIMED_OUT": Querying watermark offsets should fail with _TIMED_OUT when RTT > timeout, instead got NO_ERROR

Checklist

  • [x] librdkafka version (release number or git tag): v2.6.0 and master
  • [x] Apache Kafka version: 3.4.0
  • [x] librdkafka client configuration: as used in our tests.
  • [x] Operating system: any
  • [x] Provide logs (with debug=.. as necessary) from librdkafka
  • [x] Provide broker log excerpts
  • [ ] Critical issue

bjosv avatar Oct 18 '24 08:10 bjosv

Unfortunately, there are some wordings in the CLA for librdkafka that blocks me from contributing with a fix, otherwise that would have been simpler.

bjosv avatar Oct 18 '24 08:10 bjosv

Thanks @bjosv we need to revert this change

emasab avatar Nov 15 '24 17:11 emasab