confluent-kafka-python
confluent-kafka-python copied to clipboard
Producer request timeout in container
Description
I used confluent-kafka-python to write a worker responsible for processing messages in a specific kafka topic. To implement EOS (Exactly Once Semantics), I enabled transactions and configured the producer and consumer settings accordingly. Since the tasks may involve a longer processing time, I set the producer's transaction.timeout.ms to 3 hours and also adjusted the broker's transaction.max.timeout.ms.
Currently, the worker runs fine on my local machine. However, when I run it in a local Docker container or in a production environment on Kubernetes, I often see the following messages:
%4|1738575666.358|REQTMOUT|rdkafka#producer-2| [thrd:TxnCoordinator]: TxnCoordinator/3: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575666.358|FAIL|rdkafka#producer-2| [thrd:TxnCoordinator]: TxnCoordinator: 10.11.123.134:9092: 1 request(s) timed out: disconnect (after 58468ms in state UP)
%4|1738575666.411|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.134:9092/bootstrap]: 10.11.123.134:9092/3: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575666.411|FAIL|rdkafka#producer-2| [thrd:10.11.123.134:9092/bootstrap]: 10.11.123.134:9092/3: 1 request(s) timed out: disconnect (after 26197ms in state UP)
%4|1738575667.420|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.134:9092/bootstrap]: 10.11.123.134:9092/3: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575667.420|FAIL|rdkafka#producer-2| [thrd:10.11.123.134:9092/bootstrap]: 10.11.123.134:9092/3: 1 request(s) timed out: disconnect (after 393ms in state UP, 1 identical error(s) suppressed)
%4|1738575667.924|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.132:9092/bootstrap]: 10.11.123.132:9092/1: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575667.924|FAIL|rdkafka#producer-2| [thrd:10.11.123.132:9092/bootstrap]: 10.11.123.132:9092/1: 1 request(s) timed out: disconnect (after 60053ms in state UP)
%4|1738575668.925|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575668.925|FAIL|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: 1 request(s) timed out: disconnect (after 486ms in state UP)
%4|1738575669.933|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%3|1738575669.934|FAIL|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: 1 request(s) timed out: disconnect (after 994ms in state UP, 1 identical error(s) suppressed)
%4|1738575670.942|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
%4|1738575671.952|REQTMOUT|rdkafka#producer-2| [thrd:10.11.123.133:9092/bootstrap]: 10.11.123.133:9092/2: Timed out 0 in-flight, 0 retry-queued, 1 out-queue, 0 partially-sent requests
It seems that the producer is encountering timeout when sending requests. Based on my online research, the requests are eventually sent successfully due to Kafka's retry mechanism. However, this issue keeps recurring, happening at least once every 15 minutes. When I restart the worker on my local machine, these issues disappear.
Another strange observation is that when I reduce the transaction.timeout.ms to a lower value (e.g., 30 minutes in my tests), these timeout messages no longer appear. However, as far as I know, this setting should not affect whether a request times out.
Has anyone else encountered a similar issue or knows the reason behind this behavior?
Checklist
Please provide the following information:
-
[x] confluent-kafka-python and librdkafka version (confluent_kafka.version('2.4.0') and confluent_kafka.libversion('2.4.0')):
-
[x] Apache Kafka broker version: 3.9.0
-
[ ] Client configuration: {...}
-
[x] Operating system: linux (k8s)、macOS (local machine)
-
[ ] Provide client logs (with 'debug': '..' as necessary)
-
[ ] Provide broker log excerpts
-
[ ] Critical issue
Apologies no one responded quickly to this issue. 3 hours is very long for transaction.timeout.ms, but your disconnect timeouts are well below that threshold you set in either situation. Did you by chance configure socket.timeout.ms or linger.ms separately which is causing issues before your transaction reaches any timeout? You may need to toggle socket.keepalive.enable with those long timeouts. In general it would be good to post your whole config if you are still experiencing issues.
Thank you for your reply. Based on your suggestion, I tried setting socket.keepalive.enable to true, but the issue still occurred. Afterwards, I tried setting socket.timeout.ms to 60000, which is the default value in librdkafka, and the problem no longer appeared. After further testing, I found that removing the socket.keepalive.enable = true setting didn’t seem to have any effect either. In the end, I only adjusted socket.timeout.ms to 60000.
What puzzles me is that I had not set socket.timeout.ms originally, and from my understanding, in that case the default value from librdkafka should have been applied. Since 60000 was already the default, this setting should not have fixed the problem. But in reality, the issue has not happened again.
In my worker, both a producer and a consumer are running. The following were the configurations I provided; the rest, as far as I understand, should be using librdkafka’s default values:
Producer:
{
"bootstrap.servers": "mykafka.cluster", // dummy cluster
"allow.auto.create.topics": false,
"transaction.id": "some_txn_id",
"enable.idempotence": true,
"acks": true,
"transaction.timeout.ms": 10800000
}
Consumer:
{
"bootstrap.servers": "mykafka.cluster", // dummy cluster
"group.id": "worker_group",
"auto_offset_reset": "earliest",
"partition.assignment.strategy": "cooperative-sticky",
"session.timeout.ms": 45000,
"allow.auto.create.topics": false,
"max.poll.interval.ms": 10800000,
"enable_auto_commit": false,
"isolation.level": "read_committed"
}