pulsar-client-python
pulsar-client-python copied to clipboard
Segmentation fault sometimes happens when running unit tests
See
- https://github.com/apache/pulsar-client-python/actions/runs/8242753702/job/22542297702?pr=202
- https://github.com/apache/pulsar-client-python/actions/runs/8242753702/job/22542206316?pr=202
...
2024-03-12T02:35:31.1396530Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ClientConnection:886 | [[::1]:52958 -> [::1]:6650] Handling incoming command: SEND_RECEIPT
2024-03-12T02:35:31.1398821Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ClientConnection:1498 | [[::1]:52958 -> [::1]:6650] Got receipt for producer: 0 -- msg: 3-- message id: (81,3,-1,-1)
2024-03-12T02:35:31.1401720Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ProducerImpl:940 | [persistent://public/default/my-python-topic-keyshare-sticky-1710210931.0501215, standalone-0-57] Received ack for msg 3
2024-03-12T02:35:31.1404238Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ClientConnection:861 | [[::1]:52958 -> [::1]:6650] Received a message from the server for consumer: 1
2024-03-12T02:35:31.1405802Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ConsumerImpl:516 | [persistent://public/default/my-python-topic-keyshare-sticky-1710210931.0501215, my-sub, 1] Received Message -- Size: 7
2024-03-12T02:35:31.1407524Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ConsumerImpl:564 | [persistent://public/default/my-python-topic-keyshare-sticky-1710210931.0501215, my-sub, 1] metadata.num_messages_in_batch() = 1
2024-03-12T02:35:31.1410975Z 2024-03-12 02:35:31.139 DEBUG [140455429797440] ConsumerImpl:566 | [persistent://public/default/my-python-topic-keyshare-sticky-1710210931.0501215, my-sub, 1] metadata.has_num_messages_in_batch() = 0
2024-03-12T02:35:31.1413569Z 2024-03-12 02:35:31.140 DEBUG [140455413016128] RetryableOperation:129 | Run operation get-partition-metadata-persistent://public/default/partitioned_topic_name_test, remaining time: 29900 ms
2024-03-12T02:35:35.1471950Z ....................................../tests/run-unit-tests.sh: line 28: 3956 Segmentation fault (core dumped) python3 pulsar_test.py
2024-03-12 02:35:31.040 INFO [140455429797440] RetryableOperation:114 | Reschedule get-partition-metadata-persistent://public/default/partitioned_topic_name_test for 100 ms, remaining time: 29900 ms
2024-03-12 02:35:31.040 DEBUG [140455429797440] ExecutorService:44 | Event loop of ExecutorService exits successfully
2024-03-12 02:35:31.040 DEBUG [140457878760320] ClientImpl:775 | listenerExecutorProvider_ is closed
2024-03-12 02:35:31.040 DEBUG [140457878760320] ClientImpl:780 | partitionListenerExecutorProvider_ is closed
2024-03-12 02:35:31.041 DEBUG [140457878760320] ProducerImpl:117 | [persistent://public/default/partitioned_topic_name_test-partition-0, standalone-0-53] ~ProducerImpl
2024-03-12 02:35:31.041 INFO [140457878760320] ProducerImpl:757 | Producer - [persistent://public/default/partitioned_topic_name_test-partition-0, standalone-0-53] , [batching = off]
2024-03-12 02:35:31.041 DEBUG [140457878760320] ProducerImpl:117 | [persistent://public/default/partitioned_topic_name_test-partition-1, standalone-0-54] ~ProducerImpl
2024-03-12 02:35:31.041 INFO [140457878760320] ProducerImpl:757 | Producer - [persistent://public/default/partitioned_topic_name_test-partition-1, standalone-0-54] , [batching = off]
2024-03-12 02:35:31.041 DEBUG [140457878760320] ProducerImpl:117 | [persistent://public/default/partitioned_topic_name_test-partition-2, standalone-0-55] ~ProducerImpl
2024-03-12 02:35:31.041 INFO [140457878760320] ProducerImpl:757 | Producer - [persistent://public/default/partitioned_topic_name_test-partition-2, standalone-0-55] , [batching = off]
# another test...
2024-03-12 02:35:31.140 DEBUG [140455413016128] RetryableOperation:129 | Run operation get-partition-metadata-persistent://public/default/partitioned_topic_name_test, remaining time: 29900 ms
It seems that the RetryableOperation is still scheduled after client.close() is called.
The root cause might be an incorrect implementation of the shutdown of a partitioned producer.
Here are the output when I ran PulsarTest.test_shutdown_client with debug level logs:
2024-03-12 15:09:35.278 DEBUG [0x1e4d42100] ClientImpl:699 | 1 producers and 0 consumers have been shutdown.
2024-03-12 15:09:35.278 DEBUG [0x16df5f000] ConnectionPool:86 | Got connection from pool for pulsar://localhost:6650-0 use_count: 2 @ 0x15ca15e00
2024-03-12 15:09:35.278 INFO [0x1e4d42100] ClientConnection:1325 | [127.0.0.1:63468 -> 127.0.0.1:6650] Connection disconnected (refCnt: 2)
2024-03-12 15:09:35.278 DEBUG [0x16de47000] ClientConnection:1262 | [127.0.0.1:63468 -> 127.0.0.1:6650] Ignoring timer cancelled event, code[system:89]
2024-03-12 15:09:35.278 INFO [0x1e4d42100] HandlerBase:150 | [persistent://public/default/partitioned_topic_name_test-partition-0, standalone-155-58] Schedule reconnection in 0.1 s
2024-03-12 15:09:35.279 INFO [0x1e4d42100] HandlerBase:150 | [persistent://public/default/partitioned_topic_name_test-partition-1, standalone-155-59] Schedule reconnection in 0.1 s
2024-03-12 15:09:35.279 DEBUG [0x16df5f000] BinaryProtoLookupService:148 | PartitionMetadataLookup failed for persistent://public/default/partitioned_topic_name_test, result NotConnected
2024-03-12 15:09:35.279 INFO [0x1e4d42100] HandlerBase:150 | [persistent://public/default/partitioned_topic_name_test-partition-2, standalone-155-60] Schedule reconnection in 0.1 s
As you can see the internal producers still scheduled reconnections after shutdown.
After using another topic name in test_shutdown_client, no reconnection happened:
024-03-12 15:20:36.796 DEBUG [0x1e4d42100] ClientImpl:699 | 1 producers and 0 consumers have been shutdown.
2024-03-12 15:20:36.796 INFO [0x1e4d42100] ClientConnection:1325 | [127.0.0.1:63850 -> 127.0.0.1:6650] Connection disconnected (refCnt: 1)
2024-03-12 15:20:36.796 DEBUG [0x1e4d42100] ClientImpl:705 | ConnectionPool is closed
2024-03-12 15:20:36.796 DEBUG [0x16fd93000] ClientConnection:1262 | [127.0.0.1:63850 -> 127.0.0.1:6650] Ignoring timer cancelled event, code[system:89]
2024-03-12 15:20:36.796 INFO [0x16fd93000] ClientConnection:274 | [127.0.0.1:63850 -> 127.0.0.1:6650] Destroyed connection to pulsar://localhost:6650-0
2024-03-12 15:20:36.796 DEBUG [0x16fd93000] ExecutorService:44 | Event loop of ExecutorService exits successfully
2024-03-12 15:20:36.796 DEBUG [0x1e4d42100] ClientImpl:715 | ioExecutorProvider_ is closed
2024-03-12 15:20:36.796 DEBUG [0x1e4d42100] ClientImpl:720 | listenerExecutorProvider_ is closed
2024-03-12 15:20:36.796 DEBUG [0x1e4d42100] ClientImpl:725 | partitionListenerExecutorProvider_ is closed