pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[Bug] [txn] can not do txn-produce after several pulsar-perf op and cluster restart

Open TakaHiR07 opened this issue 2 years ago • 0 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Version

  • os: Linux 4.19.160-0419160-generic x86_64
  • pulsar-server: 2.9.3
  • pulsar-client: 2.9.2
  • 3 nodes cluster
  • use pulsar-perf to simulate produce/consume, add config "-txn -nmt 50" to enable txn-produce or txn-consume
  • server conf: transactionCoordinatorEnabled=true, acknowledgmentAtBatchIndexLevelEnabled=false, E-Qw-Qa is 3-3-2

Minimal reproduce step

  1. use pulsar-perf to do several perf-test, both txn and non-txn produce/consume work well
  2. during the test, cluster have several time restart broker and bookie
  3. txn-produce failed, but non-txn-produce succeed

What did you expect to see?

both txn-produce and non-txn-produce succeed

What did you see instead?

Broker log do not have any error log, and as the following log shown, it look like the operation timeout is over 30s and client close the producer. But although I add operation-timeout client-config in pulsar-perf and change it to a large value 8min, the problem also occur.

And if I remove "-txn -nmt 50" in pulsar-perf, change it to non-txn produce, it can succeed. The problem seems refer to the txn module.

Besides, although I restart broker and bookie again, txn-produce still fail. I don't know how to solve it without cleanup the whole cluster.

client error log:

15:58:33.534 [pulsar-client-io-3-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://test/test/test_txn_perf_tmp-partition-2] [null] Failed to create producer: request timeout {'durationMs': '30000'}
15:58:33.534 [pulsar-client-io-1-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://test/test/test_txn_perf_tmp-partition-2] [null] Failed to create producer: request timeout {'durationMs': '30000'}
15:58:33.534 [pulsar-client-io-2-1] ERROR org.apache.pulsar.client.impl.PartitionedProducerImpl - [persistent://test/test/test_txn_perf_tmp] Could not create partitioned producer.
15:58:33.534 [pulsar-client-io-3-1] ERROR org.apache.pulsar.client.impl.PartitionedProducerImpl - [persistent://test/test/test_txn_perf_tmp] Could not create partitioned producer.
15:58:33.534 [pulsar-client-io-1-1] ERROR org.apache.pulsar.client.impl.PartitionedProducerImpl - [persistent://test/test/test_txn_perf_tmp] Could not create partitioned producer.
15:58:33.538 [pulsar-perf-producer-exec-1-3] ERROR perf.PerformanceProducer - Got error
java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: request timeout {'durationMs': '30000'}
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_121]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_121]
        at perf.PerformanceProducer.runProducer(PerformanceProducer.java:540)

server log:

17:15:32.715 [pulsar-io-27-15] INFO  org.apache.pulsar.broker.service.ServerCnx - New connection from 
17:15:32.715 [pulsar-io-27-16] INFO  org.apache.pulsar.broker.service.ServerCnx - New connection from 
17:15:32.715 [pulsar-io-27-17] INFO  org.apache.pulsar.broker.service.ServerCnx - New connection from 
17:15:32.721 [pulsar-io-27-16] INFO  org.apache.pulsar.broker.service.ServerCnx -[persistent://test/test/test_txn_perf_tmp-partition-0] Creating producer. producerId=0
17:15:32.721 [pulsar-io-27-15] INFO  org.apache.pulsar.broker.service.ServerCnx - [persistent://test/test/test_txn_perf_tmp-partition-0] Creating producer. producerId=0
17:15:32.721 [pulsar-io-27-17] INFO  org.apache.pulsar.broker.service.ServerCnx - [persistent://test/test/test_txn_perf_tmp-partition-0] Creating producer. producerId=0
17:15:32.722 [ForkJoinPool.commonPool-worker-43] INFO  org.apache.pulsar.broker.service.ServerCnx -  persistent://test/test/test_txn_perf_tmp-partition-0 configured with schema false
17:15:32.722 [ForkJoinPool.commonPool-worker-50] INFO  org.apache.pulsar.broker.service.ServerCnx -  persistent://test/test/test_txn_perf_tmp-partition-0 configured with schema false
17:15:32.722 [ForkJoinPool.commonPool-worker-50] INFO  org.apache.pulsar.broker.service.ServerCnx -  persistent://test/test/test_txn_perf_tmp-partition-0 configured with schema false
17:16:32.716 [pulsar-io-27-16] INFO  org.apache.pulsar.broker.service.ServerCnx - Closed producer before its creation was completed. producerId=0
17:16:32.726 [pulsar-io-27-17] INFO  org.apache.pulsar.broker.service.ServerCnx - Closed producer before its creation was completed. producerId=1
17:16:32.726 [pulsar-io-27-15] INFO  org.apache.pulsar.broker.service.ServerCnx - Closed producer before its creation was completed. producerId=2
17:16:32.746 [pulsar-io-27-15] INFO  org.apache.pulsar.broker.service.ServerCnx - Closed connection from 
17:16:32.746 [pulsar-io-27-16] INFO  org.apache.pulsar.broker.service.ServerCnx - Closed connection from 
17:16:32.746 [pulsar-io-27-17] INFO  org.apache.pulsar.broker.service.ServerCnx - Closed connection from 

Anything else?

No response

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

TakaHiR07 avatar Aug 10 '22 09:08 TakaHiR07

@congbobo184 @liangyepianzhou Could you please help check this issue?

codelipenghui avatar Aug 12 '22 12:08 codelipenghui

I have added many log in the code, and found that it is blocked in topic.checkIfTransactionBufferRecoverCompletely in handleProducer().

Then I found that the object transactionBufferFuture in TopicTransactionBuffer was not completed, because it did not finish TopicTransactionBufferRecover#run().

The problem was blocked in the reader.readNext() method. I found that reader can read the 1,2,3 message successfully, but when reader read the 4th message, it would block and thus over the 30s timeout.

企业微信截图_d1504942-585c-40ac-9fab-e819d8bd9cf8

TakaHiR07 avatar Aug 15 '22 02:08 TakaHiR07

hi @TakaHiR07 , sorry it took a long time to reply. https://github.com/apache/pulsar/pull/17051 This Pr will fix the reader.hasMoreEvents has been blocked problem. please re-build the branch-2.9 or wait for tag 2.9.4. If still have the same problem, please ping me.

congbobo184 avatar Aug 30 '22 04:08 congbobo184

hi @TakaHiR07 , sorry it took a long time to reply. #17051 This Pr will fix the reader.hasMoreEvents has been blocked problem. please re-build the branch-2.9 or wait for tag 2.9.4. If still have the same problem, please ping me.

@congbobo184 I re-build the branch-2.9, restart my cluster. This time I can successfully txn_produce on some topic, but when I try txn_produce on a 10-partition topic, client still has the same timeout exception.

The client log is the same as before. For the server log, this time it seems some partition has finish the transaction recover and can successfully create producer. But some other partition do not finish.

Besides, I observe this log between I restart cluster and create txn_producer, which is not occur before. I don't know whether it is the possible reason.

18:27:19.260 [transaction_coordinator_TransactionCoordinatorID(id=6)thread_factory-70-1] ERROR org.apache.pulsar.broker.TransactionMetadataStoreService - End transaction fail! TxnId : (6,2309004), TxnAction : 1 org.apache.pulsar.transaction.coordinator.exceptions.CoordinatorException$CoordinatorNotFoundException: Transaction coordinator with id 6 not found! 18:27:19.260 [transaction-coordinator-thread-factory-66-1] INFO org.apache.pulsar.broker.TransactionMetadataStoreService - Added new transaction meta store TransactionCoordinatorID(id=6)

TakaHiR07 avatar Sep 02 '22 11:09 TakaHiR07

@TakaHiR07 Hi what version of the client you are using?

18:27:19.260 [transaction_coordinator_TransactionCoordinatorID(id=6)thread_factory-70-1] ERROR org.apache.pulsar.broker.TransactionMetadataStoreService - End transaction fail! TxnId : (6,2309004), TxnAction : 1 org.apache.pulsar.transaction.coordinator.exceptions.CoordinatorException$CoordinatorNotFoundException: Transaction coordinator with id 6 not found! 18:27:19.260 [transaction-coordinator-thread-factory-66-1] INFO org.apache.pulsar.broker.TransactionMetadataStoreService - Added new transaction meta store TransactionCoordinatorID(id=6)

this log is normal and does not affect behavior, we will change it to warn or debug level.

for creating a producer timeout problem, can you conveniently provide the jstack, dump and log file? https://github.com/apache/pulsar/issues/17040#issuecomment-1214544142 Is the thread still stuck here?

congbobo184 avatar Sep 05 '22 07:09 congbobo184

@TakaHiR07 Hi what version of the client you are using?

18:27:19.260 [transaction_coordinator_TransactionCoordinatorID(id=6)thread_factory-70-1] ERROR org.apache.pulsar.broker.TransactionMetadataStoreService - End transaction fail! TxnId : (6,2309004), TxnAction : 1 org.apache.pulsar.transaction.coordinator.exceptions.CoordinatorException$CoordinatorNotFoundException: Transaction coordinator with id 6 not found! 18:27:19.260 [transaction-coordinator-thread-factory-66-1] INFO org.apache.pulsar.broker.TransactionMetadataStoreService - Added new transaction meta store TransactionCoordinatorID(id=6)

this log is normal and does not affect behavior, we will change it to warn or debug level.

for creating a producer timeout problem, can you conveniently provide the jstack, dump and log file? #17040 (comment) Is the thread still stuck here?

Oh, still 2.9.2, I notice the pr-17040 fix on client side, not on broker side. I would try to update the client version and test again. Thanks a lot.

TakaHiR07 avatar Sep 05 '22 07:09 TakaHiR07