pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[Bug] An error log occurs when running healthcheck after upgrading to 3.0.5

Open erobot opened this issue 1 year ago • 3 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Read release policy

  • [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

OS: linux Java version: 17 Pulsar version: 3.0.5

Minimal reproduce step

  1. do healthcheck
  2. error log occurs

What did you expect to see?

no error log

What did you see instead?

Error log occurs at the last line.

2024-05-20T20:41:14,220+0800 [pulsar-web-42-7] INFO  org.apache.pulsar.broker.admin.impl.BrokersBase - [broker-admin] Running healthCheck with topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck
2024-05-20T20:41:14,221+0800 [pulsar-io-4-104] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] [null] Creating producer on cnx [id: 0x3e9b8088, L:/10.130.0.150:54620 - R:10.130.0.150/10.130.0.150:6650]
2024-05-20T20:41:14,224+0800 [broker-topic-workers-OrderedExecutor-15-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.130.0.150:54620] Created new producer: Producer{topic=SystemTopic{topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck}, client=[id: 0xa4af907c, L:/10.130.0.150:6650 - R:/10.130.0.150:54620] [SR:10.130.0.150, state:Connected], producerName=pulsarxx-3211-59, producerId=59}
2024-05-20T20:41:14,224+0800 [pulsar-io-4-104] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] [pulsarxx-3211-59] Created producer on cnx [id: 0x3e9b8088, L:/10.130.0.150:54620 - R:10.130.0.150/10.130.0.150:6650]
2024-05-20T20:41:14,226+0800 [pulsar-io-4-104] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck][healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Subscribing to topic on cnx [id: 0x3e9b8088, L:/10.130.0.150:54620 - R:10.130.0.150/10.130.0.150:6650], consumerId 59
2024-05-20T20:41:14,226+0800 [pulsar-io-4-105] INFO  org.apache.pulsar.broker.service.ServerCnx - [[id: 0xa4af907c, L:/10.130.0.150:6650 - R:/10.130.0.150:54620] [SR:10.130.0.150, state:Connected]] Subscribing on topic persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck / healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423. consumerId: 59
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck][healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Creating non-durable subscription at msg id 9223372036854775807:9223372036854775807:-1:-1 - {}
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO  org.apache.bookkeeper.mledger.impl.NonDurableCursorImpl - [pulsar/pulsarxx/10.130.0.150:8080/persistent/healthcheck] Created non-durable cursor read-position=83555506:59 mark-delete-position=83555506:58
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/pulsarxx/10.130.0.150:8080/persistent/healthcheck] Opened new cursor: NonDurableCursorImpl{ledger=pulsar/pulsarxx/10.130.0.150:8080/persistent/healthcheck, cursor=healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423, ackPos=83555506:58, readPos=83555506:59}
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO  org.apache.pulsar.broker.service.persistent.DispatchRateLimiter - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] configured SUBSCRIPTION message-dispatch rate at broker DispatchRateImpl(dispatchThrottlingRateInMsg=0, dispatchThrottlingRateInByte=50000000, relativeToPublishRate=false, ratePeriodInSecond=1)
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO  org.apache.pulsar.broker.service.persistent.DispatchRateLimiter - setting message-dispatch-rate DispatchRateImpl(dispatchThrottlingRateInMsg=0, dispatchThrottlingRateInByte=50000000, relativeToPublishRate=false, ratePeriodInSecond=1)
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [pulsar/pulsarxx/10.130.0.150:8080/persistent/healthcheck-healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Rewind from 83555506:59 to 83555506:59
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] Disabled replicated subscriptions controller
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.130.0.150:54620] Created subscription on topic persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck / healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423
2024-05-20T20:41:14,229+0800 [pulsar-io-4-104] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck][healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Subscribed to topic on 10.130.0.150/10.130.0.150:6650 -- consumer: 59
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO  org.apache.pulsar.broker.service.ServerCnx - [SystemTopic{topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck}][pulsarxx-3211-59] Closing producer on cnx /10.130.0.150:54620. producerId=59
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO  org.apache.pulsar.broker.service.ServerCnx - [SystemTopic{topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck}][pulsarxx-3211-59] Closed producer on cnx /10.130.0.150:54620. producerId=59
2024-05-20T20:41:14,241+0800 [pulsar-io-4-104] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] [pulsarxx-3211-59] Closed Producer
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.130.0.150:54620] Closing consumer: consumerId=59
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO  org.apache.pulsar.broker.service.AbstractDispatcherSingleActiveConsumer - Removing consumer Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck, name=healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423}, consumerId=59, consumerName=3217b, address=[id: 0xa4af907c, L:/10.130.0.150:6650 - R:/10.130.0.150:54620] [SR:10.130.0.150, state:Connected]}
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO  org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck][healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Successfully closed subscription [NonDurableCursorImpl{ledger=pulsar/pulsarxx/10.130.0.150:8080/persistent/healthcheck, cursor=healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423, ackPos=83555506:59, readPos=83555506:60}]
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO  org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck][healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Successfully closed dispatcher for reader
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.130.0.150:54620] Closed consumer, consumerId=59
2024-05-20T20:41:14,241+0800 [pulsar-io-4-104] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] [healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Closed consumer
2024-05-20T20:41:14,242+0800 [broker-client-shared-internal-executor-6-1] INFO  org.apache.pulsar.broker.admin.impl.BrokersBase - [broker-admin] Successfully run health check.
2024-05-20T20:41:14,242+0800 [broker-client-shared-internal-executor-6-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [20/May/2024:20:41:14 +0800] "GET /admin/v2/brokers/health HTTP/1.1" 200 2 "-" "-" 23
2024-05-20T20:41:14,251+0800 [broker-topic-workers-OrderedExecutor-15-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck / healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423-Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck, name=healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423}, consumerId=59, consumerName=3217b, address=[id: 0xa4af907c, L:/10.130.0.150:6650 - R:/10.130.0.150:54620] [SR:10.130.0.150, state:Connected]}] Error reading entries at 83555506:60 : Cursor was already closed - Retrying to read in 15.0 seconds

Anything else?

may be related to this pr since it raises the same exception CursorAlreadyClosedException. https://github.com/apache/pulsar/pull/22454/files#diff-fad355f91bd15cc041161f9a46fce62b7fee87fbfb8f0ff8a8b724a1bd1f29ee

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

erobot avatar May 20 '24 12:05 erobot

@erobot Did I understand correctly, that the main impact is the unnecessary ERROR log entry while the healthcheck performs without other issues?

lhotari avatar May 20 '24 17:05 lhotari

It looks we are reading entries from a closed cursor.

I searched the code, the issue may caused by concurrency:

  1. When a consumer added to a dispatcher, dispatcher will call scheduleReadOnActiveConsumer to pick an active consumer, then trigger readMoreEntries, see here
  2. After consumer created, we'll call receiveAsync to read messages, if incomingQueue is empty, send a FlowCommand to the broker, PersistentDispatherSingleActiveConsumer handle flow command in an async executor, see here

After consumer added to dispatcher, read entries from BK, if read entries completed but broker hasn't had a chance to send them to the client yet, at the same time, user call receiveAsync to read messages but incomingQueue is empty, it will send a flow command to the dispatcher and dispatcher handle it in an async executor.

When the entries are sent to the client, user receive and process it and close the consumer(subscription) in the finally. If the flow command in the async executor triggered after the cursor closed, the described issue will happen.

dao-jun avatar May 20 '24 20:05 dao-jun

@erobot Did I understand correctly, that the main impact is the unnecessary ERROR log entry while the healthcheck performs without other issues?

@lhotari Yes, I have not notice other issues yet.

erobot avatar May 21 '24 13:05 erobot

I got this error as well . this seems cause bookie put into quarantined state sometimes . @lhotari this caused read success rate lower , for our testing environment it is looking really bad , only 80% sometimes because we dont have many reads :)

When it is correct it shows

2024-08-11 04:51:46.766 | 2024-08-11T04:51:46,766+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer - [persistent://pulsar/pulsar/pulsar-broker-1.pulsar-broker.pulsar.svc.cluster.local:8080/healthcheck / healthCheck-005d1429-60ef-4d4b-b981-2c191806afe7-Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/pulsar/pulsar-broker-1.pulsar-broker.pulsar.svc.cluster.local:8080/healthcheck, name=healthCheck-005d1429-60ef-4d4b-b981-2c191806afe7}, consumerId=1175, consumerName=25afd, address=[id: 0x6813c119, L:/10.156.139.29:6650 - R:/10.156.139.29:55440] [SR:10.156.139.29, state:Connected]}] Skipping read retry: Current Consumer null, havePendingRead false

btw, i have three clusters, one of them having this issue .

wallacepeng avatar Aug 10 '24 20:08 wallacepeng