[Bug] An error log occurs when running healthcheck after upgrading to 3.0.5
Search before asking
- [X] I searched in the issues and found nothing similar.
Read release policy
- [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.
Version
OS: linux Java version: 17 Pulsar version: 3.0.5
Minimal reproduce step
- do healthcheck
- error log occurs
What did you expect to see?
no error log
What did you see instead?
Error log occurs at the last line.
2024-05-20T20:41:14,220+0800 [pulsar-web-42-7] INFO org.apache.pulsar.broker.admin.impl.BrokersBase - [broker-admin] Running healthCheck with topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck
2024-05-20T20:41:14,221+0800 [pulsar-io-4-104] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] [null] Creating producer on cnx [id: 0x3e9b8088, L:/10.130.0.150:54620 - R:10.130.0.150/10.130.0.150:6650]
2024-05-20T20:41:14,224+0800 [broker-topic-workers-OrderedExecutor-15-0] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.130.0.150:54620] Created new producer: Producer{topic=SystemTopic{topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck}, client=[id: 0xa4af907c, L:/10.130.0.150:6650 - R:/10.130.0.150:54620] [SR:10.130.0.150, state:Connected], producerName=pulsarxx-3211-59, producerId=59}
2024-05-20T20:41:14,224+0800 [pulsar-io-4-104] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] [pulsarxx-3211-59] Created producer on cnx [id: 0x3e9b8088, L:/10.130.0.150:54620 - R:10.130.0.150/10.130.0.150:6650]
2024-05-20T20:41:14,226+0800 [pulsar-io-4-104] INFO org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck][healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Subscribing to topic on cnx [id: 0x3e9b8088, L:/10.130.0.150:54620 - R:10.130.0.150/10.130.0.150:6650], consumerId 59
2024-05-20T20:41:14,226+0800 [pulsar-io-4-105] INFO org.apache.pulsar.broker.service.ServerCnx - [[id: 0xa4af907c, L:/10.130.0.150:6650 - R:/10.130.0.150:54620] [SR:10.130.0.150, state:Connected]] Subscribing on topic persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck / healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423. consumerId: 59
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck][healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Creating non-durable subscription at msg id 9223372036854775807:9223372036854775807:-1:-1 - {}
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO org.apache.bookkeeper.mledger.impl.NonDurableCursorImpl - [pulsar/pulsarxx/10.130.0.150:8080/persistent/healthcheck] Created non-durable cursor read-position=83555506:59 mark-delete-position=83555506:58
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [pulsar/pulsarxx/10.130.0.150:8080/persistent/healthcheck] Opened new cursor: NonDurableCursorImpl{ledger=pulsar/pulsarxx/10.130.0.150:8080/persistent/healthcheck, cursor=healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423, ackPos=83555506:58, readPos=83555506:59}
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO org.apache.pulsar.broker.service.persistent.DispatchRateLimiter - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] configured SUBSCRIPTION message-dispatch rate at broker DispatchRateImpl(dispatchThrottlingRateInMsg=0, dispatchThrottlingRateInByte=50000000, relativeToPublishRate=false, ratePeriodInSecond=1)
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO org.apache.pulsar.broker.service.persistent.DispatchRateLimiter - setting message-dispatch-rate DispatchRateImpl(dispatchThrottlingRateInMsg=0, dispatchThrottlingRateInByte=50000000, relativeToPublishRate=false, ratePeriodInSecond=1)
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [pulsar/pulsarxx/10.130.0.150:8080/persistent/healthcheck-healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Rewind from 83555506:59 to 83555506:59
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] Disabled replicated subscriptions controller
2024-05-20T20:41:14,229+0800 [BookKeeperClientWorker-OrderedExecutor-53-0] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.130.0.150:54620] Created subscription on topic persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck / healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423
2024-05-20T20:41:14,229+0800 [pulsar-io-4-104] INFO org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck][healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Subscribed to topic on 10.130.0.150/10.130.0.150:6650 -- consumer: 59
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO org.apache.pulsar.broker.service.ServerCnx - [SystemTopic{topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck}][pulsarxx-3211-59] Closing producer on cnx /10.130.0.150:54620. producerId=59
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO org.apache.pulsar.broker.service.ServerCnx - [SystemTopic{topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck}][pulsarxx-3211-59] Closed producer on cnx /10.130.0.150:54620. producerId=59
2024-05-20T20:41:14,241+0800 [pulsar-io-4-104] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] [pulsarxx-3211-59] Closed Producer
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.130.0.150:54620] Closing consumer: consumerId=59
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO org.apache.pulsar.broker.service.AbstractDispatcherSingleActiveConsumer - Removing consumer Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck, name=healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423}, consumerId=59, consumerName=3217b, address=[id: 0xa4af907c, L:/10.130.0.150:6650 - R:/10.130.0.150:54620] [SR:10.130.0.150, state:Connected]}
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck][healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Successfully closed subscription [NonDurableCursorImpl{ledger=pulsar/pulsarxx/10.130.0.150:8080/persistent/healthcheck, cursor=healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423, ackPos=83555506:59, readPos=83555506:60}]
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO org.apache.pulsar.broker.service.persistent.PersistentSubscription - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck][healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Successfully closed dispatcher for reader
2024-05-20T20:41:14,241+0800 [pulsar-io-4-105] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.130.0.150:54620] Closed consumer, consumerId=59
2024-05-20T20:41:14,241+0800 [pulsar-io-4-104] INFO org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck] [healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423] Closed consumer
2024-05-20T20:41:14,242+0800 [broker-client-shared-internal-executor-6-1] INFO org.apache.pulsar.broker.admin.impl.BrokersBase - [broker-admin] Successfully run health check.
2024-05-20T20:41:14,242+0800 [broker-client-shared-internal-executor-6-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [20/May/2024:20:41:14 +0800] "GET /admin/v2/brokers/health HTTP/1.1" 200 2 "-" "-" 23
2024-05-20T20:41:14,251+0800 [broker-topic-workers-OrderedExecutor-15-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer - [persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck / healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423-Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/pulsarxx/10.130.0.150:8080/healthcheck, name=healthCheck-d57f97cc-1ca2-483c-8622-2e7bb0687423}, consumerId=59, consumerName=3217b, address=[id: 0xa4af907c, L:/10.130.0.150:6650 - R:/10.130.0.150:54620] [SR:10.130.0.150, state:Connected]}] Error reading entries at 83555506:60 : Cursor was already closed - Retrying to read in 15.0 seconds
Anything else?
may be related to this pr since it raises the same exception CursorAlreadyClosedException. https://github.com/apache/pulsar/pull/22454/files#diff-fad355f91bd15cc041161f9a46fce62b7fee87fbfb8f0ff8a8b724a1bd1f29ee
Are you willing to submit a PR?
- [ ] I'm willing to submit a PR!
@erobot Did I understand correctly, that the main impact is the unnecessary ERROR log entry while the healthcheck performs without other issues?
It looks we are reading entries from a closed cursor.
I searched the code, the issue may caused by concurrency:
- When a consumer added to a dispatcher, dispatcher will call
scheduleReadOnActiveConsumerto pick an active consumer, then triggerreadMoreEntries, see here - After consumer created, we'll call
receiveAsyncto read messages, ifincomingQueueis empty, send aFlowCommandto the broker,PersistentDispatherSingleActiveConsumerhandleflowcommand in an async executor, see here
After consumer added to dispatcher, read entries from BK, if read entries completed but broker hasn't had a chance to send them to the client yet, at the same time, user call receiveAsync to read messages but incomingQueue is empty, it will send a flow command to the dispatcher and dispatcher handle it in an async executor.
When the entries are sent to the client, user receive and process it and close the consumer(subscription) in the finally.
If the flow command in the async executor triggered after the cursor closed, the described issue will happen.
@erobot Did I understand correctly, that the main impact is the unnecessary ERROR log entry while the healthcheck performs without other issues?
@lhotari Yes, I have not notice other issues yet.
I got this error as well . this seems cause bookie put into quarantined state sometimes . @lhotari this caused read success rate lower , for our testing environment it is looking really bad , only 80% sometimes because we dont have many reads :)
When it is correct it shows
2024-08-11 04:51:46.766 | 2024-08-11T04:51:46,766+0000 [broker-topic-workers-OrderedExecutor-3-0] INFO org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer - [persistent://pulsar/pulsar/pulsar-broker-1.pulsar-broker.pulsar.svc.cluster.local:8080/healthcheck / healthCheck-005d1429-60ef-4d4b-b981-2c191806afe7-Consumer{subscription=PersistentSubscription{topic=persistent://pulsar/pulsar/pulsar-broker-1.pulsar-broker.pulsar.svc.cluster.local:8080/healthcheck, name=healthCheck-005d1429-60ef-4d4b-b981-2c191806afe7}, consumerId=1175, consumerName=25afd, address=[id: 0x6813c119, L:/10.156.139.29:6650 - R:/10.156.139.29:55440] [SR:10.156.139.29, state:Connected]}] Skipping read retry: Current Consumer null, havePendingRead false
btw, i have three clusters, one of them having this issue .