kop
kop copied to clipboard
[Cloud hosted] KoP: When the broker restarts, it immediately dies with LockBusyException and restarts again
Edit: The OOM issue is better described in streamnative/kop#1278.
This issue is now only about the LockBusyException.
This may be low priority.
Describe the bug
I have a pulsar cluster with one broker, running pulsar-cloud:2.9.2.13 in production Cloud Hosted (poolmember martin), with KoP enabled. I am producing and consuming with the Kafka protocol. I also periodically stop a consumer, to cause a rebalance, and later start a new consumer. This ran for 3 to 4 hours. During this time the broker restarted 6 times.
To Reproduce Steps to reproduce the behavior:
I'm running the application at https://github.com/streamnative/continuous-verification/tree/master/workers/kop-produce-consume against the pulsar cluster kj-kop in the organization sndev in Cloud Hosted production.
Expected behavior I do not expect the broker to restart.
Screenshots
Additional context Logs from one of the times the broker died:
2022-04-26T21:17:53,087+0000 [metadata-store-6-1] INFO org.apache.pulsar.metadata.coordination.impl.LeaderElectionImpl - Acquired leadership on /loadbalance/leader
2022-04-26T21:17:53,087+0000 [metadata-store-6-1] INFO org.apache.pulsar.broker.PulsarService - This broker was elected leader
2022-04-26T21:17:53,093+0000 [main] INFO org.apache.pulsar.broker.namespace.OwnershipCache - Trying to acquire ownership of pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff
2022-04-26T21:17:53,121+0000 [metadata-store-6-1] WARN com.github.benmanes.caffeine.cache.LocalAsyncCache - Exception thrown during asynchronous load
java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
at org.apache.pulsar.metadata.coordination.impl.LockManagerImpl.lambda$acquireLock$2(LockManagerImpl.java:111) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) ~[?:?]
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
at org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl.lambda$null$4(ResourceLockImpl.java:134) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) ~[?:?]
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[?:?]
at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$null$7(ZKMetadataStore.java:139) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl.lambda$doRevalidate$20(ResourceLockImpl.java:297) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?]
... 7 more
2022-04-26T21:17:53,128+0000 [main] ERROR org.apache.pulsar.broker.namespace.NamespaceService - org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:329) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:294) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:728) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:350) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl.lambda$doRevalidate$20(ResourceLockImpl.java:297) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[?:?]
at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$null$7(ZKMetadataStore.java:139) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
2022-04-26T21:17:53,129+0000 [main] ERROR org.apache.pulsar.broker.PulsarService - Failed to start Pulsar service: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
org.apache.pulsar.broker.PulsarServerException: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:354) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:294) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:728) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:350) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
Caused by: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:329) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
... 4 more
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl.lambda$doRevalidate$20(ResourceLockImpl.java:297) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[?:?]
at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$null$7(ZKMetadataStore.java:139) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
2022-04-26T21:17:53,129+0000 [main] ERROR org.apache.pulsar.PulsarBrokerStarter - Failed to start pulsar service.
org.apache.pulsar.broker.PulsarServerException: org.apache.pulsar.broker.PulsarServerException: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:799) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:350) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
Caused by: org.apache.pulsar.broker.PulsarServerException: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:354) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:294) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:728) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
... 2 more
Caused by: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:329) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:294) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:728) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
... 2 more
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl.lambda$doRevalidate$20(ResourceLockImpl.java:297) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[?:?]
at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$null$7(ZKMetadataStore.java:139) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
While the client workload was running, I also added a Flink cluster, which may have started the function worker on the broker.
/namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
@codelipenghui It looks like the broker failed to call NamespaceService#registerNamespace on the heartbeat namespace like the same issue we have seen in the AoP test. I think we should not simply fix this issue by not configuring the advertisedAddress. Could you provide more help to diagnose this issue?
From the logs, looks like the issue is the broker got SessionExpiredException and crashed with heap OOM
2022-04-26T21:17:37,892+0000 [main-SendThread(kj-kop-zk:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x30000033dcc0006 for sever kj-kop-zk/10.56.5.88:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 7563ms for session id 0x30000033dcc0006
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1243) [org.apache.zookeeper-zookeeper-3.6.2.jar:3.6.2]
Terminating due to java.lang.OutOfMemoryError: Java heap space
After the broker starts up again, run into a transient state (not able to acquire lock LockBusyException)
2022-04-26T21:17:53,121+0000 [metadata-store-6-1] WARN com.github.benmanes.caffeine.cache.LocalAsyncCache - Exception thrown during asynchronous load
java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
at org.apache.pulsar.metadata.coordination.impl.LockManagerImpl.lambda$acquireLock$2(LockManagerImpl.java:111) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) ~[?:?]
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
at org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl.lambda$null$4(ResourceLockImpl.java:134) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) ~[?:?]
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[?:?]
at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$null$7(ZKMetadataStore.java:139) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl.lambda$doRevalidate$20(ResourceLockImpl.java:297) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?]
... 7 more
2022-04-26T21:17:53,128+0000 [main] ERROR org.apache.pulsar.broker.namespace.NamespaceService - org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:329) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:294) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:728) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:350) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl.lambda$doRevalidate$20(ResourceLockImpl.java:297) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[?:?]
at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$null$7(ZKMetadataStore.java:139) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
2022-04-26T21:17:53,129+0000 [main] ERROR org.apache.pulsar.broker.PulsarService - Failed to start Pulsar service: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
org.apache.pulsar.broker.PulsarServerException: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:354) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:294) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:728) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:350) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
Caused by: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:329) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
... 4 more
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl.lambda$doRevalidate$20(ResourceLockImpl.java:297) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[?:?]
at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$null$7(ZKMetadataStore.java:139) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
2022-04-26T21:17:53,129+0000 [main] ERROR org.apache.pulsar.PulsarBrokerStarter - Failed to start pulsar service.
org.apache.pulsar.broker.PulsarServerException: org.apache.pulsar.broker.PulsarServerException: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:799) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:350) [io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
Caused by: org.apache.pulsar.broker.PulsarServerException: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:354) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:294) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:728) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
... 2 more
Caused by: java.util.concurrent.ExecutionException: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
at org.apache.pulsar.broker.namespace.NamespaceService.registerNamespace(NamespaceService.java:329) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.namespace.NamespaceService.registerBootstrapNamespaces(NamespaceService.java:294) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:728) ~[io.streamnative-pulsar-broker-2.9.2.13.jar:2.9.2.13]
... 2 more
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$LockBusyException: Resource at /namespace/pulsar/kj-kop/kj-kop-broker-0.kj-kop-broker-headless.sndev.svc.cluster.local:8080/0x00000000_0xffffffff is already locked
at org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl.lambda$doRevalidate$20(ResourceLockImpl.java:297) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[?:?]
at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$null$7(ZKMetadataStore.java:139) ~[io.streamnative-pulsar-metadata-2.9.2.13.jar:2.9.2.13]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
After 20 seconds, the broker started successfully.
2022-04-26T21:18:25,067+0000 [main] INFO org.apache.pulsar.broker.web.WebService - HTTP Service started at http://0.0.0.0:8080
Looks like, at a time, the broker crashed but the zookeeper lock was not released in time, I'm not sure if it is related to the JVM OOM, I think we can try to increase the broker heap memory to 256MB or 512MB to see if the problem can be reproduced, currently is 128MB, maybe not able to hold more producers and consumers.
I was able to reproduce the issue after tearing down this cluster and creating a new one with the same name (kj-kop). The broker ran successfully with light load for a little over an hour. Then I added a Flink cluster. About 15 minutes later, the broker died with OOM. When it restarted, it failed again due to the LockBusyException. When it restarted again, it ran successfully.
This is not the first time I've had OOM issues with "tiny" cloud nodes with seemingly light workloads. My concern is that if tiny nodes are unable to stay up with light workloads, why would we offer them as an option? @merlimat has suggested that we try to tune them so that performance might degrade, instead of OOM dying. The Cloud Team has a PR open to implement his suggestions streamnative/cloud-api-server#1122.
Regarding the LockBusyException, @codelipenghui mentioned that "the broker crashed but the zookeeper lock was not released in time." Do we want to take any action on this? Or have we determined that it's OK?
Just an update, I let the workload run about 8 hours today, and my broker restarted 8 times. (which could be 4 OOMs with 2 restarts each time, if it follows the pattern.)
I didn't notice any OOMs until I added the Flink cluster. Will run overnight without a Flink cluster and see what happens.
Can you try to add -XX:ErrorFile=/tmp/hs_err_pid%p.log -XX:HeapDumpPath=/tmp -XX:+HeapDumpOnOutOfMemoryError to the JVM argument, so we can have more context when the OOM happens?
Regarding the LockBusyException, @codelipenghui mentioned that "the broker crashed but the zookeeper lock was not released in time." Do we want to take any action on this? Or have we determined that it's OK?
I will check the zookeeper logs, maybe enabling debug level log for Zookeeper will help with this issue.
The first broker restart is because of OOM, this is captured in issue streamnative/kop#1278.
The second broker restart is because of the LockBusyException. I will edit the title of this issue to show it is about the LockBusyException. This may be low priority.