Kafka/zookeeper fatal error when disk runs out
Self-Hosted Version
24.500
CPU Architecture
x86_64
Docker Version
26.1.4
Docker Compose Version
2.27.1
Steps to Reproduce
Install self-hosted. Run out of disk. Kafka/Zookeeper will fail, and the installation cannot be recovered (see logs) — it is doomed.
Expected Result
The service should not break to the point where it cannot be recovered. Maybe it should check the disk and shut itself down. I'd rather lose a bunch of transactions than lose everything.
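The "check the disk and shut itself down" idea could be expressed as a Compose healthcheck. This is only a sketch: the `kafka` service name and the `/var/lib/kafka/data` path are taken from the logs below, the 95% threshold is an assumption, and plain Compose merely reports the container as unhealthy rather than stopping it (dependent services will refuse to start against it; actually stopping it needs a restart policy or autoheal-style tooling).

```yaml
# Hypothetical addition to docker-compose.yml: mark the broker unhealthy
# once its data directory's filesystem passes 95% usage, before Kafka
# hits ENOSPC and fails its log directory permanently.
kafka:
  healthcheck:
    test: ["CMD-SHELL", "[ $$(df --output=pcent /var/lib/kafka/data | tail -1 | tr -dc '0-9') -lt 95 ]"]
    interval: 30s
    timeout: 10s
    retries: 3
```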
Actual Result
===> Launching kafka ...
[2024-06-17 04:55:46,504] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
[2024-06-17 04:55:47,568] INFO Starting the log cleaner (kafka.log.LogCleaner)
[2024-06-17 04:55:47,915] INFO Updated connection-accept-rate max connection creation rate to 2147483647 (kafka.network.ConnectionQuotas)
[2024-06-17 04:55:47,936] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1001] Created data-plane acceptor and processors for endpoint : ListenerName(PLAINTEXT) (kafka.network.SocketServer)
[2024-06-17 04:55:48,020] INFO Creating /brokers/ids/1001 (is it secure? false) (kafka.zk.KafkaZkClient)
[2024-06-17 04:55:48,033] INFO Stat of the created znode at /brokers/ids/1001 is: 1478,1478,1718600148028,1718600148028,1,0,0,72130214439944228,194,0,1478
(kafka.zk.KafkaZkClient)
[2024-06-17 04:55:48,034] INFO Registered broker 1001 at path /brokers/ids/1001 with addresses: PLAINTEXT://kafka:9092, czxid (broker epoch): 1478 (kafka.zk.KafkaZkClient)
[2024-06-17 04:55:48,242] INFO [/config/changes-event-process-thread]: Starting (kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread)
[2024-06-17 04:55:48,259] WARN [Controller id=1001, targetBrokerId=1001] Connection to node 1001 (kafka/172.19.0.13:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2024-06-17 04:55:48,260] WARN [RequestSendThread controllerId=1001] Controller 1001's connection to broker kafka:9092 (id: 1001 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.io.IOException: Connection to kafka:9092 (id: 1001 rack: null) failed.
at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:70)
at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:298)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:251)
at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:130)
[2024-06-17 04:55:48,341] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1001] Enabling request processing. (kafka.network.SocketServer)
[2024-06-17 04:55:48,344] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.DataPlaneAcceptor)
[2024-06-17 04:56:20,444] ERROR Error while appending records to ingest-transactions-0 in dir /var/lib/kafka/data (org.apache.kafka.storage.internals.log.LogDirFailureChannel)
java.io.IOException: No space left on device
at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at java.base/sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62)
at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:79)
at java.base/sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:280)
at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:90)
at org.apache.kafka.common.record.FileRecords.append(FileRecords.java:188)
at kafka.log.LogSegment.append(LogSegment.scala:160)
at kafka.log.LocalLog.append(LocalLog.scala:439)
at kafka.log.UnifiedLog.append(UnifiedLog.scala:911)
at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:719)
at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1313)
at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1301)
at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:1277)
at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
at scala.collection.mutable.HashMap.map(HashMap.scala:35)
at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:1265)
at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:868)
at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:686)
at kafka.server.KafkaApis.handle(KafkaApis.scala:180)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:153)
at java.base/java.lang.Thread.run(Thread.java:829)
[2024-06-17 04:56:20,445] WARN [ReplicaManager broker=1001] Stopping serving replicas in dir /var/lib/kafka/data (kafka.server.ReplicaManager)
[2024-06-17 04:56:20,464] WARN [ReplicaManager broker=1001] Broker 1001 stopped fetcher for partitions snuba-queries-0,outcomes-0,scheduled-subscriptions-transactions-0,events-0,cdc-0,profiles-call-tree-0,snuba-generic-metrics-sets-commit-log-0,__consumer_offsets-0,scheduled-subscriptions-events-0,outcomes-billing-0,ingest-performance-metrics-0,events-subscription-results-0,snuba-dead-letter-generic-events-0,transactions-0,snuba-dead-letter-replays-0,processed-profiles-0,snuba-dead-letter-metrics-0,snuba-attribution-0,scheduled-subscriptions-generic-metrics-distributions-0,snuba-generic-metrics-counters-commit-log-0,ingest-events-0,metrics-subscription-results-0,snuba-generic-metrics-gauges-commit-log-0,profiles-0,scheduled-subscriptions-generic-metrics-counters-0,scheduled-subscriptions-generic-metrics-sets-0,scheduled-subscriptions-generic-metrics-gauges-0,generic-metrics-subscription-results-0,snuba-transactions-commit-log-0,snuba-spans-0,ingest-replay-events-0,ingest-sessions-0,ingest-transactions-0,ingest-attachments-0,snuba-metrics-0,monitors-clock-tick-0,snuba-metrics-summaries-0,snuba-dead-letter-group-attributes-0,shared-resources-usage-0,ingest-monitors-0,ingest-occurrences-0,transactions-subscription-results-0,generic-events-0,snuba-dead-letter-generic-metrics-0,snuba-metrics-commit-log-0,ingest-metrics-0,group-attributes-0,snuba-generic-metrics-0,event-replacements-0,snuba-dead-letter-querylog-0,snuba-commit-log-0,snuba-generic-metrics-distributions-commit-log-0,ingest-replay-recordings-0,snuba-generic-events-commit-log-0,scheduled-subscriptions-metrics-0 and stopped moving logs for partitions because they are in the failed log directory /var/lib/kafka/data. (kafka.server.ReplicaManager)
[2024-06-17 04:56:20,464] WARN Stopping serving logs in dir /var/lib/kafka/data (kafka.log.LogManager)
[2024-06-17 04:56:20,466] ERROR Shutdown broker because all log dirs in /var/lib/kafka/data have failed (kafka.log.LogManager)
And Zookeeper's logs:
Using log4j config /etc/kafka/log4j.properties
===> User
uid=1000(appuser) gid=1000(appuser) groups=1000(appuser)
===> Configuring ...
Running in Zookeeper mode...
===> Running preflight checks ...
===> Check if /var/lib/kafka/data is writable ...
===> Check if Zookeeper is healthy ...
[2024-06-17 06:00:49,813] ERROR Unable to resolve address: zookeeper:2181 (org.apache.zookeeper.client.StaticHostProvider)
java.net.UnknownHostException: zookeeper: Name or service not known
at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:930)
at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1543)
at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1386)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1307)
at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:88)
at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:141)
at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:368)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1204)
[2024-06-17 06:00:49,818] WARN Session 0x0 for server zookeeper:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. (org.apache.zookeeper.ClientCnxn)
Shutting down and restarting fails with
dependency failed to start: container sentry-self-hosted-zookeeper-1 is unhealthy
Reinstalling fails with
dependency failed to start: container sentry-self-hosted-zookeeper-1 is unhealthy
Error in install/bootstrap-snuba.sh:3.
'$dcr snuba-api bootstrap --no-migrate --force' exited with status 1
-> ./install.sh:main:36
--> install/bootstrap-snuba.sh:source:3
Tried to follow the troubleshooting guide
sentry@workhorse:~/self-hosted$ docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
[+] Creating 1/0
✔ Container sentry-self-hosted-zookeeper-1 Created 0.0s
[+] Running 1/1
✔ Container sentry-self-hosted-zookeeper-1 Started 0.4s
dependency failed to start: container sentry-self-hosted-zookeeper-1 is unhealthy
Tried the nuclear option
sentry@workhorse:~/self-hosted$ docker compose down --volumes
[+] Running 13/13
✔ Container sentry-self-hosted-kafka-1 Removed 0.0s
✔ Container sentry-self-hosted-clickhouse-1 Removed 0.0s
✔ Container sentry-self-hosted-redis-1 Removed 0.0s
✔ Container sentry-self-hosted-zookeeper-1 Removed 0.1s
✔ Volume sentry-self-hosted_sentry-clickhouse-log Removed 0.0s
✔ Volume sentry-self-hosted_sentry-vroom Removed 0.4s
✔ Volume sentry-self-hosted_sentry-secrets Removed 0.0s
✔ Volume sentry-self-hosted_sentry-kafka-log Removed 0.4s
✔ Volume sentry-self-hosted_sentry-smtp Removed 0.4s
✔ Volume sentry-self-hosted_sentry-smtp-log Removed 0.4s
✔ Volume sentry-self-hosted_sentry-nginx-cache Removed 0.4s
✔ Volume sentry-self-hosted_sentry-zookeeper-log Removed 0.4s
✔ Network sentry-self-hosted_default Removed 0.1s
sentry@workhorse:~/self-hosted$ docker volume rm sentry-kafka
sentry-kafka
sentry@workhorse:~/self-hosted$ docker volume rm sentry-zookeeper
sentry-zookeeper
But then reinstall fails
Volume "sentry-self-hosted_sentry-nginx-cache" Created
external volume "sentry-zookeeper" not found
Error in install/upgrade-clickhouse.sh:15.
'$dc up -d clickhouse' exited with status 1
-> ./install.sh:main:25
--> install/upgrade-clickhouse.sh:source:15
Event ID
No response
Worst thing: I removed everything (docker system prune -a), but now the install always fails due to the missing volumes.
Apparently docker system prune -a does not clean volumes stored in a non-standard location, and that is why reinstalling fails.
To get it working again you need to run:
docker volume create sentry-zookeeper
docker volume create sentry-kafka
This issue has gone three weeks without activity. In another week, I will close it.
But! If you comment or otherwise update it, I will reset the clock, and if you remove the label Waiting for: Community, I will leave it alone ... forever!
"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀
Could the documentation be updated with a mention of this issue and how to properly resolve it? I've hit the problem a couple of times already; I fight to prevent disk space issues but fail. It would be really nice if Kafka did not fail so critically.
It's already documented: https://develop.sentry.dev/self-hosted/troubleshooting/#nuclear-option (see "Nuclear option" under the "Kafka" section).
Kafka can be deleted pretty safely since it only contains unprocessed data. So yes, you will lose some data, but only data that had not yet been processed into the database — if you have to lose anything, Kafka is fortunately not a bad place for it.
Unfortunately, I don't think there is much we can do to prevent Kafka from corrupting itself once your disk is full (this is probably more of a Kafka problem than a Sentry problem). You should add monitoring so you can act before disk usage reaches critical levels, provision with a bit more headroom, or accept that you will occasionally have to reset Kafka.
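The "monitor and act before it happens" advice could be a small watchdog run from cron. This is a hypothetical sketch, not part of self-hosted: the mount point, the 90% threshold, and the `~/self-hosted` checkout path are all assumptions you would adjust for your host.

```shell
#!/usr/bin/env bash
# Hypothetical disk watchdog: stop Sentry cleanly before the disk fills,
# so Kafka never hits "No space left on device". Threshold, mount point,
# and checkout path below are assumptions.
set -euo pipefail

# Succeeds if usage of the filesystem at $1 is below $2 percent.
disk_ok() {
  local usage
  # `df --output=pcent` prints a "Use%" header, then e.g. " 42%"; keep digits only.
  usage="$(df --output=pcent "$1" | tail -1 | tr -dc '0-9')"
  [ "$usage" -lt "$2" ]
}

main() {
  if ! disk_ok / 90; then
    echo "disk almost full, stopping Sentry before Kafka corrupts itself" >&2
    docker compose --project-directory "$HOME/self-hosted" stop
  fi
}

# Only act when invoked with --check (e.g. from cron every minute),
# so the file can also be sourced for its functions.
if [ "${1:-}" = "--check" ]; then
  main
fi
```

A crontab entry such as `* * * * * /usr/local/bin/sentry-disk-watchdog --check` would then trade a clean stop (and some lost events) for the unrecoverable corruption shown above.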
Anyway, hope the link will help if you need to resolve this in the future.