
Kafka brokers run out of disk space after a few days

Open romanlv opened this issue 5 years ago • 14 comments

Running the standard configuration in Google Cloud with ksql and connect disabled.

It works fine for several days (3-4 days) with minimal usage (it's a dev environment), but eventually something occupies all available disk space:

ERROR Error while loading log dir /opt/kafka/data-0/logs (kafka.log.LogManager)

java.io.IOException: No space left on device
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
    at kafka.log.ProducerStateManager$.kafka$log$ProducerStateManager$$writeSnapshot(ProducerStateManager.scala:449)
    at kafka.log.ProducerStateManager.takeSnapshot(ProducerStateManager.scala:671)
    at kafka.log.Log.recoverSegment(Log.scala:652)
    at kafka.log.Log.recoverLog(Log.scala:788)
    at kafka.log.Log.$anonfun$loadSegments$3(Log.scala:724)
    at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
    at kafka.log.Log.retryOnOffsetOverflow(Log.scala:2346)
    at kafka.log.Log.loadSegments(Log.scala:724)
    at kafka.log.Log.<init>(Log.scala:298)
    at kafka.log.Log$.apply(Log.scala:2480)
    at kafka.log.LogManager.loadLog(LogManager.scala:283)
    at kafka.log.LogManager.$anonfun$loadLogs$12(LogManager.scala:353)
    at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:65)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

It looks like the TRACE log level is active for the Kafka brokers, but I'm not sure how to change it. I tried KAFKA_LOG4J_ROOT_LOGLEVEL:

cp-kafka:
  persistence:
    size: 10Gi
  customEnv:
    KAFKA_LOG4J_ROOT_LOGLEVEL: WARN

but it does not make any difference

How can I change the log level or enable log rotation?
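For reference, the space under /opt/kafka/data-0 is taken up by topic log segments, which are bounded by the broker retention settings rather than the log4j level, so lowering the application log level alone is unlikely to stop the growth. Below is a minimal sketch of doing both through configurationOverrides, assuming the chart maps each dotted key to a KAFKA_* environment variable; the exact log4j key is an assumption to verify against the chart templates.

cp-kafka:
  persistence:
    size: 10Gi
  configurationOverrides:
    # assumed to map to KAFKA_LOG4J_ROOT_LOGLEVEL; verify against the chart's statefulset template
    "log4j.root.loglevel": "WARN"
    # bounds the topic data under /opt/kafka/data-0
    "log.retention.hours": "24"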

romanlv avatar Mar 16 '20 15:03 romanlv

Looks like setting log.retention should help

will try

cp-kafka:
  configurationOverrides:
    "log.retention.hours": 24

romanlv avatar Mar 16 '20 16:03 romanlv

The question is: why is it filling up space? This is happening with a default install, and nothing is actually using the cluster aside from itself.

zulrang avatar Apr 20 '21 00:04 zulrang

Also experiencing this issue; the retention-hours override above doesn't seem to have worked.

BenMemi avatar Apr 21 '21 03:04 BenMemi

Yep. I even tried changing all the topics to a 1 GB retention as well, and it still fills up after a couple of days.

zulrang avatar Apr 25 '21 00:04 zulrang

I just deployed a cluster via the operator, which I guess amounts to the same thing as deploying it with the charts. It ran out of disk immediately. I re-ran the deployment for Kafka only, like:

cat confluent-kafka-only.yml
apiVersion: platform.confluent.io/v1beta1
kind: Kafka
[...]
configOverrides:
  server:
    - log.retention.hours=4

This was to change the retention to something smaller; however, it won't clean up the existing storage, which is already exhausted.

Can I get any help, like some guidance on how to clean up the filled-up log space?

I may end up redeploying the whole thing with the overrides above, but it seems like others have the same problem even with this flag enabled.

Any help much appreciated.

At pod boot time I get:

[ERROR] 2021-08-26 15:45:00,617 [pool-7-thread-1] kafka.server.LogDirFailureChannel error - Error while writing to checkpoint file /mnt/data/data0/logs/_confluent_balancer_broker_samples-13/leader-epoch-checkpoint
java.io.IOException: No space left on device
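A note on recovery, since lowering retention only removes closed segments once the broker is running again: the usual first step is to give the volume more headroom (for example by expanding the PVC, if the storage class allows it) and then let tighter retention shrink the data back down. Below is a sketch of a fuller server section for the operator-managed broker, with illustrative values:

apiVersion: platform.confluent.io/v1beta1
kind: Kafka
[...]
configOverrides:
  server:
    - log.retention.hours=4
    - log.retention.bytes=536870912            # enforced per partition
    - log.segment.bytes=134217728              # smaller segments so old data becomes deletable sooner
    - log.retention.check.interval.ms=300000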

hextrim avatar Aug 26 '21 15:08 hextrim

We are facing the same issue using chart version 0.6.1. Log retention is not addressing it; increasing the PVC size just delays the "no space left on device" error.

PiePra avatar Oct 08 '21 08:10 PiePra

I'm still seeing this issue. When I exec into the Kafka broker pod, the file in question (opt/kafka/data-0...) does not even exist. Why does it report "out of space" when the file in question is not even there? BTW, I have all the log retention settings correct, and they show up in Confluent Control Center as expected (i.e. 1-hour retention, 1M size limit, etc.). It's as if the Kafka log retention code is not working at all.

nlonginow avatar Nov 07 '21 13:11 nlonginow

Seeing this same issue; has anyone made progress on this? We've tried overriding the log retention using both time and byte size, with no luck.

payneBrandon avatar Jan 03 '22 17:01 payneBrandon

What you want to do is change the log cleanup policy to delete. That fixes the issue. I can drop my config file here if needed.
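For anyone reading along, a minimal sketch of what that looks like as a chart override (this is not the poster's actual config file):

cp-kafka:
  configurationOverrides:
    "log.cleanup.policy": "delete"   # broker-wide default; topics with an explicit cleanup.policy keep their own
    "log.retention.hours": "24"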

BenMemi avatar Jan 03 '22 19:01 BenMemi

@BenM-Mycelium thanks for the response! Do you mean setting something like "log.cleanup.policy": "delete"?

payneBrandon avatar Jan 03 '22 20:01 payneBrandon

Yes correct

BenMemi avatar Jan 03 '22 20:01 BenMemi

Testing this out now, thanks again for the help! For anyone else peeking in, an additional setting we're using that I didn't fully understand is log.retention.bytes. When I looked at the documentation more closely, this limit is enforced at the partition level, not the topic level. For my project we're using 8 partitions (so 8x the limit I anticipated), which left my disk woefully undersized. I'll let this run for a bit to see whether the delete policy functions as expected.
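A rough sizing sketch for that point, with illustrative numbers: the byte limit applies to each partition and only to closed segments, so the worst case per topic is roughly partitions x (retention.bytes + one active segment).

# illustrative math: log.retention.bytes = 1 GiB, 8 partitions
#   worst case per topic ~ 8 x (1 GiB retained + up to log.segment.bytes in the active segment)
#   with the default 1 GiB segment size that is roughly 8 x 2 GiB = 16 GiB per replica
cp-kafka:
  configurationOverrides:
    "log.retention.bytes": "1073741824"   # 1 GiB per partition
    "log.segment.bytes": "268435456"      # 256 MiB; tightens the worst case and lets deletion kick in sooner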

payneBrandon avatar Jan 03 '22 20:01 payneBrandon

How did you go, out of interest?

BenMemi avatar Jan 05 '22 04:01 BenMemi

> How did you go out of interest?

Hey @BenM-Mycelium, I'm only just seeing this reply, sorry about that. I ended up boosting the disk size quite a bit and setting a short expiration (10 minutes) with a 0.25GB log.retention.bytes setting. At this point things are up and running, and I can see the topics level off at an appropriate size.
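For completeness, here is the combination described above expressed as broker-level overrides (values approximate; the same limits can also be set per topic as retention.ms / retention.bytes):

cp-kafka:
  configurationOverrides:
    "log.retention.ms": "600000"          # 10-minute expiration
    "log.retention.bytes": "268435456"    # ~0.25 GiB per partition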

payneBrandon avatar Feb 04 '22 22:02 payneBrandon