perf(core): batch persistent meta when delete large scale segment

Open lifepuzzlefun opened this issue 6 months ago • 0 comments

log

[2024-07-22 10:30:21,309] ERROR Uncaught exception in scheduled task 'delete-file' (org.apache.kafka.server.util.KafkaScheduler)
java.lang.OutOfMemoryError: Java heap space
        at com.fasterxml.jackson.core.util.ByteArrayBuilder.toByteArray(ByteArrayBuilder.java:163)
        at com.fasterxml.jackson.databind.ObjectWriter.writeValueAsBytes(ObjectWriter.java:1164)
        at kafka.log.streamaspect.ElasticLogMeta.encode(ElasticLogMeta.java:53)
        at kafka.log.streamaspect.ElasticLogSegmentManager.asyncPersistLogMeta(ElasticLogSegmentManager.java:80)
        at kafka.log.streamaspect.ElasticLogSegmentManager$EventListener.onEvent(ElasticLogSegmentManager.java:153)
        at kafka.log.streamaspect.ElasticLogSegment.deleteIfExists(ElasticLogSegment.java:410)
        at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$5(LocalLog.scala:950)
        at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$5$adapted(LocalLog.scala:949)
        at kafka.log.LocalLog$$$Lambda$5148/0x0000000801ba09c0.apply(Unknown Source)
        at scala.collection.immutable.List.foreach(List.scala:334)
        at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$4(LocalLog.scala:949)
        at kafka.log.LocalLog$.deleteSegments$1(LocalLog.scala:739)
        at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$6(LocalLog.scala:956)
        at kafka.log.LocalLog$$$Lambda$5145/0x0000000801455d88.run(Unknown Source)
        at org.apache.kafka.server.util.KafkaScheduler.lambda$schedule$1(KafkaScheduler.java:150)
        at org.apache.kafka.server.util.KafkaScheduler$$Lambda$1237/0x000000080155c7c8.run(Unknown Source)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

This caused one of the brokers to run out of memory when the retention time of a 1 GB/s topic was changed from 3 days to 12 hours. After the deletion, the stored data dropped from 200TB to 18TB. The topic has 100 partitions, so the number of segments deleted at the same time was very large.

The OOM also caused the broker's KRaft ping packets to fail, so the broker remained fenced and could not recover.
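To illustrate the idea in the title, here is a minimal sketch of persisting the metadata once per delete batch instead of once per deleted segment. It is not AutoMQ's code: `BatchedMetaPersister`, `deleteOneByOne`, `deleteBatch`, and the `Consumer<byte[]>` sink are hypothetical names, and `toString()` stands in for the Jackson encoding seen in the stack trace.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical sketch: one metadata encode per delete batch, not per segment. */
public class BatchedMetaPersister {
    private final Set<Long> liveSegments = new LinkedHashSet<>();
    private final Consumer<byte[]> sink; // stand-in for the real metadata store
    private int encodeCount = 0;

    public BatchedMetaPersister(List<Long> initialSegments, Consumer<byte[]> sink) {
        this.liveSegments.addAll(initialSegments);
        this.sink = sink;
    }

    /** Per-segment pattern: the whole meta is re-encoded after every single delete. */
    public void deleteOneByOne(List<Long> toDelete) {
        for (Long baseOffset : toDelete) {
            liveSegments.remove(baseOffset);
            persist();
        }
    }

    /** Batched pattern: remove every segment first, then encode the meta once. */
    public void deleteBatch(List<Long> toDelete) {
        liveSegments.removeAll(toDelete);
        persist();
    }

    /** Encodes the full segment list; this is the allocation-heavy step. */
    private void persist() {
        encodeCount++;
        sink.accept(liveSegments.toString().getBytes(StandardCharsets.UTF_8));
    }

    public int encodeCount() {
        return encodeCount;
    }
}
```

With N segments deleted in one pass, the per-segment pattern produces N full encodings of a large segment list, while the batched pattern produces one. Whether this matches how the real fix should be structured is an assumption.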

lifepuzzlefun, Jul 30 '24 06:07