The value of the kafka_stream_block_cache_size_bytes metric is different from the value of the s3.block.cache.size setting

Open jerome-j-20230331 opened this issue 6 months ago • 5 comments

We configured s3.block.cache.size=10737418240 (10GB) in server.properties, but the value of the kafka_stream_block_cache_size_bytes metric reported by the S3 metrics exporter differs from the configured s3.block.cache.size. The metric stays below 1GB and keeps fluctuating. I would like to ask why this is the case.

jerome-j-20230331 avatar Jul 09 '25 02:07 jerome-j-20230331

  • The s3.block.cache.size setting is the maximum amount of data the BlockCache is allowed to hold.
  • The BlockCache only keeps data that is unread or has been read ahead; once a block has been read, it is dropped as no longer useful (see the illustrative sketch below).
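
For illustration, here is a minimal, hypothetical sketch of a read-ahead cache that drops blocks once they are consumed. The class and method names are made up for this example and are not AutoMQ's actual DataBlockCache; the point is only that a size gauge on such a cache tracks the unconsumed data currently in flight, not the configured maximum.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical read-ahead cache sketch, NOT AutoMQ's DataBlockCache.
// It only shows why the size gauge stays far below the configured maximum:
// the cache holds pre-read blocks until they are consumed, then drops them.
public class ReadAheadBlockCacheSketch {
    private final long maxSizeBytes;                                // analogous to s3.block.cache.size
    private long sizeBytes;                                         // what a size gauge would report
    private final Map<Long, byte[]> blocks = new LinkedHashMap<>(); // blockId -> pre-read data, insertion order

    public ReadAheadBlockCacheSketch(long maxSizeBytes) {
        this.maxSizeBytes = maxSizeBytes;
    }

    /** Called when a block is read ahead from object storage. */
    public synchronized void put(long blockId, byte[] data) {
        // Evict the oldest unread blocks when the cap would be exceeded.
        Iterator<Map.Entry<Long, byte[]>> it = blocks.entrySet().iterator();
        while (sizeBytes + data.length > maxSizeBytes && it.hasNext()) {
            Map.Entry<Long, byte[]> evicted = it.next();
            it.remove();
            sizeBytes -= evicted.getValue().length;
            System.out.println("WARN unread block evicted, consider a larger cache: " + evicted.getKey());
        }
        blocks.put(blockId, data);
        sizeBytes += data.length;
    }

    /** Called when the consumer reads a block: the data is returned and dropped from the cache. */
    public synchronized byte[] read(long blockId) {
        byte[] data = blocks.remove(blockId);
        if (data != null) {
            sizeBytes -= data.length; // already-read data is no longer useful, so the gauge shrinks
        }
        return data;
    }

    /** What a size gauge on this cache would observe at any moment. */
    public synchronized long sizeBytes() {
        return sizeBytes;
    }
}
```

With this behavior, the gauge only grows while read-ahead outpaces the consumer and shrinks again as blocks are consumed, which would explain a fluctuating value well below the configured 10GB.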

superhx avatar Jul 10 '25 02:07 superhx

@superhx Thanks for your reply. If that is the case, then this problem is even stranger. The kafka_stream_block_cache_size_bytes metric has never exceeded the 10GB we set, yet warnings like the following keep appearing in the log:

[2025-07-11 01:28:46,776] WARN [SUPPRESSED_TIME=27] The unread block is evicted, please increase the block cache size (com.automq.stream.s3.cache.blockcache.StreamReader)

[2025-07-11 01:26:15,596] WARN [SUPPRESSED_TIME=28] The unread block is evicted, please increase the block cache size (com.automq.stream.s3.cache.blockcache.StreamReader)

[2025-07-11 01:04:45,683] WARN [SUPPRESSED_TIME=28] The unread block is evicted, please increase the block cache size (com.automq.stream.s3.cache.blockcache.StreamReader)

jerome-j-20230331 avatar Jul 11 '25 01:07 jerome-j-20230331

@jerome-j-20230331 A cached DataBlock is evicted after (createTimestamp + 1min). So the warning may be caused by consumers reading too slowly to keep up with the data that is pre-read from S3.

The current log message is a little misleading. The BlockCache should log a different message for each DataBlock eviction cause:

  • Cache size isn't enough
  • DataBlock TTL is reached

https://github.com/AutoMQ/automq/blob/0c1a1964194ee42aca8ac4890b617dae55027af1/s3stream/src/main/java/com/automq/stream/s3/cache/blockcache/DataBlockCache.java#L236-L258

You are welcome to submit a PR to fix it.
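
As a starting point, here is a rough, hypothetical sketch of how the eviction log could distinguish the two causes; the enum and method names below are assumptions for illustration and do not match the actual DataBlockCache code linked above.

```java
// Rough sketch only: names are hypothetical, not the real DataBlockCache API.
class EvictionLoggerSketch {
    enum EvictReason { CACHE_FULL, TTL_EXPIRED }

    void onEvict(long blockId, boolean unread, EvictReason reason) {
        if (!unread) {
            return; // evicting data that has already been read is expected; no warning needed
        }
        switch (reason) {
            case CACHE_FULL:
                System.out.printf("WARN unread block %d evicted: cache is full,"
                        + " please increase the block cache size%n", blockId);
                break;
            case TTL_EXPIRED:
                System.out.printf("WARN unread block %d evicted: TTL expired,"
                        + " the consumer may be reading too slowly%n", blockId);
                break;
        }
    }
}
```

Separating the two messages would make it clear whether increasing s3.block.cache.size will actually help, or whether the consumer is simply lagging behind the read-ahead.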

superhx avatar Jul 11 '25 02:07 superhx

@superhx Thank you very much for your answer. In fact, we are indeed facing a significant consumer lag problem, and we have been trying to adjust the block cache size to mitigate it. Because of the log output, we always assumed the cause was insufficient memory. But I am just an operations and maintenance engineer and not good at Java programming, so I am sorry that I cannot contribute a fix to the community.

jerome-j-20230331 avatar Jul 14 '25 06:07 jerome-j-20230331

The consumer lag should be fixed on the consumer side. You can refer to this blog to see how the BlockCache performs.

superhx avatar Jul 14 '25 06:07 superhx