datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Add limit to `DefaultFileStatisticsCache`

Open alamb opened this issue 1 month ago • 9 comments

  • part of https://github.com/apache/datafusion/issues/19051

Also listing file statistics cache seems to not have any memory limit unlike metadata cache for example.

Is that by design , do you think we need to add similar limit for this cache too ?

Originally posted by @bharath-techie in https://github.com/apache/datafusion/issues/18971#issuecomment-3600061959

Basically the cache used in ListingTable comes from here: https://github.com/apache/datafusion/blob/81512da2b0aaa474f6c4ba205b05eea7b3095176/datafusion/core/src/datasource/listing_table_factory.rs#L188-L187

Which somewhat unobviously sets a DefaultFileStatisticsCache here https://github.com/apache/datafusion/blob/9f725d9c7064813cda0de0f87d115354b68d76e6/datafusion/catalog-listing/src/table.rs#L260-L259

The DefaultFileStatisticsCache has no limit: https://github.com/apache/datafusion/blob/7d8b8602ad1be2f61f6a8ebb253ace9d85304ea7/datafusion/execution/src/cache/cache_unit.rs#L41-L40

alamb avatar Dec 02 '25 20:12 alamb

My suggestion would be to implement @nuno-faria 's ideas here: https://github.com/apache/datafusion/issues/18953#issuecomment-3598406492

And then once we had a real trait, add the ability to track memory used by Statistics and then use that measurement to implement a size limited cache

This would be similar in design to @BlakeOrth 's PR here

  • https://github.com/apache/datafusion/pull/18855

I suggest a sequence of smaller PRs:

  1. Add a method to statistics like Statistics::heap_size that reports the heap (allocation) size for statistics -- following the semantics of https://github.com/apache/arrow-rs/blob/b93fa52e47a83dff130709a6e80a4be5017dbf09/parquet/src/file/metadata/memory.rs#L35
  2. Update the DefaultFileStatisticsCache to have a memory limit and use the heap_size from 1 to constrain the size
  3. Add a config parameter , perhaps datafusion.runtime.statistics_cache_limit, similar to datafusion.runtime.metadata_cache_limit, and set the default statistics cache in https://github.com/apache/datafusion/blob/71fcd03d614c466fdc8820084dc2896f69775f2b/datafusion/execution/src/cache/cache_manager.rs#L140-L141 (following the example of DefaultFilesMetadataCache)

alamb avatar Dec 02 '25 20:12 alamb

It turns out @alchemist51 also pointed this out in

  • https://github.com/apache/datafusion/pull/18855/files#r2560294003

I didn't understand at first what was meant

A little bit side topic. @alamb was looking into other caches, didn't understand why DefaultFileStatisticsCache isn't memory bound yet. Do we want to do it incrementally?

alamb avatar Dec 02 '25 20:12 alamb

take

jizezhang avatar Dec 09 '25 17:12 jizezhang

Dear @jizezhang I am already working on https://github.com/apache/datafusion/issues/19217, and this ticket is basically a requirement for it. I was wondering if it would be ok with you if I take it over. Let me know what you think.

mkleen avatar Dec 09 '25 18:12 mkleen

However, I guess it's ok, we can merge it all together.

mkleen avatar Dec 09 '25 18:12 mkleen

Hey @mkleen feel free to take it if that makes things easier.

jizezhang avatar Dec 09 '25 21:12 jizezhang

@mkleen Could you try take this issue if you still plan to work on it? It looks like I cannot unassign myself...

jizezhang avatar Dec 10 '25 15:12 jizezhang

take

mkleen avatar Dec 10 '25 15:12 mkleen

Thank you @jizezhang

mkleen avatar Dec 10 '25 16:12 mkleen