Add limit to `DefaultFileStatisticsCache`
- part of https://github.com/apache/datafusion/issues/19051
Also listing file statistics cache seems to not have any memory limit unlike metadata cache for example.
Is that by design , do you think we need to add similar limit for this cache too ?
Originally posted by @bharath-techie in https://github.com/apache/datafusion/issues/18971#issuecomment-3600061959
Basically the cache used in ListingTable comes from here: https://github.com/apache/datafusion/blob/81512da2b0aaa474f6c4ba205b05eea7b3095176/datafusion/core/src/datasource/listing_table_factory.rs#L188-L187
Which somewhat unobviously sets a DefaultFileStatisticsCache here
https://github.com/apache/datafusion/blob/9f725d9c7064813cda0de0f87d115354b68d76e6/datafusion/catalog-listing/src/table.rs#L260-L259
The DefaultFileStatisticsCache has no limit:
https://github.com/apache/datafusion/blob/7d8b8602ad1be2f61f6a8ebb253ace9d85304ea7/datafusion/execution/src/cache/cache_unit.rs#L41-L40
My suggestion would be to implement @nuno-faria 's ideas here: https://github.com/apache/datafusion/issues/18953#issuecomment-3598406492
And then once we had a real trait, add the ability to track memory used by Statistics and then use that measurement to implement a size limited cache
This would be similar in design to @BlakeOrth 's PR here
- https://github.com/apache/datafusion/pull/18855
I suggest a sequence of smaller PRs:
- Add a method to statistics like
Statistics::heap_sizethat reports the heap (allocation) size for statistics -- following the semantics of https://github.com/apache/arrow-rs/blob/b93fa52e47a83dff130709a6e80a4be5017dbf09/parquet/src/file/metadata/memory.rs#L35 - Update the
DefaultFileStatisticsCacheto have a memory limit and use the heap_size from 1 to constrain the size - Add a config parameter , perhaps
datafusion.runtime.statistics_cache_limit, similar todatafusion.runtime.metadata_cache_limit, and set the default statistics cache in https://github.com/apache/datafusion/blob/71fcd03d614c466fdc8820084dc2896f69775f2b/datafusion/execution/src/cache/cache_manager.rs#L140-L141 (following the example ofDefaultFilesMetadataCache)
It turns out @alchemist51 also pointed this out in
- https://github.com/apache/datafusion/pull/18855/files#r2560294003
I didn't understand at first what was meant
A little bit side topic. @alamb was looking into other caches, didn't understand why DefaultFileStatisticsCache isn't memory bound yet. Do we want to do it incrementally?
take
Dear @jizezhang I am already working on https://github.com/apache/datafusion/issues/19217, and this ticket is basically a requirement for it. I was wondering if it would be ok with you if I take it over. Let me know what you think.
However, I guess it's ok, we can merge it all together.
Hey @mkleen feel free to take it if that makes things easier.
@mkleen Could you try take this issue if you still plan to work on it? It looks like I cannot unassign myself...
take
Thank you @jizezhang