
Add runtime config options for `list_files_cache_limit` and `list_files_cache_ttl`

Open delamarch3 opened this issue 3 weeks ago • 3 comments

Which issue does this PR close?

  • Closes #19056

Rationale for this change

Make the list file cache memory limit and TTL configurable via runtime config.

What changes are included in this PR?

  • Add ability to SET and RESET list_files_cache_limit and list_files_cache_ttl
  • list_files_cache_ttl expects a duration written as either 1m30s or 30 (I'm wondering whether it would be simpler to accept just a single unit?)
  • Add update_cache_ttl() to the ListFilesCache trait so we can update it from RuntimeEnvBuilder::build()
  • Add config entries
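To make the duration format concrete, here is a minimal sketch of a parser that accepts both forms mentioned above. It is not the code from this PR: the function name `parse_ttl` is hypothetical, and it assumes that a bare number such as 30 means seconds and that only the `m` and `s` units are allowed.

```rust
use std::time::Duration;

/// Hypothetical helper, not the PR's implementation.
/// Accepts strings such as "1m30s", "45s", "2m", or a bare "30".
/// Assumptions: a bare number is a count of seconds, and only the
/// `m` (minutes) and `s` (seconds) units are recognised.
pub fn parse_ttl(input: &str) -> Result<Duration, String> {
    let input = input.trim();
    if input.is_empty() {
        return Err("empty duration".to_string());
    }

    // Bare number: interpreted as seconds (assumption).
    if input.bytes().all(|b| b.is_ascii_digit()) {
        return input
            .parse::<u64>()
            .map(Duration::from_secs)
            .map_err(|e| format!("invalid number '{}': {}", input, e));
    }

    let mut total_secs: u64 = 0;
    let mut digits = String::new();

    for ch in input.chars() {
        match ch {
            '0'..='9' => digits.push(ch),
            'm' | 's' => {
                if digits.is_empty() {
                    return Err(format!("missing number before unit '{}'", ch));
                }
                let value: u64 = digits
                    .parse()
                    .map_err(|e| format!("invalid number '{}': {}", digits, e))?;
                digits.clear();
                // Convert the component to seconds, guarding against overflow.
                let secs = if ch == 'm' {
                    value
                        .checked_mul(60)
                        .ok_or_else(|| "duration overflow".to_string())?
                } else {
                    value
                };
                total_secs = total_secs
                    .checked_add(secs)
                    .ok_or_else(|| "duration overflow".to_string())?;
            }
            other => return Err(format!("unexpected character '{}'", other)),
        }
    }

    // A trailing number with no unit (e.g. "1m30") is rejected in this sketch.
    if !digits.is_empty() {
        return Err(format!("missing unit after '{}'", digits));
    }

    Ok(Duration::from_secs(total_secs))
}
```

The sketch is deliberately lenient about component order and repetition (it would accept 30s1m), which is one reason a single-unit format might be simpler to specify and validate.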

Are these changes tested?

Yes

Are there any user-facing changes?

update_cache_ttl() has been added to the ListFilesCache trait
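For readers wondering why a trait method is needed to change the TTL at build time, the following is a rough sketch of the general pattern, not the actual ListFilesCache definition. The trait `TtlCache`, the type `SimpleCache`, the `cache_ttl` getter, and the function `apply_ttl` are all hypothetical; the sketch assumes the cache is shared behind an `Arc`, so the setter takes `&self` and relies on interior mutability.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

/// Hypothetical stand-in for a cache trait with a TTL setter.
/// The real ListFilesCache signature is not shown in this thread.
pub trait TtlCache: Send + Sync {
    /// Replaces the TTL; `None` means entries never expire.
    fn update_cache_ttl(&self, ttl: Option<Duration>);
    /// Returns the TTL currently in effect.
    fn cache_ttl(&self) -> Option<Duration>;
}

/// Hypothetical cache that stores only its TTL, guarded by a mutex so
/// it can be changed through a shared reference.
pub struct SimpleCache {
    ttl: Mutex<Option<Duration>>,
}

impl SimpleCache {
    pub fn new() -> Self {
        Self { ttl: Mutex::new(None) }
    }
}

impl TtlCache for SimpleCache {
    fn update_cache_ttl(&self, ttl: Option<Duration>) {
        *self.ttl.lock().unwrap() = ttl;
    }

    fn cache_ttl(&self) -> Option<Duration> {
        *self.ttl.lock().unwrap()
    }
}

/// Hypothetical builder step: push the configured TTL into a cache
/// that was created elsewhere and handed over as a trait object.
pub fn apply_ttl(cache: &Arc<dyn TtlCache>, configured: Option<Duration>) {
    cache.update_cache_ttl(configured);
}
```

Because the builder only holds an `Arc<dyn ...>` to a cache it did not construct, a setter on the trait is one way to apply a runtime config value without knowing the concrete cache type.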

delamarch3 avatar Dec 05 '25 12:12 delamarch3

Hi @BlakeOrth @alamb, this is ready for review

delamarch3 avatar Dec 08 '25 12:12 delamarch3

@BlakeOrth Yep, you can run `RESET datafusion.runtime.list_files_cache_ttl` to set it back to None

delamarch3 avatar Dec 08 '25 20:12 delamarch3

Thanks @delamarch3 and @BlakeOrth -- I'll try and check this out soon

alamb avatar Dec 10 '25 12:12 alamb

This confused me for a while but it makes sense to me and will be fixed with

Ah sorry, should have been more clear, I meant in the datafusion-cli. I configured it like this to test it out:

diff --git a/datafusion-cli/src/main.rs b/datafusion-cli/src/main.rs
index de666fced..abed30ea3 100644
--- a/datafusion-cli/src/main.rs
+++ b/datafusion-cli/src/main.rs
@@ -23,6 +23,8 @@ use std::process::ExitCode;
 use std::sync::{Arc, LazyLock};

 use datafusion::error::{DataFusionError, Result};
+use datafusion::execution::cache::cache_manager::CacheManagerConfig;
+use datafusion::execution::cache::DefaultListFilesCache;
 use datafusion::execution::context::SessionConfig;
 use datafusion::execution::memory_pool::{
     FairSpillPool, GreedyMemoryPool, MemoryPool, TrackConsumersPool,
@@ -222,6 +224,11 @@ async fn main_inner() -> Result<()> {
     );
     rt_builder = rt_builder.with_object_store_registry(instrumented_registry.clone());

+    rt_builder = rt_builder.with_cache_manager(
+        CacheManagerConfig::default()
+            .with_list_files_cache(Some(Arc::new(DefaultListFilesCache::default()))),
+    );
+
     let runtime_env = rt_builder.build_arc()?;

     // enable dynamic file query

delamarch3 avatar Dec 14 '25 13:12 delamarch3

No worries -- I think it is a temporary situation so we should be good to go

alamb avatar Dec 14 '25 14:12 alamb

Thanks @delamarch3 and @BlakeOrth

alamb avatar Dec 16 '25 12:12 alamb