datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Make `DefaultListFilesCache` work better with prefixed paths

Open BlakeOrth opened this issue 4 weeks ago • 3 comments

Is your feature request related to a problem or challenge?

The current implementation of the DefaultListFilesCache stores and retrieves entries from the cache using a provided Path as the key: https://github.com/apache/datafusion/blob/c1aa1b530ab2fa73efcdeb8896dbb50c30c492f0/datafusion/execution/src/cache/list_files_cache.rs#L147

When using tables that have partitions, DataFusion will attempt to list files for a specific prefix if a user's query filters can be evaluated to exact, known partition values. E.g.

select * from my_table where a=1

will use my_table/a=1/ as the Path if that partition exists.

In these scenarios, it's possible that the key for my_table with all of the files backing the table already exists in the cache, however the DefaultListFilesCache would not be able to fetch data for my_table/a=1/ because they keys would not match.

A cache miss in this scenario is undesirable for two reasons:

  1. DataFusion will execute a List request to backing storage to fetch a key that already exists in the cache
  2. DataFusion will add my_table/a=1/ as a key to the cache, duplicating data in the cache

Describe the solution you'd like

I would like to enhance the DefaultListFilesCache to be "prefix aware" when attempting to fetch data.

The cache infrastructure currently allows a get_with_extra method: https://github.com/apache/datafusion/blob/c1aa1b530ab2fa73efcdeb8896dbb50c30c492f0/datafusion/execution/src/cache/list_files_cache.rs#L269

I think it should be possible to define type Extra = Path (or perhaps Vec<PathPart>?) where the extra parameter could represent the prefix, and the standard key parameter can represent the base table_url. This should allow the DefaultListFilesCache to find and filter entries for a table to the requested path prefix.

Care would need to be taken to ensure that adding prefixed data to the cache does not return incomplete results for subsequent queries to the table.

Describe alternatives you've considered

It's possible using a different keying mechanism for the cache entirely could work as well. There's likely a potential solution that uses key: Vec<PathPart> or something similar and have the cache itself internalize the management of understanding when entries may or may not match. The difficulty here would likely be efficiently determining which parts of a path belong to a table vs a prefix.

Additional context

No response

BlakeOrth avatar Dec 11 '25 00:12 BlakeOrth

take

Yuvraj-cyborg avatar Dec 11 '25 12:12 Yuvraj-cyborg

@BlakeOrth

ok so should it be like making the DefaultListFilesCache "prefix-aware" by modifying the get_with_extra method to accept the table's base path as the Extra parameter. And now when a cache lookup occurs for a partition-specific prefix (for say, table/a=1/), the cache will-

  • Try an exact key match (current impl)
  • If that fails, check if a parent prefix (the table base path passed as Extra) exists in the cache and if found, then will filter the cached ObjectMeta entries to only return files whose location starts with the requested prefix.
  • And finally return the filtered results without making a storage call.

Also would request @Jefffrey @martin-g @alamb to guide a bit 😅 !

Yuvraj-cyborg avatar Dec 11 '25 13:12 Yuvraj-cyborg

@Yuvraj-cyborg I think you generally have the idea right!

modifying the get_with_extra method to accept the table's base path as the Extra parameter

In my mind this should be the other way around. I think key should always be the table's base path, and extra should be the prefix. This way key remains the same between get and get_with_extra and the prefix becomes an optional addition to a cache lookup. This also generally remains consistent with how the actual listing calls are structured here: https://github.com/apache/datafusion/blob/bde16083ad344b7a52db5cb298a15d7434ffde51/datafusion/datasource/src/url.rs#L238

This should also allow us to (internally) always call get_with_extra from get to avoid some code duplication.

Using the above makes the cache lookup subtly different from what you've described, where a lookup for the key will always return the entry for the table (if it exists), and then the contents extra can be used to further filter the results returned by the cache. So to use your example:

get_with_extra(key="my_table", extra="a=1")

  1. Fetch the cache entry for my_table
  2. Discard any entries that do not match my_table/a=1
  3. Return filtered results without making a storage call

BlakeOrth avatar Dec 11 '25 18:12 BlakeOrth