Make `DefaultListFilesCache` work better with prefixed paths
Is your feature request related to a problem or challenge?
The current implementation of the DefaultListFilesCache stores and retrieves entries from the cache using a provided Path as the key:
https://github.com/apache/datafusion/blob/c1aa1b530ab2fa73efcdeb8896dbb50c30c492f0/datafusion/execution/src/cache/list_files_cache.rs#L147
When using tables that have partitions, DataFusion will attempt to list files for a specific prefix if a user's query filters can be evaluated to exact, known partition values. E.g.
select * from my_table where a=1
will use my_table/a=1/ as the Path if that partition exists.
In these scenarios, it's possible that the key for my_table with all of the files backing the table already exists in the cache, however the DefaultListFilesCache would not be able to fetch data for my_table/a=1/ because they keys would not match.
A cache miss in this scenario is undesirable for two reasons:
- DataFusion will execute a
Listrequest to backing storage to fetch a key that already exists in the cache - DataFusion will add
my_table/a=1/as a key to the cache, duplicating data in the cache
Describe the solution you'd like
I would like to enhance the DefaultListFilesCache to be "prefix aware" when attempting to fetch data.
The cache infrastructure currently allows a get_with_extra method:
https://github.com/apache/datafusion/blob/c1aa1b530ab2fa73efcdeb8896dbb50c30c492f0/datafusion/execution/src/cache/list_files_cache.rs#L269
I think it should be possible to define type Extra = Path (or perhaps Vec<PathPart>?) where the extra parameter could represent the prefix, and the standard key parameter can represent the base table_url. This should allow the DefaultListFilesCache to find and filter entries for a table to the requested path prefix.
Care would need to be taken to ensure that adding prefixed data to the cache does not return incomplete results for subsequent queries to the table.
Describe alternatives you've considered
It's possible using a different keying mechanism for the cache entirely could work as well. There's likely a potential solution that uses key: Vec<PathPart> or something similar and have the cache itself internalize the management of understanding when entries may or may not match. The difficulty here would likely be efficiently determining which parts of a path belong to a table vs a prefix.
Additional context
No response
take
@BlakeOrth
ok so should it be like
making the DefaultListFilesCache "prefix-aware" by modifying the get_with_extra method to accept the table's base path as the Extra parameter. And now when a cache lookup occurs for a partition-specific prefix (for say, table/a=1/), the cache will-
- Try an exact key match (current impl)
- If that fails, check if a parent prefix (the table base path passed as Extra) exists in the cache
and if found, then will filter the cached
ObjectMetaentries to only return files whose location starts with the requested prefix. - And finally return the filtered results without making a storage call.
Also would request @Jefffrey @martin-g @alamb to guide a bit 😅 !
@Yuvraj-cyborg I think you generally have the idea right!
modifying the get_with_extra method to accept the table's base path as the Extra parameter
In my mind this should be the other way around. I think key should always be the table's base path, and extra should be the prefix. This way key remains the same between get and get_with_extra and the prefix becomes an optional addition to a cache lookup. This also generally remains consistent with how the actual listing calls are structured here:
https://github.com/apache/datafusion/blob/bde16083ad344b7a52db5cb298a15d7434ffde51/datafusion/datasource/src/url.rs#L238
This should also allow us to (internally) always call get_with_extra from get to avoid some code duplication.
Using the above makes the cache lookup subtly different from what you've described, where a lookup for the key will always return the entry for the table (if it exists), and then the contents extra can be used to further filter the results returned by the cache. So to use your example:
get_with_extra(key="my_table", extra="a=1")
- Fetch the cache entry for
my_table - Discard any entries that do not match
my_table/a=1 - Return filtered results without making a storage call