Add a default `FileStatisticsCache` implementation for the `ListingTable`
Is your feature request related to a problem or challenge?
Right now, when a table is created via CREATE EXTERNAL TABLE, the underyling ListingTable reads and parses the statistics for each file. If the same table is queried, the cached statistics are re-read. However, if the same files are queried again, the statistics are re-read.
-- has to read remote storage and calculate statistics
select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
Elapsed 0.092 seconds.
-- re-calculates the same statistics
select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
Elapsed 0.092 seconds.
You can see the statistics are recalculated with the timings 0.092s
If you use a CREATE EXTERNAL TABLE, which caches the statistics, it take sonly 0.025 seconds:
CREATE EXTERNAL TABLE hits stored as parquet location 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
> select count(*) from hits;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row(s) fetched.
Elapsed 0.029 seconds.
In addition, since a new cache is created for each table, the great function statistics_cache added by @nuno-faria in https://github.com/apache/datafusion/pull/19054 doesn't show anything:
> CREATE EXTERNAL TABLE hits stored as parquet location 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
0 row(s) fetched.
Elapsed 0.405 seconds.
> select * from statistics_cache();
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
| path | file_modified | file_size_bytes | e_tag | version | num_rows | num_columns | table_size_bytes | statistics_size_bytes |
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
0 row(s) fetched.
Elapsed 0.015 seconds.
Describe the solution you'd like
I would like a session scoped FileStatisticsCache that is shared between statements / ListingTables , the same way the DefaultFilesMetadataCache is: created:
https://github.com/apache/datafusion/blob/57c0dda7b512154dd8ee9abe4f07d0462759d224/datafusion/execution/src/cache/cache_manager.rs#L165-L171
The code for ListingTable somewhat unobviously sets a DefaultFileStatisticsCache here https://github.com/apache/datafusion/blob/9f725d9c7064813cda0de0f87d115354b68d76e6/datafusion/catalog-listing/src/table.rs#L260-L259
Describe alternatives you've considered
No response
Additional context
We probably also need to add a limit, like this:
- https://github.com/apache/datafusion/issues/19052
take
Does it make sense to only create and store the FileStatisticsCache in the CacheManager and remove the creation of FileStatisticsCache from the ListingTable completely and soley reference it with with_cache? I think it's easier to reason about it.
Does it make sense to only create and store the
FileStatisticsCachein theCacheManagerand remove the creation ofFileStatisticsCachefrom theListingTablecompletely and soley reference it withwith_cache? I think it's easier to reason about it.
Yes, this would be my preference too
It is also consistent with what the rest of the system does