datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Add a default `FileStatisticsCache` implementation for the `ListingTable`

Open alamb opened this issue 1 month ago • 2 comments

Is your feature request related to a problem or challenge?

Right now, when a table is created via CREATE EXTERNAL TABLE, the underyling ListingTable reads and parses the statistics for each file. If the same table is queried, the cached statistics are re-read. However, if the same files are queried again, the statistics are re-read.

-- has to read remote storage and calculate statistics
select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
Elapsed 0.092 seconds.
-- re-calculates the same statistics
select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
Elapsed 0.092 seconds.

You can see the statistics are recalculated with the timings 0.092s

If you use a CREATE EXTERNAL TABLE, which caches the statistics, it take sonly 0.025 seconds:

CREATE EXTERNAL TABLE hits stored as parquet location 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
> select count(*) from hits;
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
1 row(s) fetched.
Elapsed 0.029 seconds.

In addition, since a new cache is created for each table, the great function statistics_cache added by @nuno-faria in https://github.com/apache/datafusion/pull/19054 doesn't show anything:

> CREATE EXTERNAL TABLE hits stored as parquet location 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
0 row(s) fetched.
Elapsed 0.405 seconds.

> select * from statistics_cache();
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
| path | file_modified | file_size_bytes | e_tag | version | num_rows | num_columns | table_size_bytes | statistics_size_bytes |
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
0 row(s) fetched.
Elapsed 0.015 seconds.

Describe the solution you'd like

I would like a session scoped FileStatisticsCache that is shared between statements / ListingTables , the same way the DefaultFilesMetadataCache is: created:

https://github.com/apache/datafusion/blob/57c0dda7b512154dd8ee9abe4f07d0462759d224/datafusion/execution/src/cache/cache_manager.rs#L165-L171

The code for ListingTable somewhat unobviously sets a DefaultFileStatisticsCache here https://github.com/apache/datafusion/blob/9f725d9c7064813cda0de0f87d115354b68d76e6/datafusion/catalog-listing/src/table.rs#L260-L259

Describe alternatives you've considered

No response

Additional context

We probably also need to add a limit, like this:

  • https://github.com/apache/datafusion/issues/19052

alamb avatar Dec 08 '25 18:12 alamb

take

mkleen avatar Dec 09 '25 07:12 mkleen

Does it make sense to only create and store the FileStatisticsCache in the CacheManager and remove the creation of FileStatisticsCache from the ListingTable completely and soley reference it with with_cache? I think it's easier to reason about it.

mkleen avatar Dec 10 '25 12:12 mkleen

Does it make sense to only create and store the FileStatisticsCache in the CacheManager and remove the creation of FileStatisticsCache from the ListingTable completely and soley reference it with with_cache? I think it's easier to reason about it.

Yes, this would be my preference too

It is also consistent with what the rest of the system does

alamb avatar Dec 15 '25 13:12 alamb