datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Can I Filter Files During Multi-File Query?

Open bi-kai opened this issue 3 years ago • 2 comments

I have a folder containing .parquet files. The schemas of the files are the same, and the file names are distinguished by timestamp.

The SQL statement, select * from table where timestamp 1645172395000 and 1645272395000, takes a long time for query. I want to filter out the files that are not within the time range during query.

In other words, I want to use the multi-file query function, but filter out files whose timestamp is not within the query time range. Is it possible?

bi-kai avatar Aug 03 '22 13:08 bi-kai

I think the solution here would be to implement a specialized version of https://docs.rs/object_store/latest/object_store/trait.ObjectStore.html#tymethod.list where a function can be passed to determine which files get selected (based on file names).

@tustvold WDYT?

andygrove avatar Aug 05 '22 13:08 andygrove

So DataFusion has fairly mature support for predicate pruning such as you describe, in particular PruningPredicate.

Assuming the predicate is pushed down to the TableScan, something you can confirm by running EXPLAIN on the LogicalPlan, the following should happen automatically.

  • If using ListingTable as the catalog, any non-matching partitions will be filtered out
  • If using ListingTable and collect_stat is enabled, the files will be pruned at plan time based on their metadata
  • ParquetExec will prune the row groups based on the filter at execution time

So at least theoretically, this should already be being performed. Perhaps you might be able to clarify:

  • Do you have an actual catalog, or are you using ListingTable. In particular what TableProvider are you using?
  • Is your data partitioned at all? Is the TableProvider aware of this?
  • Do you have a catalog that can provide file-level metadata?
  • Where is you data stored, is it local disk, S3, something else?

think the solution here would be to implement a specialized version of https://docs.rs/object_store/latest/object_store/trait.ObjectStore.html#tymethod.list where a function can be passed to determine which files get selected (based on file names).

To me this sounds like a quirk of a very specific kind of data catalog, specifically one that relies on directory listing. We could potentially add some sort of support for file-name extraction to ListingTable, but it is unclear why this would benefit from being pushed down to ObjectStore.

tustvold avatar Aug 05 '22 14:08 tustvold

I think this ticket is not tracking anything -- it is too bad it is not a discussion so it may not be easy to find. However, I don't think it is actionable at this time so closing it

alamb avatar Oct 11 '23 20:10 alamb