Add support for glob string in datafusion-cli query

Open a-agmon opened this issue 7 months ago • 2 comments

Partly closes #16303

Introduces a glob() table function that allows running queries over multiple files, for example:

 SELECT id FROM glob('s3://tests/data/file-a*.csv');
 SELECT id FROM glob('s3://tests/*/*.csv');

Note that the latter statement includes two glob levels (two wildcards), which works only if you set:

SET datafusion.execution.listing_table_ignore_subdirectory = false;
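To make the difference between the two patterns concrete, here is a minimal Python sketch of segment-wise glob matching. It only illustrates the matching semantics; it is not how datafusion-cli implements glob(), and the helper `glob_match` and the object keys are hypothetical. The assumption is that `*` matches within a single path segment and never crosses a `/`.

```python
from fnmatch import fnmatchcase


def glob_match(pattern: str, key: str) -> bool:
    """Match segment by segment so that '*' never crosses a '/' boundary."""
    pattern_parts = pattern.split("/")
    key_parts = key.split("/")
    # A pattern with N segments can only match a key with N segments.
    if len(pattern_parts) != len(key_parts):
        return False
    return all(fnmatchcase(k, p) for p, k in zip(pattern_parts, key_parts))


# Hypothetical object keys, for illustration only.
keys = [
    "s3://tests/data/file-a1.csv",
    "s3://tests/data/file-b1.csv",
    "s3://tests/other/file-a2.csv",
    "s3://tests/data/nested/file-a3.csv",
]

# One wildcard, in the file name only: a single directory is involved.
single_level = [k for k in keys if glob_match("s3://tests/data/file-a*.csv", k)]

# Two wildcards, one of them in a directory position: several directories
# under s3://tests/ have to be visited to find all matches.
two_level = [k for k in keys if glob_match("s3://tests/*/*.csv", k)]
```

Under these assumptions the first pattern selects only files inside `data/`, while the second also has to look inside sibling directories such as `other/`.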

Integration tests were added.

a-agmon avatar Jun 08 '25 17:06 a-agmon

@alamb - thank you very much for the generous comments. I appreciate it. Re naming - I completely agree. I was just wondering whether it's better to introduce one function that infers the file type (like read() or glob()) rather than a function for each file type (read_parquet, read_csv, etc). You are correct that the latter is more common, so I will go for this. Re the other comments - I will review and handle them. Thanks.

a-agmon avatar Jun 11 '25 05:06 a-agmon

rather than a function for each file type (read_parquet, read_csv, etc). You are correct that the latter is more common, so I will go for this.

I think the reason that DuckDB et al. use a function for each file type is that it simplifies option handling (there are many options that suit parquet but not csv).

That being said, adding a function like read_file(..) or read_data(...) that handled all file types might be a reasonable thing to do in datafusion-cli as then you could probably reuse most of the ListingTable code
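As a rough illustration of the trade-off, here is a Python sketch of what a single read_file-style entry point would have to do: infer the format from the path, then check the options against that format. This is not DataFusion or DuckDB code; the option names in `FORMAT_OPTIONS` and both helper functions are hypothetical.

```python
# Hypothetical per-format option names, for illustration only.
FORMAT_OPTIONS = {
    "csv": {"delimiter", "has_header"},
    "parquet": {"pruning"},
}


def infer_format(path: str) -> str:
    """Infer the file format from the extension of the last path segment."""
    file_name = path.rsplit("/", 1)[-1]
    _, dot, extension = file_name.rpartition(".")
    fmt = extension.lower()
    if not dot or fmt not in FORMAT_OPTIONS:
        raise ValueError(f"cannot infer file format from {path!r}")
    return fmt


def read_file(path: str, **options):
    """Generic entry point: options can only be validated after inference."""
    fmt = infer_format(path)
    unsupported = set(options) - FORMAT_OPTIONS[fmt]
    if unsupported:
        raise ValueError(f"{fmt} does not support options: {sorted(unsupported)}")
    # A real implementation would build a table here; the sketch just
    # returns the inferred format and the accepted options.
    return fmt, options
```

With one function per file type, the set of valid options is fixed by the function name, so the inference step and the runtime option check disappear.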

alamb avatar Jun 13 '25 14:06 alamb

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Aug 14 '25 02:08 github-actions[bot]