duckdb icon indicating copy to clipboard operation
duckdb copied to clipboard

[CSV Sniffer] Selecting file to sniff from Glob and List

Open pdet opened this issue 5 months ago • 1 comments

This PR implements an adjustment to the CSV Sniffer regarding selecting the file used for sniffing.

Note that the column names and types depend on the file used for sniffing.

The new rule checks if the read_csv function is given a list of files with a size > 1. If so, and if the first file is not a glob pattern, that file will be used as the sniffed file for the function.

For example, in all the cases below, a.csv would be used as the sniffed file:

For example, all options below would use a.csv as the sniffed file:

  • read_csv(['a.csv'])
  • read_csv(['a.csv', 'b_1.csv', 'b_2. csv'])
  • read_csv(['a.csv', 'b_*.csv'])

However, if the first path is a glob, we search the first 10 files for either a file larger than 1 MB or the largest file and use it for sniffing.

For example, the following options would trigger the search algorithm:

  • read_csv(['b_*.csv'])
  • read_csv(['b_*.csv', 'a.csv'])

Fix: https://github.com/duckdblabs/duckdb-internal/issues/1824 Related to: https://github.com/duckdb/duckdb/discussions/11909

pdet avatar Sep 02 '24 15:09 pdet