duckdb
[CSV Sniffer] Tweaking header detection
This change affects how header detection works for CSV files. The previous algorithm preferred false negatives over false positives, which led us to miss headers in many different CSV files. For example:
- Single Row Files
creationDate, Id
- Files with undetected types by auto-detect:
Value
"68,527.00"
- Files with borked types:
Date
02/01/2019
08//01/2019
- All Varchar
name
Pedro
The issue here is that most sane CSV files actually do have a header, so I changed the algorithm to detect a header in all of these cases.
This change increases our accuracy in many of our tests. I believe that the only situation where our header detection will fail is when we have an all-varchar CSV file where the first row is not a header. For example:
Pedro;~29
Mark; >30
Because all columns of the CSV file are varchar, we will wrongly detect `Pedro;~29` as the header.
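To make the trade-off concrete, here is a minimal Python sketch of a header heuristic of this kind. It is not the sniffer's actual implementation: the function names (`infer_type`, `column_type`, `detect_header`), the coarse type set, and the exact rule (treat the first row as a header on a type mismatch, or when every column is varchar and the first row is all text) are assumptions made for illustration only.

```python
from datetime import datetime


def infer_type(value: str) -> str:
    """Return a coarse type name for one CSV field (hypothetical helper)."""
    text = value.strip()
    try:
        int(text)
        return "BIGINT"
    except ValueError:
        pass
    try:
        float(text)
        return "DOUBLE"
    except ValueError:
        pass
    try:
        datetime.strptime(text, "%d/%m/%Y")
        return "DATE"
    except ValueError:
        return "VARCHAR"


def column_type(values) -> str:
    """Combine per-field types; anything inconsistent falls back to VARCHAR."""
    types = {infer_type(v) for v in values}
    if len(types) == 1:
        return types.pop()
    if types == {"BIGINT", "DOUBLE"}:
        return "DOUBLE"
    return "VARCHAR"


def detect_header(rows) -> bool:
    """Guess whether rows[0] is a header (assumed rule, rectangular input)."""
    if not rows:
        return False
    first, body = rows[0], rows[1:]
    # Column types come from the rows after the first one; with no such
    # rows, every column is treated as VARCHAR.
    col_types = [column_type(col) for col in zip(*body)] if body else ["VARCHAR"] * len(first)
    first_types = [infer_type(v) for v in first]
    for seen, expected in zip(first_types, col_types):
        fits = expected == "VARCHAR" or seen == expected or (seen == "BIGINT" and expected == "DOUBLE")
        if not fits:
            return True  # first row cannot be cast to the column type
    # Ambiguous case: all columns are VARCHAR. Prefer a false positive
    # and assume a header when the first row is all text.
    all_varchar = all(t == "VARCHAR" for t in col_types)
    return all_varchar and all(t == "VARCHAR" for t in first_types)


print(detect_header([["creationDate", " Id"]]))                      # True
print(detect_header([["Value"], ["68,527.00"]]))                     # True
print(detect_header([["Date"], ["02/01/2019"], ["08//01/2019"]]))    # True
print(detect_header([["name"], ["Pedro"]]))                          # True
print(detect_header([["1", "2"], ["3", "4"]]))                       # False
print(detect_header([["Pedro", "~29"], ["Mark", " >30"]]))           # True (the known false positive)
```

Under this assumed rule, the four example files above are all detected as having a header, while the all-varchar, headerless file is the one misdetection, which is why such a file needs an explicit `header = false`.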
cc: @tdoehmen
Edit: I changed a lot of tests to remove now-unnecessary `header = true` options, and added some now-needed `header = false` options.
Thanks!