duckdb [CSV Sniffer] Tweaking header detection

Open pdet opened this issue 1 year ago • 1 comments

This change affects how header detection works for CSV files. The previous algorithm preferred false negatives over false positives, which caused us to miss headers in many CSV files. For example:

Single-row files:

creationDate, Id

Files with types that auto-detect does not recognize:

Value
"68,527.00"

Files with borked types:

Date
02/01/2019
08//01/2019

All-varchar files:

name
Pedro

The issue here is that most sane CSV files actually do have a header, so I changed the algorithm to detect the header correctly in all of the cases above.

This change increases our accuracy in many of our tests. I believe that the only situation where our header detection will fail is when we have an all-varchar CSV file where the first row is not a header. For example:

Pedro;~29
Mark; >30

Because all columns of the CSV file are varchar, we will wrongly detect Pedro;~29 as the header.
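
To make the cases above concrete, here is a toy sketch of a header heuristic with the same bias toward detecting a header. It is not the sniffer's actual implementation: the function names (`infer_type`, `column_types`, `looks_like_header`), the three-type lattice, and the exact rules are assumptions made purely for illustration.

```python
# Toy header heuristic (illustrative only, NOT DuckDB's sniffer code).
# All names and rules below are hypothetical.

def infer_type(value: str) -> str:
    """Return the narrowest toy type that can hold a single cell."""
    try:
        int(value)
        return "BIGINT"
    except ValueError:
        pass
    try:
        float(value)
        return "DOUBLE"
    except ValueError:
        return "VARCHAR"


def column_types(rows: list[list[str]]) -> list[str]:
    """Combine per-cell types into one type per column."""
    result = []
    for column in zip(*rows):
        cell_types = {infer_type(cell) for cell in column}
        if "VARCHAR" in cell_types:
            result.append("VARCHAR")
        elif "DOUBLE" in cell_types:
            result.append("DOUBLE")
        else:
            result.append("BIGINT")
    return result


def looks_like_header(rows: list[list[str]]) -> bool:
    """Guess whether the first row is a header, preferring 'yes'."""
    if not rows:
        return False
    first_types = [infer_type(cell) for cell in rows[0]]
    first_is_text = all(t == "VARCHAR" for t in first_types)
    data = rows[1:]
    if not data:
        # Single-row file: a row made only of text is taken as the header.
        return first_is_text
    types = column_types(data)
    if all(t == "VARCHAR" for t in types):
        # All-varchar file: assume a header. This is the known failure case.
        return first_is_text
    # Otherwise the first row is a header if it does not fit the data types.
    return any(ft == "VARCHAR" and t != "VARCHAR"
               for ft, t in zip(first_types, types))
```

Under these assumptions, the single-row, undetected-type, borked-type and all-varchar examples all report a header, and so does the Pedro;~29 file, which is exactly the false positive described above. A file such as 1,2 followed by 3,4 reports no header, because the first row fits the numeric column types.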

cc: @tdoehmen

Feb 16 '24 14:02 pdet

Edit:

I changed a lot of tests to remove the now-unnecessary header = true options, and added header = false options where they are now needed.

Feb 16 '24 14:02 pdet

Thanks!

Feb 27 '24 12:02 Mytherin
