[CSV Sniffer] Tweaking header detection

Open pdet opened this issue 1 year ago • 1 comments

This change affects how header detection works for CSV files. The previous algorithm preferred false negatives over false positives, which caused us to miss headers in many CSV files. For example:

  1. Single-row files:
         creationDate, Id
  2. Files with types that auto-detect does not recognize:
         Value
         "68,527.00"
  3. Files with borked types:
         Date
         02/01/2019
         08//01/2019
  4. All-varchar files:
         name
         Pedro

The issue here is that most sane CSV files actually do have a header, so I changed the algorithm to detect the header in all of these cases.

This change increases our accuracy in many of our tests. I believe that the only situation where our header detection will fail is when we have an all-varchar CSV file where the first row is not a header. For example:

Pedro;~29
Mark; >30

Because all columns of the CSV file are varchar, we will wrongly detect Pedro;~29 as the header.
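
To make the trade-off concrete, here is a toy model of the rule in Python. It is only a sketch under a loud assumption: a first row counts as a header unless one of its cells parses as a number. The names looks_numeric and guess_has_header are hypothetical, and this is not the sniffer's actual implementation, which works with the detected column types rather than a numeric check.

```python
def looks_numeric(value: str) -> bool:
    """Return True if the cell parses as a plain number (toy stand-in for type detection)."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def guess_has_header(rows: list[list[str]]) -> bool:
    """Toy rule (an assumption, not DuckDB's code): the first row is a header
    unless at least one of its cells looks like data, i.e. parses as a number."""
    if not rows:
        return False
    return not any(looks_numeric(cell.strip()) for cell in rows[0])


# The four cases from the list above are all treated as having a header.
single_row = [["creationDate", "Id"]]
undetected_type = [["Value"], ["68,527.00"]]
borked_type = [["Date"], ["02/01/2019"], ["08//01/2019"]]
all_varchar = [["name"], ["Pedro"]]

# The known failure mode: an all-varchar file whose first row is data.
no_header_all_varchar = [["Pedro", "~29"], ["Mark", " >30"]]

# A file whose first row contains a number is not treated as having a header.
numeric_first_row = [["1", "Pedro"], ["2", "Mark"]]
```

With this rule, no_header_all_varchar is misclassified as having a header, which is exactly the case where an explicit header = false is needed.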

cc: @tdoehmen

pdet avatar Feb 16 '24 14:02 pdet

Edit:

I changed a lot of tests to remove now-unnecessary header = true options, and added some now-needed header = false options.

pdet avatar Feb 16 '24 14:02 pdet

Thanks!

Mytherin avatar Feb 27 '24 12:02 Mytherin