qsv
qsv copied to clipboard
BUG Incorrect delimiter in qsv sniff
Describe the bug QSV sniff from 0.102, 0.112, and 0.125 is incorrectly determining the delimiter for a specific file.
To Reproduce incorrect-delimiter.csv
$ qsv sniff file.csv
Path: ...file.csv
Sniff Timestamp: 2024-04-03T15:41:05.706837+00:00
Last Modified: 2024-04-03T15:18:29+00:00
Delimiter: E
Header Row: true
Preamble Rows: 0
Quote Char: none
Flexible: false
Is UTF8: true
Detected Mime Type: text/plain
Detected Kind: Other
Retrieved Size (bytes): 4,375
File Size (bytes): 4,375
Sampled Records: 52
Estimated: false
Num Records: 52
Avg Record Len (bytes): 76
Num Fields: 2
Stats Types: false
Fields:
0: Text Quarter,Completions N
1: Text W,Commencements,Completions > Commencements ,%YoY Completions ,%YoY Commencements ,YoY Completions ,YoY Commencements ,%QoQ Completions ,%QoQ Commencements ,QoQ Completions ,QoQ Commencements
Expected behavior
Comma as the delimiter. Columns the same as qsv stats
or qsv headers
.
$ qsv headers incorrect-delimiter.csv
1 Quarter
2 Completions NEW
3 Commencements
4 Completions > Commencements
5 %YoY Completions
6 %YoY Commencements
7 YoY Completions
8 YoY Commencements
9 %QoQ Completions
10 %QoQ Commencements
11 QoQ Completions
12 QoQ Commencements
$ qsv stats incorrect-delimiter.csv
field,type,sum,min,max,range,min_length,max_length,mean,stddev,variance,nullcount,sparsity
Quarter,String,,Q1 12,Q4 23,,5,7,,,,0,0
Completions NEW,Integer,64693,106,3799,3693,3,4,1244.0962,956.6243,915130.0484,0,0
Commencements,String,,"1,855",898,,2,5,,,,0,0
Completions > Commencements ,String,,FALSE,TRUE,,4,5,,,,0,0
%YoY Completions ,String,,-0.60%,92.80%,,0,7,,,,4,0.0769
%YoY Commencements ,String,,-11.20%,88.00%,,0,7,,,,4,0.0769
YoY Completions ,Float,12016.1609,-885.9696684,1487.868126,2373.8378,0,12,250.3367,462.0062,213449.7579,4,0.0769
YoY Commencements ,Integer,13255,-2823,3859,6682,0,5,276.1458,999.0011,998003.2912,4,0.0769
%QoQ Completions ,String,,-0.50%,9.26%,,0,7,,,,1,0.0192
%QoQ Commencements ,String,,-0.10%,91.50%,,0,8,,,,1,0.0192
QoQ Completions ,Float,3661.8907,-982.0637068,819.2936393,1801.3573,0,12,71.8018,334.6705,112004.3264,1,0.0192
QoQ Commencements ,Integer,3936,-2064,4618,6682,0,5,77.1765,923.7335,853283.5963,1,0.0192
Desktop (please complete the following information):
- OS: Ubuntu and MacOS
- qsv Version 102, 112 125
sniff
uses the qsv-sniffer crate, which in turn, is a fork of the csv-sniffer crate.
I ended up creating qsv-sniffer since it seems csv-sniffer was no longer being actively maintained as shown by the numerous unanswered issues.
But TBH, the Viterbi algorithm it uses to sniff and infer CSV schemas is still something I don't fully grok, so it trips up on certain CSVs.
It works well enough for "typical" CSVs and I've tweaked it enough to remove the panics, but the whole Viterbi inferencing part needs to be refactored.
FYI, my intent with qsv sniff
is to help power a next-gen harvester, that's why I've added all the other extra stuff (mime-type sniffing, range requests sniffing, etc.).
Perhaps, we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable?
I'll take a look on it -- We've apparently hit this before (according to a co-worker) and we've got a build of dp+ that basically ignores non-standard delimiters. So immediate issue is patched around.
So, I'm not good with rust (you've seen basically all of what I've ever done), but this confuses me: https://github.com/jqnatividad/qsv-sniffer/blob/master/src/sniffer.rs#L506 . We're calling this with all of the possible quote characters in character, delim is None. This for the "most common" case (csv with "), this regex (somewhat simplified) is hopefully looking for "[ ]*?,[ ]*"
. If there are no quotes in the csv, then I'm unclear what it's ever going to match, but apparently it was coming up with E?
The delimiter count here is only going to ever pick up a valid delimiter if there's a quote delim quote
pattern, which doesn't look like it's going to happen with this test file, and isn't necessarily going to be consistent unless it's a 'quote all' type of file. I suspect if the initial guess of the delimiter was good, then we'd probably get the Viterbi confirming the choice.
What if we look at a sample of likely delimiters. ,\t;|:
, and ran counts per line. even in the ignorance of quotes, there should be one that has a roughly consistent number of occurrences per line, with the min value likely to be approximately the number of columns. It would at least rule out ones that don't appear, or have large variations in the number of occurrences per line. That would probably work better for then figuring out a quote, because quotes should be either [quote]{2}
, or [quote][delim]
or [delim][quote]
, or it's not a valid quoting for the file.
So having a look at what python's csv.sniff is doing, the _guess_quote_and_delimiter
https://github.com/python/cpython/blob/main/Lib/csv.py#L273 is very similar, but covers all 4 possible quote patterns for a quoted field, not trying to find the possible cases for a delimiter between quotes.
It also includes a plain _guess_delimiter
https://github.com/python/cpython/blob/main/Lib/csv.py#L349 that's essentially what I was suggesting in the last para as a fallback to when guessing the quote and delimiter at the same time doesn't work.
Thanks for digging into this @EricSoroos !
Aligning qsv-sniffer's behavior with python's csv sniffer is the way to go!
Do you know if it uses the same Viterbi algorithm? If not, I'm inclined to just rewrite qsv-sniffer to just do a straight port of python's csv sniffer...
It doesn't look like it -- it just looks like a frequency based check. Viterbi looks like a general constrain satisfaction algorithm, so it's just one way to determine if a set of parameters is the most likely description of the data.
Awesome! I'll create a new branch on the qsv-sniffer crate then and start porting over the code.
Would be nice if we can co-maintain it as there's really nothing like python's csv sniffing in the Rust ecosystem beyond csv-sniffer and qsv-sniffer. Polars has a very fast multi-threaded, mem-mapped CSV reader, and it has some CSV schema sniffing smarts, but its not a general library that can be used separately without bringing in a lot of polars crates.
Here's the tracking issue for the EPIC (in more ways than one 😉 ) port of python's csv-sniff
https://github.com/jqnatividad/qsv-sniffer/issues/14
Perhaps, the best way to "fix" this is to NOT use the qsv-sniffer crate and just use the workhorse csv crate instead.
The new approach will just attempt to open the candidate file as a CSV file, and get the metadata that way.