cassava
cassava copied to clipboard
Add a quoting option for decoding
Correct me if I'm wrong, but I think there's a quoting option for encoding, but none for decoding? I just encountered a case (TSV) where there are double quote character "
in the middle of an entry, like this
col1 col2
---- -----------------
XXXX xxxx, "xxx", xxxx
YYYY yyyy, "yyy", yyyy
and Cassava halts after the first encounter of "
.
There could perhaps be an option for quoting on decoding, but the above isn't valid CSV due to the extra whitespace.
I agree that this is not strictly following the CSV standard, but unfortunately this is the case for a lot of pseudo-TSV files in the wild. In most cases the file is not ambiguous, as the entries are separated by \t
. I work with a lot of medical data, and if there's a column for "description", then there will be all sorts of punctuation --- including quotes. As an example,
gene description
CFTR Also known as "ABCC7", it is ...
TP53 Also known as "P53", it is ...
... ...
In any cases, I think the option of ignoring quotes, or at least the ability to set the quoting character, is essential to any CSV handling library.
Encountered this as well, getting parser error. Apparently this is still unresolved. Would you accept a PR to fix this or add an option?
But first, I'm not entirely sure this isn't a bug. In particular it seems weird to me that the parser for unescaped fields stops on double quote. Naively it should continue until the next delim or newline.
The offending line for me is:
14;14;04;5;01;1;Mazewo Dworskie"A";0120230;0120230;2020-01-01
Which produces:
parse error (Failed reading: conversion error: cannot unpack array of length 7 into a 7-tuple. Input record: [\"14\",\"14\",\"04\",\"5\",\"01\",\"1\",\"Mazewo Dworskie\"]) at \"A\";0120230;0120230;2020-01-01\r\n14;14;04;5;01;1;Mazewo Dworskie\"B\";0120247;0120247;2020-01-01\r\n14;14 (truncated)
Is there a valid reason for the parser to fail on this? The field doesn't start with a "
so there is no reason to give special meaning to "
later on. Is there?
Edit:
I can open the file in LibreOffice without issues selecting field delimiter ;
and string delimiter "
.
Turns out the issue is not whether one can do it but whether one should because it violates RFC4180:
If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.
Since I do not have control over the input file I will probably need to look for a more general purpose library.
@mulderr out of curiosity, has the .csv file been generated by a popular/mainstream service or library? or is this more niche?
The file is part of an official register containing units of territorial division, cities and streets in Poland called TERYT. You can find it here. It's divided into 3 parts, the line above was copied from SIMC. I have no knowledge what software is used to generate it though.
I would say it's rather niche but a lot of larger organizations related to government, healthcare etc. would either want to or are downright required to use that data.
I suspect the file is specifically tweaked to be viewed with Excel. Note the semicolons and lack of quoting.
The same data is published in xml as well so I ended up using that.
I came across this issue today myself. I believe it is an MS Excel weirdness that it does not conform to the RFC4180 rules correctly and doesn't always quote strings which contain ". Specifically this line https://github.com/haskell-hvr/cassava/blob/5996d4f6c6a88dbb77788d7583ed6f1c1ed90b38/src/Data/Csv/Parser.hs#L170 terminates and unquoted field if it comes across a double quote, however this is not necessary as far as I can see because we already know the field is an unescaped field, so we can take until the end of the field.
I do not see any problems with removing this check from the lambda but I have not run tests. I am considering forking the repo and providing a pull request in order to avoid the bug in my own code.
Above pull request with passing tests. I believe although correct RFC4180 is to have these fields quoted, it is better to parse the field if we can and avoid misleading error message.