cassava icon indicating copy to clipboard operation
cassava copied to clipboard

Add a quoting option for decoding

Open w9 opened this issue 9 years ago • 8 comments

Correct me if I'm wrong, but I think there's a quoting option for encoding, but none for decoding? I just encountered a case (TSV) where there are double quote character " in the middle of an entry, like this

col1    col2
----    -----------------
XXXX    xxxx, "xxx", xxxx
YYYY    yyyy, "yyy", yyyy

and Cassava halts after the first encounter of ".

w9 avatar Oct 02 '15 04:10 w9

There could perhaps be an option for quoting on decoding, but the above isn't valid CSV due to the extra whitespace.

tibbe avatar Oct 02 '15 07:10 tibbe

I agree that this is not strictly following the CSV standard, but unfortunately this is the case for a lot of pseudo-TSV files in the wild. In most cases the file is not ambiguous, as the entries are separated by \t. I work with a lot of medical data, and if there's a column for "description", then there will be all sorts of punctuation --- including quotes. As an example,

gene     description
CFTR     Also known as "ABCC7", it is ... 
TP53     Also known as "P53", it is ... 
...      ...

In any cases, I think the option of ignoring quotes, or at least the ability to set the quoting character, is essential to any CSV handling library.

w9 avatar Oct 02 '15 20:10 w9

Encountered this as well, getting parser error. Apparently this is still unresolved. Would you accept a PR to fix this or add an option?

But first, I'm not entirely sure this isn't a bug. In particular it seems weird to me that the parser for unescaped fields stops on double quote. Naively it should continue until the next delim or newline.

The offending line for me is: 14;14;04;5;01;1;Mazewo Dworskie"A";0120230;0120230;2020-01-01

Which produces:

parse error (Failed reading: conversion error: cannot unpack array of length 7 into a 7-tuple. Input record: [\"14\",\"14\",\"04\",\"5\",\"01\",\"1\",\"Mazewo Dworskie\"]) at \"A\";0120230;0120230;2020-01-01\r\n14;14;04;5;01;1;Mazewo Dworskie\"B\";0120247;0120247;2020-01-01\r\n14;14 (truncated)

Is there a valid reason for the parser to fail on this? The field doesn't start with a " so there is no reason to give special meaning to " later on. Is there?

Edit: I can open the file in LibreOffice without issues selecting field delimiter ; and string delimiter ".

mulderr avatar Jan 11 '20 20:01 mulderr

Turns out the issue is not whether one can do it but whether one should because it violates RFC4180:

If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

Since I do not have control over the input file I will probably need to look for a more general purpose library.

mulderr avatar Jan 11 '20 21:01 mulderr

@mulderr out of curiosity, has the .csv file been generated by a popular/mainstream service or library? or is this more niche?

hvr avatar Jan 12 '20 09:01 hvr

The file is part of an official register containing units of territorial division, cities and streets in Poland called TERYT. You can find it here. It's divided into 3 parts, the line above was copied from SIMC. I have no knowledge what software is used to generate it though.

I would say it's rather niche but a lot of larger organizations related to government, healthcare etc. would either want to or are downright required to use that data.

I suspect the file is specifically tweaked to be viewed with Excel. Note the semicolons and lack of quoting.

The same data is published in xml as well so I ended up using that.

mulderr avatar Jan 12 '20 12:01 mulderr

I came across this issue today myself. I believe it is an MS Excel weirdness that it does not conform to the RFC4180 rules correctly and doesn't always quote strings which contain ". Specifically this line https://github.com/haskell-hvr/cassava/blob/5996d4f6c6a88dbb77788d7583ed6f1c1ed90b38/src/Data/Csv/Parser.hs#L170 terminates and unquoted field if it comes across a double quote, however this is not necessary as far as I can see because we already know the field is an unescaped field, so we can take until the end of the field.

I do not see any problems with removing this check from the lambda but I have not run tests. I am considering forking the repo and providing a pull request in order to avoid the bug in my own code.

robwithhair avatar Mar 10 '20 14:03 robwithhair

Above pull request with passing tests. I believe although correct RFC4180 is to have these fields quoted, it is better to parse the field if we can and avoid misleading error message.

robwithhair avatar Mar 10 '20 15:03 robwithhair