csvlint

Problem with quotes in tsv files

Open boydkelly opened this issue 1 year ago • 4 comments

When linting tsv files, I get:

$ csvlint -delimiter='\t' build/neo_ex.tsv 
Warning: not using defaults, may not validate CSV to RFC 4180
Record #1035 has error: bare " in non-quoted-field

unable to parse any further

Record 1035 is shown below. But since this is TSV (chosen for this very reason), shouldn't quote characters be ignored entirely rather than treated as an error?

9010c36f-6958-48d9-ba2d-c50f65c8825d	dondon ko "ken ken kileri kɛ".	dyu	exm	dyuEx

boydkelly avatar Sep 01 '24 08:09 boydkelly

Parsing and detecting errors in this utility is handled by https://pkg.go.dev/encoding/csv#Reader

It seems to complain if a quote is not the first or last character in the field.

  1. In your sample text is the double quoted field delimited by tabs as in dondon ko\t"ken ken kileri kɛ".\tdyu ?

  2. Or is there whitespace before the leading quote as in dondon ko\t "ken ken kileri kɛ".\tdyu ?

Only the second case throws the error for me.
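A minimal sketch for reproducing the second case directly against encoding/csv, outside of csvlint. The helper names `parseTSV` and `hasBareQuoteError` and the sample lines are made up for illustration; only `csv.NewReader`, `Reader.Comma`, `Reader.Read`, and `csv.ErrBareQuote` come from the standard library.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"strings"
)

// parseTSV (hypothetical helper) reads one record from a tab-delimited
// string using encoding/csv with the delimiter switched to a tab.
func parseTSV(line string) ([]string, error) {
	r := csv.NewReader(strings.NewReader(line))
	r.Comma = '\t'
	return r.Read()
}

// hasBareQuoteError (hypothetical helper) reports whether parsing the line
// fails with csv.ErrBareQuote, the 'bare " in non-quoted-field' error.
func hasBareQuoteError(line string) bool {
	_, err := parseTSV(line)
	return errors.Is(err, csv.ErrBareQuote)
}

func main() {
	lines := []string{
		// Case 2: a space precedes the opening quote, so the field is
		// treated as non-quoted and the quote counts as "bare".
		"id1\tdondon ko\t \"ken ken kileri kɛ\".\tdyu\n",
		// Control: the same record with no quote characters at all.
		"id2\tdondon ko\tken ken kileri kɛ.\tdyu\n",
	}
	for _, l := range lines {
		fmt.Printf("bare-quote error: %v\n", hasBareQuoteError(l))
	}
}
```

The same error should appear whenever a quote shows up anywhere after the first character of a field, which is probably what happens in record 1035 if the quote follows a space rather than a tab.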

kmatt avatar Sep 03 '24 15:09 kmatt

It certainly could be the second case. Since this is foreign-language prose and not 'clean' text, the expectation is that once a file is defined as tab-delimited, it should not matter whether or where a quote occurs. So in your second example the text should lint 'properly' as follows, with \t replaced by a line feed:

dondon "ken ken kileri kɛ". dyu

So it looks like the bug is with csv#Reader?

I'm really just checking that the number of columns is accurate. For now awk will do the job, but it would be great to see TSV handled correctly here.
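For reference, the column-count check can also be sketched in Go without encoding/csv, treating quotes as ordinary characters. This is only an illustration of the idea, not how csvlint works: `checkColumns` and `checkColumnsText` are hypothetical names, and the sketch assumes the first line defines the expected column count and that no field contains a literal tab or newline.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// checkColumns (hypothetical helper) returns the 1-based numbers of lines
// whose tab-separated field count differs from that of the first line.
// Quote characters get no special treatment.
func checkColumns(r io.Reader) ([]int, error) {
	sc := bufio.NewScanner(r) // note: default Scanner limit is 64 KiB per line
	want := -1
	var bad []int
	for n := 1; sc.Scan(); n++ {
		got := strings.Count(sc.Text(), "\t") + 1
		if want < 0 {
			want = got // the first line sets the expected column count
			continue
		}
		if got != want {
			bad = append(bad, n)
		}
	}
	return bad, sc.Err()
}

// checkColumnsText is a convenience wrapper for in-memory input.
func checkColumnsText(text string) ([]int, error) {
	return checkColumns(strings.NewReader(text))
}

func main() {
	sample := "id\tdondon ko \"ken ken kileri kɛ\".\tdyu\n" +
		"id\tonly two columns\n" +
		"id\tplain text\tdyu\n"
	bad, err := checkColumnsText(sample)
	fmt.Println(bad, err)
}
```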

boydkelly avatar Sep 03 '24 17:09 boydkelly

So it looks like the bug is with csv#Reader?

I'm not certain whether it's a bug or not, because the Reader docs are not explicit about tab-delimited data.

-lazyquotes may be an option in this case.
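A rough sketch of what that would mean at the library level, assuming the -lazyquotes flag maps onto the `LazyQuotes` field of `csv.Reader` (an assumption about csvlint's internals). The helper name `readLazy` and the shortened ID in the sample are made up for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// readLazy (hypothetical helper) reads one tab-delimited record with
// LazyQuotes enabled, so a quote inside a non-quoted field is kept as data.
func readLazy(line string) ([]string, error) {
	r := csv.NewReader(strings.NewReader(line))
	r.Comma = '\t'
	r.LazyQuotes = true
	return r.Read()
}

func main() {
	// Shaped like the record from the report: five tab-separated fields,
	// with quotes in the middle of the second one.
	line := "9010c36f\tdondon ko \"ken ken kileri kɛ\".\tdyu\texm\tdyuEx\n"
	rec, err := readLazy(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(rec), "fields")
	fmt.Println(rec[1])
}
```

One caveat: as far as I can tell, a field that begins with a quote is still parsed as a quoted field even with `LazyQuotes` set, so this relaxes the check rather than making quotes fully literal.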

kmatt avatar Sep 03 '24 18:09 kmatt

I'll just use awk. The whole point of tab delimiters is to avoid the numerous problems of quote delimiters. In a tab-delimited file, quotes should be treated as nothing more than ordinary characters. I guess csv#Reader is true to its name: comma separated. It does not understand tabs correctly.

boydkelly avatar Sep 03 '24 18:09 boydkelly