Handle non-RFC-compliant backslash-escaped quotes in CSV

Open jiripsota opened this issue 6 years ago • 1 comments

I found a fairly serious problem with backslash-escaped quotes: parsing fails when the escaped quote is immediately followed by the field-separator character.

comma-backslash.csv
id,name,price
1,"ESCAPING QUOTES WITH BACKSLAH \" WORKS",123.44
2,"COMBINATION WITH BACKSLASH-ESCAPED QUOTES\", AND COMMA CHAR AFTER QUOTES DOES NOT WORK",666

mlr --csv check comma-backslash.csv
mlr: syntax error: unwrapped double quote at line 2.

The result is exactly the same with a different field separator, e.g. a semicolon.

semicolon-backslash.csv
id;name;price
1;"ESCAPING QUOTES WITH BACKSLAH \" WORKS";123.44
2;"COMBINATION WITH BACKSLASH-ESCAPED QUOTES\"; AND SEMICOLON CHAR AFTER QUOTES DOES NOT WORK";666

mlr --csv --ifs semicolon check semicolon-backslash.csv
mlr: syntax error: unwrapped double quote at line 2

When the quotes are escaped by doubling them instead, everything works properly.

double-quotes.csv
id;name;price
1;"ESCAPING USING DOUBLE QUOTES "" WORKS";123.44
2;"COMBINATION WITH DOUBLE QUOTES""; AND SEMICOLON CHAR WORKS";666

mlr --csv --ifs semicolon check double-quotes.csv

jiripsota, Sep 13 '19 07:09

The issue is that in RFC-compliant CSV, the way to escape a double quote is to double it: "" rather than \". This is in contrast to spec-compliant JSON, which uses \" rather than "".

Examples:

$ echo '{"a":"b""c""d"}' | jq .
parse error: Expected separator between values at line 1, column 11
$ echo '{"a":"b\"c\"d"}' | jq .
{
  "a": "b\"c\"d"
}
$ echo '{"a":"b""c""d"}' | mlr --ijson --oxtab cat
mlr: Unable to parse JSON data: Line 1 column 0: Expected , before "
$ echo '{"a":"b\"c\"d"}' | mlr --ijson --oxtab cat
a b"c"d
$ mlr --icsv --oxtab cat <<EOF
a,b,c
1,"2,\"3\",4",5
EOF
mlr: syntax error: unwrapped double quote at line 1.
$ mlr --icsv --oxtab cat <<EOF
a,b,c
1,"2,""3"",4",5
EOF
a 1
b 2,"3",4
c 5
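
The same contrast can be reproduced outside Miller. The following is a minimal sketch using Python's standard json and csv modules; the helper name parse_csv_strict is made up for this illustration, and strict=True is used only so that the parser reports malformed quoting instead of silently accepting it.

```python
import csv
import io
import json


def parse_csv_strict(text):
    """Parse CSV text with RFC-style quoting; raise csv.Error on bad quotes.

    Hypothetical helper for this illustration, not part of Miller.
    """
    return list(csv.reader(io.StringIO(text, newline=""), strict=True))


# JSON escapes a double quote with a backslash.
print(json.loads('{"a":"b\\"c\\"d"}'))

# RFC-style CSV escapes a double quote by doubling it.
print(parse_csv_strict('a,b,c\n1,"2,""3"",4",5\n'))

# The backslash-escaped form is rejected by a strict RFC-style parser.
try:
    parse_csv_strict('a,b,c\n1,"2,\\"3\\",4",5\n')
except csv.Error as err:
    print("csv.Error:", err)
```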

So this turns out to be a request to handle non-RFC-compliant CSV.

Which is not a bad idea, but it isn't a bug; it's a design change.
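
One possible workaround that does not depend on any change in Miller is to convert the file to RFC-style quoting before handing it to mlr. The sketch below assumes the producer uses a backslash as the escape character inside quoted fields; the function name backslash_to_rfc is hypothetical. It reads with Python's csv module configured for backslash escapes and writes the rows back out with doubled quotes.

```python
import csv
import io


def backslash_to_rfc(text, delimiter=","):
    """Rewrite backslash-escaped quotes (\\") as RFC-style doubled quotes ("").

    Assumption: a backslash inside a quoted field escapes the next character.
    """
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        doublequote=False,   # input does not use "" for a literal quote
        escapechar="\\",     # input uses \" instead
    )
    out = io.StringIO(newline="")
    # The default writer settings double any quote inside a quoted field.
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


if __name__ == "__main__":
    sample = 'id;name;price\n2;"QUOTES\\"; AND SEMICOLON";666\n'
    print(backslash_to_rfc(sample, delimiter=";"), end="")
```

The converted text can then be saved or piped to a command such as mlr --csv --ifs semicolon check. A plain textual replacement of \" with "" would be shorter, but it would mishandle a field that ends in an escaped backslash, which is why the sketch goes through a parser.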

johnkerl, Sep 21 '19 21:09