TSV format: double quotes inside fields without double quotes enclosing
Linked issue: #60 Double quotes break tooltips and colouring
I get an error, and your parser doesn't work correctly, when I use a file like this (for example, test.tsv):
id title text
1 Doesn't work This sentence contains double quotes and is very "problematic" for parser
2 Record after This record is badly formatted, too (colors). Its due to error in the 2nd record
I use the TSV format without enclosing fields in double quotes, so double quotes are regular characters for me. There is no reason to parse them as characters with a special meaning. To store special characters such as \n, \t, \r and \, I use this format[1][2][3]:
char  desc
\t    Tab char
\n    New line
\r    Carriage return
\\    Backslash char
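To make the escape table concrete, here is a minimal Python sketch of the decoding side. The names `unescape_field` and `parse_line` are hypothetical, and leaving unknown sequences (for example `\x`) untouched is an assumption, since nothing above says how they should be handled. This is not the parser's actual code.

```python
import re

# Mapping taken from the escape table above.
_UNESCAPE = {"t": "\t", "n": "\n", "r": "\r", "\\": "\\"}


def unescape_field(field):
    """Decode \\t, \\n, \\r and \\\\ inside a single TSV field."""
    # Each backslash consumes exactly one following character, so an
    # escaped backslash followed by "t" is never misread as a tab.
    return re.sub(
        r"\\(.)",
        lambda m: _UNESCAPE.get(m.group(1), m.group(0)),  # unknown: keep as-is
        field,
    )


def parse_line(line):
    """Split on real tab characters first, then decode every field."""
    return [unescape_field(f) for f in line.rstrip("\r\n").split("\t")]
```

Because the split happens before decoding, a field can carry an escaped tab or newline without disturbing the column layout, and double quotes pass through unchanged.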
Recently you referred to RFC 4180. It contains this sentence:
- If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:
"aaa","b""bb","ccc"
The most important part for us is "If double-quotes are used to enclose fields". In my case (and in issue #60), that condition is not true, so the rest of the sentence isn't relevant.
In my opinion, double quotes are regular characters in this use case, so your parser should respect that. What do you think?
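For illustration only, this is roughly the behaviour being requested, sketched with Python's standard `csv` module rather than with this project's parser. The sample rows are shortened from the test.tsv example above, and `read_tsv_literal_quotes` is a hypothetical helper name.

```python
import csv
import io

# Shortened version of the test.tsv example, with real tab separators.
sample = (
    "id\ttitle\ttext\n"
    '1\tDoesn\'t work\tThis sentence contains "problematic" quotes\n'
    "2\tRecord after\tThis record should be unaffected\n"
)


def read_tsv_literal_quotes(text):
    """Read TSV in which double quotes are ordinary characters."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # no special processing of quote characters
    )
    return list(reader)


rows = read_tsv_literal_quotes(sample)
```

With `csv.QUOTE_NONE` the reader splits only on the delimiter, so the quotes in record 1 stay inside its text field and record 2 is parsed independently.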
The current parser implementation, based on RFC 4180, is correct with respect to those double quotes. The provided CSV example isn't, though.
You are quoting one sentence from RFC 4180 out of context. The preceding bullet points read as follows:
- Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.
- Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
So the sentence "If double-quotes are used to enclose fields..." essentially says that double quotes are not required for fields that do not contain "line breaks (CRLF), double quotes, and commas", but IF a field does contain them, then it has to be enclosed, and "a double-quote appearing inside a field must be escaped by preceding it with another double quote".
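As a quick check of the rules quoted above, here is a small sketch using Python's standard `csv` module, whose default dialect encloses a field only when needed and doubles embedded quotes. It is shown only to illustrate the RFC 4180 convention, not how this project's parser is implemented; the helper names are hypothetical.

```python
import csv
import io


def parse_csv_line(line):
    """Parse one CSV line with the default dialect (quote doubling on)."""
    return next(csv.reader([line]))


def write_csv_line(fields):
    """Write one CSV line; only fields that need it get enclosed."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    return buf.getvalue()


# The RFC example: the doubled quote collapses to a single one.
parsed = parse_csv_line('"aaa","b""bb","ccc"')

# Going the other way, only the field containing a quote is enclosed.
written = write_csv_line(["aaa", 'b"bb', "ccc"])
```

Here `parsed` is `['aaa', 'b"bb', 'ccc']`, and `written` is `aaa,"b""bb",ccc` followed by CRLF.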
I suppose you are requesting the very same escape strategy as described in #180, which I still consider a valid enhancement that I haven't had the time to work on yet.
> I suppose you are requesting the very same escape strategy as described in #180, which I still consider a valid enhancement that I haven't had the time to work on yet.
That is a strategy for CSV, so it's a little different. I don't use <TAB>; I use \t instead. I use this (as I mentioned in the first post of this issue):
char  desc
\t    Tab char
\n    New line
\r    Carriage return
\\    Backslash char
The example from #180 would look like this (with tabs instead of commas):
Hello \t World 1
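A minimal Python sketch of the writing side of this convention follows. It assumes the line above consists of two fields, a text field containing a literal tab and the number 1; that reading is a guess, since #180 is not reproduced here. `escape_field` and `format_row` are hypothetical names.

```python
# Mapping from the table above. The backslash itself is escaped as well,
# so that decoding stays unambiguous.
_ESCAPE = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}


def escape_field(field):
    """Replace backslash, tab, LF and CR with their two-character escapes."""
    return "".join(_ESCAPE.get(ch, ch) for ch in field)


def format_row(fields):
    """Escape every field, then join the fields with real tab characters."""
    return "\t".join(escape_field(str(f)) for f in fields)


# Assumed fields: a string containing a literal tab, and the number 1.
line = format_row(["Hello \t World", 1])
```

Here `line` holds `Hello \t World`, with a backslash and a `t` in place of the tab, then a real tab, then `1`. Double quotes need no treatment at all under this scheme.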
> You are referencing one sentence from RFC 4180 out of the context. The previous bullet points read as follows:
Yes, you're right. But this is an RFC for CSV files, not TSV. The TSV format is only based on CSV, and in practice it is used a little differently (as I wrote).
> haven't had the time to work on yet.
Ok, no problem.