vscode_rainbow_csv icon indicating copy to clipboard operation
vscode_rainbow_csv copied to clipboard

Supporting linebreaks inside double quoted csv fields

Open jovannic opened this issue 7 years ago • 24 comments

Key,Text
X,"Hello, mechatroner,
This, incorrectly interpreted as a new row with two columns"

image

According to RFC 4180 the quoted entry spanning a couple lines should be a single cell (i.e. everything after "Hello" in this example should be in blue)

jovannic avatar Feb 26 '18 20:02 jovannic

RFC 4180 is not a "standard". And according to RFC 4180 the only legal line separator in csv files is CRLF, so if we were to follow the letter of that memo we would have to stop highlighting half of the files because they have wrong LF separators. So we can't follow the letter of the memo, but maybe we should follow the spirit of it instead? Actually Rainbow CSV already almost does it because it correctly handles escaped separators and double quotes inside quoted fields when there is no line break inside that field. The only discrepancy is when there is a record separator inside a field, as you've noticed. Now I have some arguments why allowing linebreaks inside a field in csv is a terrible idea:

  1. Most people do not know that some fields in csv can contain linebreaks. Actually some significant portion of programmers would even think that parsing csv as fields = line.split(',') is a good idea
  2. csv with linebreak is not context free anymore. If we have one line with unbalanced double quote it would break the whole file, whereas in many applications where csv's are used it is acceptable to just throw away some corrupted lines.
  3. suppose we have a very large file, and we opened it first at the beginning and then in the middle, and we see something like this:
Name,Age,Comment
Jane,29,Nothing special
...
John,30,Nothing special
Mary,25,Drives a red car
John,31,Nothing special
Adam,25,Likes cats
Nancy,19,Nothing special
Tom,40,Nothing special
...

Now if we allow line breaks in fields we can't tell whether or not "Adam" is in the database, because somewhere at the beginning of the file we could have something like this:

Name,Age,Comment
Jane,29,Nothing special
...
Jack,30,Likes to play football
Sarrah,40,"Has a long list of imaginary friends:
...
John,30,Nothing special
Mary,25,Drives a red car
John,31,Nothing special
Adam,25,Likes cats
Nancy,19,Nothing special
Tom,40,Nothing special
...
Jack,30,Suspects something"
Mike,25,Doesn't like cats
...
Mike,30,Nothing special
  1. grep, 'wc -l', head, tail would fail on csv files with linebreaks.
  2. If we would implement a correct VS Code grammar to handle linebreaks in fields, how it would handle a case when a double quote is inserted by mistake into a large csv file without any double quotes? Probably there are some mechanisms that would prevent a spectacular fail, but anyway it could be funny.

So currently my opinion is that CSV files with linebreaks is not practical format, and because of that Rainbow CSV should treat lines with unbalanced double quotes as an error and should not try to find a matching double quote somewhere below.

mechatroner avatar Feb 27 '18 04:02 mechatroner

Sorry for the late response.

Yes, working with CSVs containing commas and newlines inside of a cell value is a pain...

It does mean that CSV is not context free and the file line count doesn't equal the row count. It's also the kind of place where Rainbow CSV would be the most helpful. If my CSVs had no newlines or commas in records I wouldn't have needed Rainbow CSV. It's these more complex files where I really wish it supported quoted values. Consider a setting?

It's a valid stance to say CSV has no definitive standard and you chose to not handle quoted values at all, but I believe that's less useful in practice. I would assume RFC 4180-ish parsing would be the more common expectation, but maybe I'm wrong?

Yes RFC 4180's choice of line endings is bizarre and a sad concession for Excel compatibility. I believe any sane parser would handle any style of line endings. Ours does. Most open-source parsers that I have looked at do.

I've tried a number of popular CSV editors including Excel, Google Sheets, Libre Office, CSVPad, and various less popular editors. All of them allow values containing commas and newlines inside of a cell and quote cells using these special characters.

As is Rainbow CSV's value is a bit limited for anyone working with CSVs containing arbitrary user data emitted by any popular editor. My use case is translation tables which often have text containing commas and newlines. We can't forbid commas, so we follow RFC 4180 with exception of the line endings. I'm editing parser test suite data by hand and generally avoiding opening a more heavyweight editor for small table tweaks. Editing a non-trivial RFC 4180-ish CSV by hand with no visual assist can be challenging. Rainbow CSV would be all I'd need.

Comments on your numbered points:

  1. Many "programmers" can't even handle fizzbuzz. I'm not sure if most programmers making a simplistic assumption is an argument for not supporting it? As tool/library writers our job is to think about those things for the people who won't. For any use case with user data you can't just say "no commas allowed." and in that case line.split(',') falls apart.

For all the rest of your points... Those are all real problems. Ignoring them is a an option but I'm not sure if it's the most appropriate one? They might just be intrinsic to the format...

jovannic avatar Mar 04 '18 19:03 jovannic

OK, You are right, it would be reasonable to support linebreaks in fields in a special mode. So I spent some time trying to implement a grammar that would do it. I've started with the grammar that I use in Atom, it is similar to VSCode grammar (since both editors use TextMate-based grammars), but has some differences. So the main regex in Atom consists of 10 identical parts:

(?:((?:"(?:[^"]*"")*[^"]*"(?:,|$))|(?:[^",]*(?:,|$)))|(.+))?

This smaller regex, in turn, consists of 3 parts separated by OR | character:

  1. (?:"(?:[^"]*"")*[^"]*"(?:,|$)) - here we try to match double quoted csv field e.g. the second field in this line: 1234,"hello,""world""",1234
  2. (?:[^",]*(?:,|$)) - tries to match unquoted csv field
  3. (.+) - If neither of 1 and 2 has matched - that means that csv line is incorrectly formatted, so we produce "error" match until the end of the line.

So it is easy to modify part 1 and 3, so that quoted field would match across linebreaks and "error" would match till the end of the file. part 1 would look like this (?:"(?:(?:[^"]|\n)*"")*(?:[^"]|\n)*"(?:,|$)) and part 3 like this: ((?:.|\n)+). But this will not work, because Oniguruma/TextMate regexps will not match across linebreaks quote:

Note that the regular expressions are matched against only a single line of the document at a time. That means it is not possible to use a pattern that matches multiple lines. The reason for this is technical: being able to restart the parser at an arbitrary line and having to re-parse only the minimal number of lines affected by an edit. In most situations it is possible to use the begin/end model to overcome this limitation.

So it is currently not possible to use the modified regexp to match across multiple lines (unless VSCode would introduce a special "xlmatch" grammar keyword that would allow usage of \n in it). Someone could try to use the mentioned "begin/end" TextMate model, but I would be very suprised if it would work.

I will probably try to modify rainbow grammar in Vim to support linebreaks in fields just to see whether it would work or not.

And again there is a fundamental difference between csv which allow linebreaks and csv that doesn't. Many algorithms (like syntax highlighting, column number query, random sampling) will work in O(N) on the former and in O(1) on the later.

mechatroner avatar Mar 08 '18 03:03 mechatroner

I made a proof of concept in Vim

image

When highlighting gets glitchy this command fixes it:

:syntax sync fromstart

mechatroner avatar Mar 10 '18 15:03 mechatroner

Jut want to comment I would love to see this implemented in the vs code verison

krazyito65 avatar Apr 25 '18 19:04 krazyito65

How is going on? I hope this will help. https://github.com/jjuback/gc-excelviewer/commit/30a43987dd715736c784befbe497ae3e961e3817

ffxivvillein avatar Jul 19 '19 13:07 ffxivvillein

@ffxivvillein VSCode language grammars do not match across newlines, so even if we include newline in regexp it won't work.

mechatroner avatar Jul 19 '19 22:07 mechatroner

Based on the duplicate issues being opened, there really is a practical need to have this feature. From my personal perspective, I'd prefer to have the special mode on by default (in our environment, we write and read CSV files programmatically with csv libraries, and in all of them multi-line values are treated as valid cells), but I'd be happy to manually switch to "the multi-line mode" explicitly, if necessary.

Rots avatar Feb 03 '20 08:02 Rots

@Rots, I understand that there is a practical need for this. Actually I've recently implemented an option for RBQL console to correctly handle multiline values. This can be useful for cases when it is not required to keep the original file i.e. it is acceptable to replace newlines with e.g. 4 spaces. In this case you can write a query like this (with Python backend): select a1.replace('\n', ' '), a2.replace('\n', ' ') And get a new file with newlines replaced by the special sequence. Also "rfc_csv" syntax is now fully supported in Vim version of "Rainbow CSV". Existing textmate syntax engine for VSCode does not support newlines in regular expression. Currently VScode team is considering to adopt a new tokenization engine: https://github.com/microsoft/vscode/issues/77140 I will leave a comment on that ticket to list the set of features that we might need to implement multi-line values highlighting.

mechatroner avatar Feb 04 '20 01:02 mechatroner

I implemented multiline fields support in Sublime Text version of "Rainbow CSV": https://packagecontrol.io/packages/rainbow_csv Still no solution in sight for VSCode.

mechatroner avatar Sep 16 '20 02:09 mechatroner

What about multi-line strings (marked by double-quotes) being correctly interpreted in other formats (e.g. programming languages) in VSCode? Is it done by something else rather than the syntax engine you mentioned?

JanisE avatar Sep 02 '21 12:09 JanisE

@JanisE Good question! Yes, it may seem that many other grammars in VSCode don't have any problems with multiline strings. But this is because they are using begin/end matching which Rainbow CSV can't use. The 2 types of matching are described here: https://macromates.com/manual/en/language_grammars#language_rules Rainbow CSV has to rely only on the first matching type which is provided in a single match expression. We can't use begin/end matching because CSV-s are non-recursive.

mechatroner avatar Sep 03 '21 04:09 mechatroner

I came looking for exactly this. I was getting confused/frustrated that the CSVLint and highlighting wasn't working for our "valid" quoted multi-line values, especially given that I saw the option for it in the RBQL Console which worked perfectly. I realize this is due to the technical limitations outlined above but it's still really really unfortunate, our use case warrants being able to have multi-line values. I guess a halfway ask is whether the CSVLint processor can accept quoted multi-line strings as valid even if the highlighting can't show it? I'd be perfectly fine with a setting to enable/disable as desired, same as in the RBQL Console.

jjspace avatar Nov 11 '21 18:11 jjspace

@jjspace Thanks, I think it is possible to adjust CSVLint so that it can accept quoted multi-line strings, and I agree that this could be a useful improvement in some cases.

mechatroner avatar Nov 13 '21 23:11 mechatroner

I am exploring a potential path forward to support multiline highlighting using a combination of these two VSCode API features:

  1. onDidChangeTextEditorVisibleRanges event which would allow dynamic highlighting of the currently visible file fragment.
  2. setDecorations method (or alternatively semantic fragment coloring - haven't tested this feature yet).

This approach looks promising, but there are some issues associated with it (including the dynamic nature of the highlighting and out-of-theme colors) so I am not sure if the end result would be decent enough to publish.

mechatroner avatar May 30 '22 17:05 mechatroner

I stumbled on this issue the other day, and decided to try my hand at it.

I am not certain it can be extended to all valid CSV separators, but I found a way to achieve multiline quote support using nested "begin" and "end" matchers. I thought I'd share it in case it turns out to be useful.

Basically, each pattern matches the entire rest of the row, but defines a nested pattern for the next color. Once you get to the last color, you use "$self" to loop back to the start of the grammar. I also use a lookahead matcher to define "end", so that a single newline character can end all active rainbow patterns.

{
  "name": "csv syntax",
  "scopeName": "text.csv",
  "fileTypes": ["csv"],
  "patterns": [
    {
      "name": "variable.other.rainbow1",
      "begin": "^|\\,",
      "end": "(?=\\n)",
      "patterns": [
        { "include": "#quotedvalue" },
        {
          "name": "keyword.rainbow2",
          "begin": "\\,",
          "end": "(?=\\n)",
          "patterns": [
            { "include": "#quotedvalue" },
            {
              "name": "entity.name.function.rainbow3",
              "begin": "\\,",
              "end": "(?=\\n)",
              "patterns": [
                { "include": "#quotedvalue" },
                { "include": "$self" }
              ]
            }
          ]
        }
      ]
    }
  ],
  "repository": {
    "quotedvalue": {
      "begin": "\"",
      "end": "\""
    }
  },
  "uuid": "ca03e352-04ef-4340-9a6b-9b99aae1c418"
}

ajhyndman avatar Jul 21 '22 04:07 ajhyndman

Thank you very much, @ajhyndman! Your syntax is very impressive, took me some time to understand how it works. I was absolutely sure that there is no way to implement something like this using TextMate grammar, turns out I was wrong about it. I will definitely consider using your syntax as a backup plan or as an improvement for the solution that I am currently developing in the rfc branch - I recently discovered two other approaches using VSCode tokenization API, one of them being vscode.languages.registerDocumentRangeSemanticTokensProvider - it allows to highlight just the lines which are currently visible in the editor window + some margin lines around them. This allows to fully support RFC 4180 and also use dynamic highlighting for all possible separator strings (even multi-char separators) and also highlight comment lines. But the semantic tokenization method has some drawbacks too, soo it is great that we now have this alternative solution! I hope to publish something within the next month.

mechatroner avatar Jul 22 '22 04:07 mechatroner

I just published version 3.0.0 which supports multiline fields with some limitations and raw edges, the new RFC4180-compatible dialect is implemented a separate "syntax" which is called "dynamic CSV" - it means that highlighting is done through VSCode syntax tokens mechanism instead of a pre-build grammar. The correct multiline "dynamic CSV" dialect can trigger in 2 cases:

  1. Automatically, when Rainbow CSV detects that the newly opened CSV file contains multiline fields.
  2. Manually, you need to select separator (currently only , and ; are supported) and run Rainbow CSV: Set separator - Quoted Multiline Fields command.

If everything goes well I will soon publish version 4.0.0 with the syntax proposed by @ajhyndman which will combine two CSV dialects into one.
BTW version 3.0.0 has also some other features such as comment lines highlighting and support for infinite number of single-character and multicharacter separators - this is the reason why dynamic syntax highlighting is crucial and will remain as one of the highlighting mechanisms of the extension anyway. But pre-build grammars are still superior in many ways so I am planning to continue using them for popular dialects such as csv, tsv, pipe-separated files, and some others.

mechatroner avatar Sep 10 '22 02:09 mechatroner

I've not been able to get this to work with the demo csv from the top of this thread

Key,Text
X,"Hello, mechatroner,
This, incorrectly interpreted as a new row with two columns"
image

Running the command Rainbow CSV: Set separator - Quoted Multiline Fields results in the following error

Only comma and semicolon separators are currently supported to use with multiline fields.

Running the command Rainbow CSV: Set separator - Basic appears to have no effect when a comma , is selected; although it does have an amusing side effect when no character is selected 😝

image

petervandivier avatar Sep 12 '22 13:09 petervandivier

@petervandivier I just tested and it worked for me. Although I discovered that when the filetype is already "Dynamic CSV" the highlighting doesn't change immediately and I have to click on another tab and then back for the new highlighting to take effect, this has to be fixed ( I would also have to fix the amusing side effect, sorry. The separator should not be an empty string, of course). And I noticed on your screenshot that the filetype is CSV, but it should be "Dynamic CSV", are you sure that the comma character is selected when you are running the command? This is how it looks for me: rainbow_screen

mechatroner avatar Sep 13 '22 23:09 mechatroner

I thought maybe I was having a bad interaction with another extension but I've tried this in a VM with a fresh VS Code install and Rainbow CSV as the only extension and the behavior still seems off ☹️ (also tried tab switching though that's not reflected in the screencaps)

image image

Notably(?), Dynamic CSV doesn't lint at all for me in either my normal install or the VM 😕

Sorry I can't figure it out. Hope the feedback is helpful.

petervandivier avatar Sep 14 '22 18:09 petervandivier

@petervandivier Could you please try it at http://vscode.dev please and perhaps check if Set separator -Basic works with some exotic separators like 0 (this would ensure that a non-builtin highlighting is used). Also, I just published version 3.1.0 (with some bug fixes mostly) which allows selecting a separator in the pop-up menu after Dynamic CSV filetype selection in the bottom right corner. Can I also ask you to rename the file (with non-csv extension) and perform the experiment again and maybe also just copy the file content into a new empty text file (File -> New Text File) and maybe also open a large text file with multiline fields e.g. this one: https://github.com/okfn/publicbodies/blob/main/data/br.csv (the test is important because it allows the autodetection to kick in, which would be the most frequent use case)?

mechatroner avatar Sep 17 '22 22:09 mechatroner

@petervandivier Actually I was just able to reproduce this, one of the problems is that the file is too short (if you make it 10 lines at least is will work differently) for it to be recognized as multiline csv (I might fix this later) so it gets assigned the standard csv dialect instead of rfc csv (dynamic csv) because it has .csv extension. Now when you switch it to "Dynamic CSV" through the filetype selection menu (instead of using Rainbow CSV: Set separator - Quoted Multiline Fields, the old dialect (tuple separator, policy) gets carried over (the 3.0.0 had a bug/feature to prevent it from working with the standard quoted csv dialect). So there are some other issues here that I can fix e.g. not to carry over dialects which were assigned automatically, only carry over to dynamic csv manually selected dialects, so I will fix this. Plus finally enabling pre-build universal rfc-compatible csv grammar will fix this too. The bottom line is that we will get it right. And thanks for reporting this, I will do some adjustments in the next release.

mechatroner avatar Sep 18 '22 05:09 mechatroner

Confirm the large file works locally for me (although it was briefly auto-detected as Dynamic CSV before my markdown extension highjacked the language mode to markdown and I had to manually set it back). The large file also works on vscode.dev

The small file also works locally now on 3.1.0 - worth noting though that the comma character must be selected when invoking Rainbow CSV: Set separator - Quoted Multiline Fields

image

Thanks! 🥳

petervandivier avatar Sep 20 '22 15:09 petervandivier

Update: I recently published version 3.3 added support for column alignment for CSV files with multiline fields - this provides feature parity between single-line and multiline csv files.

I was also experimenting with the grammar created by @ajhyndman which I want to use as the default syntax for csv files. So far I discovered one issue - if a csv file (even a regular one, without multiline fields) contains a comment line with unbalanced double quotes (this probably doesn't happen very often, but still) it would mess up highlighting for the whole file, e.g.

# This is a table with various brands, such as "Foo", "Bar, "Foobar"
Brand,Year
Foo,1997
Bar,2003
XYZ,1998
ETC,2003

Setting # as a comment prefix doesn't help here because semantic highlighting and grammar highlighting work on different levels.

There are some workarounds for this, e.g. using dynamic csv when the user chooses a comment prefix (instead of using dynamic csv for rfc files as it is currently implemented).

mechatroner avatar Nov 16 '22 04:11 mechatroner