
Add a preference entry to let user change the delimiter ( ; or tab or | etc...)

Open pieplu opened this issue 7 years ago • 40 comments

It's all in the title :)

pieplu avatar Jan 11 '18 20:01 pieplu

In short: there is currently no API in VSCode to do this. A request to add it was created 2 years ago and it is still open: https://github.com/Microsoft/vscode/issues/1800 I will mention this problem in the VSCode issue.

Rainbow highlighting is implemented as a "language" and requires a syntax file for each delimiter. It is not hard to generate many syntax files (2 for each possible delimiter, because we want both quoted and non-quoted variants), but they would pollute the language selection menu, and selecting the appropriate delimiter would be pretty inconvenient. The optimal way, I think, is to let the user select a delimiter in the file with the mouse cursor and then choose an option from the VSCode context menu to use it as a delimiter (quoted or unquoted).

mechatroner avatar Jan 12 '18 04:01 mechatroner

In fact it's pretty much an Excel issue because someone at MS decided to localize csv files so that Comma Separated actually means Semicolon Separated in German.

Anyway, we Germans have to live with that decision and this issue describes a real every day work issue.

Lercher avatar Feb 09 '18 15:02 Lercher

@Lercher Interesting, I hadn't thought much about this problem before. BTW, the Vim version of rainbow csv doesn't rely on file extension; instead there's a content-based detection algorithm which checks two separators by default, comma and TAB. Since you are saying that semicolon is so popular in Europe, I will add it to that list. And again, once https://github.com/Microsoft/vscode/issues/1800 is resolved, the content-based autodetection approach could be used in this extension too. For now I will just add a semicolon syntax grammar with the .scsv extension, which no one uses. At least this would allow manual semicolon selection.

mechatroner avatar Feb 10 '18 03:02 mechatroner

Just published a new version with semicolon separator, which has to be manually selected from the list of languages. Waiting for the linked VSCode ticket to add all possible ascii separators and content-based autodetection.

mechatroner avatar Feb 10 '18 21:02 mechatroner

Cool. Works on my machine. Thanks!

Lercher avatar Feb 11 '18 00:02 Lercher

Fiddled around with adding a new language but missing something. How about pipe separated? I would have thought copying the scsv language and updating the extension.js file would have done it but alas I've been defeated.

boeningc avatar Apr 02 '18 22:04 boeningc

@boeningc Did you modify the new pipe.tmLanguage.json file? You need to replace ; with | and prepend it with two \\ backslashes, one for regexp, another one for exterior json. The result will look like this:

    "patterns": [
        { "match": "((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?",

Also if you don't expect your pipe-separated files to contain double-quoted pipes, it would be better to modify tsv.tmLanguage.json instead.
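For anyone adding another separator, the repeated capture groups above do not have to be typed by hand. A minimal sketch, assuming hypothetical helper names `escapeRegex` and `buildQuotedPattern` (neither comes from the extension's source), that builds the same ten-group "quoted" pattern for a single-character delimiter:

```javascript
// Hypothetical helpers for illustration; not taken from the extension's source.

// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build the "quoted" grammar pattern: one optional capture group per column.
// A field is either a double-quoted string (with "" as an escaped quote) or a
// run of non-delimiter characters, followed by the delimiter or end of line.
function buildQuotedPattern(delim, numGroups = 10) {
    const d = escapeRegex(delim);
    const group = `((?:"(?:[^"]*"")*[^"]*"(?:${d}|$))|(?:[^${d}]*(?:${d}|$)))?`;
    return group.repeat(numGroups);
}

// JSON.stringify adds the second layer of backslashes (and escapes the
// double quotes) needed when the pattern is stored in a .tmLanguage.json file.
const pipeMatch = JSON.stringify(buildQuotedPattern('|'));
```

The first backslash comes from `escapeRegex` (the regexp layer) and the second from `JSON.stringify` (the JSON layer), which is the two-level escaping described above.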

mechatroner avatar Apr 03 '18 03:04 mechatroner

I did create a new file and change the regex to use 2 \\. I took the TSV pattern and changed \\t to \\|

What I'm not seeing is the option in the languages selection. Sorry I wasn't clear about that earlier.

boeningc avatar Apr 03 '18 05:04 boeningc

    {
        "name": "pipe syntax",
        "scopeName": "text.pipe",
        "fileTypes": ["pipe"],
        "patterns": [
            { "match": "([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)",

    var dialect_map = {'csv': [',', 'quoted'], 'tsv': ['\t', 'simple'], 'csv (semicolon)': [';', 'quoted'], 'csv (pipe)': ['\|', 'simple'] };

    var pipe_provider = vscode.languages.registerHoverProvider('csv (pipe)', {
        provideHover(document, position, token) {
            return make_hover(document, position, 'csv (pipe)', token);
        }
    });

boeningc avatar Apr 03 '18 05:04 boeningc

@boeningc What about package.json? Did you modify it? And you probably don't need the backslash in dialect_map.

mechatroner avatar Apr 04 '18 02:04 mechatroner

DOH! I did not. Didn't even look at it. :(

boeningc avatar Apr 04 '18 02:04 boeningc

Success! Thank you so much for the quick responses and pointers. :)

boeningc avatar Apr 04 '18 02:04 boeningc

@boeningc You are welcome!

mechatroner avatar Apr 04 '18 03:04 mechatroner

Hi all, I couldn't follow this thread exactly. I have a text file where columns are separated by one or more spaces (not tabs). Is it possible to use this type of file with rainbow_csv?

robertlugg avatar Apr 30 '18 22:04 robertlugg

@robertlugg No, it is not possible with the current version. But you can globally substitute runs of spaces with tabs in your file, e.g. s/  */\t/g (one space followed by zero or more spaces), and use the TSV syntax. If you want a permanent solution, you can also modify the TSV syntax file (replacing every \t with a pattern that matches spaces) and combine it with a modified package.json file; these two files alone give you your own mini-extension. I don't want to include new grammars in "Rainbow CSV" until the linked VSCode issue is resolved. That is because each variation creates a new language that pollutes the language selection menu, and I think all such use cases are pretty rare compared to CSV/TSV/CSV (semicolon). Also, some users need just simple whitespace-separated files, while others may want a different grammar where whitespace can be escaped with backslashes or double quotes.
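The space-to-tab workaround can also be scripted. A minimal sketch, assuming a hypothetical helper named `spacesToTabs` and assuming that no field contains a space of its own:

```javascript
// Hypothetical helper for illustration.
// Collapse every run of one or more spaces into a single tab so that the
// result can be highlighted with the TSV syntax. Existing tabs are untouched.
function spacesToTabs(text) {
    return text.replace(/ +/g, '\t');
}
```

Note that leading spaces on a line also become a tab, which produces an empty first column; trim the lines first if that matters.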

mechatroner avatar May 01 '18 02:05 mechatroner

Very keen to see pipe-delimiting for .dat files soon.

harvest316 avatar May 26 '18 02:05 harvest316

OK, I think it would make sense to add more grammars; there is no point in waiting for Microsoft/vscode#1800

First candidate is obviously "pipe-separated" files. I won't be able to associate it with any filetype, but it will still be available with manual selection. The only question is whether anyone needs "quoted" pipe separated syntax, where fields containing pipe characters can be enclosed in double-quotes to escape them?

Another two separators that I think could be relevant are colon and double-quote.

Also, I will probably implement csv and csv-semicolon grammars which don't allow quoted fields. This will make it possible to change the original csv and csv (semicolon) grammars so that lines with unbalanced double quotes are highlighted as "errors".

The mentioned multi-space separated files, which many *nix utilities produce as output, are definitely very relevant, but there is a technical issue that will complicate the implementation. So it will take time to make this.

Single space-separated files could be useful, but people can incorrectly assume that this grammar is for multi-space separated files.

So the plan is not to add all possible separators and escape rule combinations, but only those that are practical.

mechatroner avatar May 26 '18 02:05 mechatroner

In my experience, the most common pipe-delimited files are the .DAT files you get when uploading & downloading batch payment files to banks and payment gateways. They are never quoted, and generally come with a fairly irrelevant 1-2 line header (no column names) and a single-line footer that contains the number of rows and total of the dollar amounts in the file. Often the header and footer do not contain pipes, only the actual data rows have pipe delimiters.

harvest316 avatar May 26 '18 03:05 harvest316

@harvest316 Thanks, this is interesting! I don't want to add .DAT -> 'pipe' association on the extension level, but turns out there is a way to add this mapping manually through VSCode config: https://stackoverflow.com/a/36789145/2898283 So, I will just include this instruction into README.md

mechatroner avatar May 26 '18 03:05 mechatroner

Just stumbled into https://code.visualstudio.com/docs/extensionAPI/extension-points#_contributeslanguages and this leads me to a comfort enhancement request:

What about reading the firstLine property mentioned in the article, counting the number of commas and the number of semicolons there, and choosing CSV or CSV (semicolon delimited) as the language of the file according to whichever count is bigger? This can go wrong, for sure, but if it saves x% of language switching, it's worth the price.

One edge case: no header line and only floats with a comma as the decimal point, i.e. 1,1;2,2;3,3;... Such a line has an equal number of commas and semicolons, or even one comma more. My personal preference is to choose ;-delimited in this case.
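One possible reading of this proposal, as a minimal sketch. The function names are hypothetical, the language names `csv` and `csv (semicolon)` are the ones used earlier in this thread, and the rule "a tie or a single extra comma goes to semicolon" is an assumption drawn from the decimal-comma example above:

```javascript
// Hypothetical helpers for illustration.

// Count how many times a single character occurs in a string.
function countChar(text, ch) {
    return text.split(ch).length - 1;
}

// Guess the dialect from the first line only.
// Returns 'csv', 'csv (semicolon)', or null when neither character occurs.
function guessDialectFromFirstLine(firstLine) {
    const commas = countChar(firstLine, ',');
    const semicolons = countChar(firstLine, ';');
    if (commas === 0 && semicolons === 0) {
        return null;
    }
    // Assumption: on a tie, or with just one extra comma (decimal commas as
    // in "1,1;2,2;3,3"), prefer the semicolon dialect.
    if (semicolons > 0 && commas <= semicolons + 1) {
        return 'csv (semicolon)';
    }
    return 'csv';
}
```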

Thanks

Lercher avatar May 26 '18 10:05 Lercher

@Lercher I didn't know about this feature, but I think it will give too many false positives: a lot of non-CSV files can contain commas or semicolons in the first line. Also, I don't think it is right to measure the worth of this feature by the percentage of switching saved: switching back could be more emotionally expensive, since incorrect filetype detection would be very annoying. The right way to do content-based autodetection is to analyze the first 10 lines of a file; I can't imagine a situation where this would fail. I am sure that sooner or later VSCode will support this, but for now we will just have to use the manual selection mechanism.
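To make the idea concrete, here is a minimal sketch of a multi-line check. It is not the extension's actual algorithm: the function name `detectSeparator`, the candidate list, and the rule "every sampled line must split into the same number of fields, and more than one" are all assumptions, and quoted fields are ignored for brevity.

```javascript
// Hypothetical sketch; not the algorithm used by Rainbow CSV.
// Look at the first `maxLines` non-empty lines and return the first candidate
// separator that splits every one of them into the same number (> 1) of fields.
function detectSeparator(text, candidates = [',', ';', '\t', '|'], maxLines = 10) {
    const lines = text
        .split(/\r?\n/)
        .filter((line) => line.length > 0)
        .slice(0, maxLines);
    // Assumption: a single line is not enough evidence to decide.
    if (lines.length < 2) {
        return null;
    }
    for (const sep of candidates) {
        const fieldCounts = lines.map((line) => line.split(sep).length);
        const first = fieldCounts[0];
        if (first > 1 && fieldCounts.every((count) => count === first)) {
            return sep;
        }
    }
    return null;
}
```

Candidates are tried in order, so the first consistent separator wins. A file with decimal commas such as `a;b;c` / `1,1;2,2;3,3` is still detected as semicolon-separated, because the comma counts differ between lines.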

mechatroner avatar May 27 '18 04:05 mechatroner

If you say so.

However, I guess, if one of the counts is zero and the other one positive, then the method won't produce any false positives. IMHO this reduces switching business to non-existent for all files containing headers with names that are derived from identifiers of programming languages or DBMSs.
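A minimal sketch of this stricter variant, with hypothetical function names: it only commits to a dialect when exactly one of the two characters appears in the first line, and otherwise leaves the language untouched.

```javascript
// Hypothetical helpers for illustration.

// Count how many times a single character occurs in a string.
function countOccurrences(text, ch) {
    return text.split(ch).length - 1;
}

// Decide only when one count is zero and the other is positive;
// return null for every ambiguous first line.
function guessDialectStrict(firstLine) {
    const commas = countOccurrences(firstLine, ',');
    const semicolons = countOccurrences(firstLine, ';');
    if (commas > 0 && semicolons === 0) {
        return 'csv';
    }
    if (semicolons > 0 && commas === 0) {
        return 'csv (semicolon)';
    }
    return null;
}
```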

Lercher avatar May 27 '18 14:05 Lercher

I've published an updated version; the only change is that Rainbow CSV now supports the pipe | separator. I probably should have done it long ago, but better late than never, I suppose. The Readme doc file was also updated with a table of supported separators and instructions on how to create an extension -> separator association, which could be useful in some cases.

mechatroner avatar Jun 10 '18 04:06 mechatroner

Hello, I use Rainbow CSV and I really enjoy it ;) I have a question though: I often work simultaneously with various csv files, and they don't all use the same separator: some of them are semicolon separated, while others use pipes as separators. I've tried to modify VSCode's Rainbow CSV parameters, but it only seems to take in account one separator at a time. For instance, setting "*.csv": "CSV (semicolon,pipe)" did not work. Is there any way I can get those lovely colours on both types of csv file at a time?

GrisPetitDragon avatar Jun 29 '18 13:06 GrisPetitDragon

Hello, @GrisPetitDragon, thanks for the feedback! It will be possible once content-based autodetection is implemented. It is trivial to implement, but I need a VSCode API call to switch the language ID, and that call is currently missing. See the linked VS Code ticket.

mechatroner avatar Jul 01 '18 17:07 mechatroner

Good news: Microsoft/vscode#1800 is complete. I even took part in writing the API implementation :sunglasses: This makes it possible to add autodetection functionality and possibly more CSV dialects, since selecting them would be much more convenient.

mechatroner avatar Sep 21 '18 03:09 mechatroner

Thank you!!! :)

harvest316 avatar Sep 27 '18 22:09 harvest316

I've just published version 0.7.0, which has content-based separator autodetection logic. The new functionality will work only with VSCode 1.28; for older VSCode versions there should be no change in behavior.

mechatroner avatar Oct 13 '18 02:10 mechatroner

Thank you so much!

GrisPetitDragon avatar Oct 17 '18 12:10 GrisPetitDragon

@GrisPetitDragon You are welcome! Actually, there is an issue with the current implementation: separator autodetection will only work for "plaintext" files with an unassigned language, i.e. if a table file has a '.txt' or some unknown extension (e.g. '.unknown'), autodetection will work and switch it to "csv" or "csv (semicolon)" depending on its content. But it won't switch a ".csv" file to the semicolon language even if it is really a semicolon-separated file. I plan to fix this soon.
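The behavior described here can be pictured roughly as follows. This is a hedged sketch, not the extension's code: `autodetectAndSwitch` and the `detectLanguageId` callback are hypothetical names, and the `vscode` module is passed in as a parameter so the snippet can be read (and exercised with a stub) outside the editor. The only VSCode calls assumed are `document.languageId`, `document.getText()` and `vscode.languages.setTextDocumentLanguage`.

```javascript
// Hypothetical sketch; not taken from the extension's source.
// `vscodeApi` is the object returned by require('vscode') inside an extension.
// `detectLanguageId` is a caller-supplied function that maps file content to
// a language ID such as 'csv' or 'csv (semicolon)', or returns null.
async function autodetectAndSwitch(vscodeApi, document, detectLanguageId) {
    // Only files without an assigned language are examined, which is why
    // a ".csv" file with semicolons is not switched.
    if (document.languageId !== 'plaintext') {
        return null;
    }
    const languageId = detectLanguageId(document.getText());
    if (!languageId) {
        return null;
    }
    await vscodeApi.languages.setTextDocumentLanguage(document, languageId);
    return languageId;
}
```

Lifting the `plaintext` restriction for files already opened as "csv" would presumably be one way to address the limitation, at the cost of re-checking files whose language was chosen by extension.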

mechatroner avatar Oct 19 '18 02:10 mechatroner