
Consider skipping first (metadata) row during autodetection if the number of fields in it is inconsistent with other rows.

Open serkonda7 opened this issue 2 years ago • 2 comments

Take, for example, the following CSV that Microsoft uses:

Machine Vulnerabilities Export,28 Feb 2023 11:42 AM +00:00
Severity,CVSS v3,Age (days),Has Exploit,Has Known Threats,Has Associated Alerts,Related Software,Description
Medium,6.5,14,False,False,False,mozilla:firefox;mozilla:firefox_esr;mozilla:thunderbird;suse:mozillafirefox;suse:mozillafirefox-devel;suse:mozillafirefox-translations-common;suse:mozillafirefox-translations-other;suse:mozillathunderbird;suse:mozillathunderbird-translations-common;suse:mozillathunderbird-translations-other;suse:libmozjs-102-0;suse:mozjs102;suse:mozjs102-devel;suse:mozillafirefox-branding-upstream,"This vulnerability affects the following vendors: Mozilla, Suse. To view more details about this vulnerability please visit the vendor website."
High,8.8,14,,False,False,False,ubuntu:firefox;mozilla:firefox,"This vulnerability affects the following vendors: Mozilla, Ubuntu. To view more details about this vulnerability please visit the vendor website."

The actual separator is ",", but ";" is used instead, although it only occurs inside cell values here.

Maybe look in the first row to find the correct separator.

serkonda7 avatar Mar 01 '23 14:03 serkonda7

The autodetection algorithm works this way:

  1. It tries to find a separator that would provide a consistent number of fields in each line. This attempt fails for this file with the "," separator because the very first line has 2 fields, while other rows have 8 fields.
  2. If step 1 fails and the extension is not ".csv", the autodetection algorithm just exits. Otherwise, it tries to find the best separator by frequency, and here the ";" separator is probably more frequent than ",".
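
As a rough illustration of the two steps above, here is a minimal Python sketch. It is not the extension's actual code: the candidate list, the function names, and the use of Python's `csv` module for quote-aware field counting are all assumptions made for the example.

```python
import csv

# Hypothetical candidate list; the real extension's list may differ.
CANDIDATES = [",", ";", "\t", "|"]


def field_counts(lines, sep):
    """Number of fields per line when parsed with `sep` (quote-aware)."""
    return [len(row) for row in csv.reader(lines, delimiter=sep)]


def consistent_separator(lines, candidates=CANDIDATES):
    """Step 1: a separator that yields the same field count (> 1) on every line."""
    for sep in candidates:
        counts = field_counts(lines, sep)
        if counts and counts[0] > 1 and len(set(counts)) == 1:
            return sep
    return None


def most_frequent_separator(lines, candidates=CANDIDATES):
    """Step 2 fallback: the candidate that occurs most often in the text."""
    text = "\n".join(lines)
    return max(candidates, key=text.count)


def autodetect(lines, filename):
    sep = consistent_separator(lines)
    if sep is not None:
        return sep
    # Step 2: give up unless the file has a ".csv" extension.
    if not filename.lower().endswith(".csv"):
        return None
    return most_frequent_separator(lines)
```

With a metadata row on top and ";" appearing inside cell values, step 1 finds nothing consistent and the frequency fallback can pick ";" for a ".csv" file, which is the behaviour described in this issue.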

I guess one way to improve the algorithm would be to introduce a step "1.5" which retries step 1 but without the first line, since some CSV files contain meta information in it.
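
A minimal Python sketch of what such a step "1.5" could look like is below. It is only an illustration under stated assumptions: the function names, the candidate list, the requirement of at least two remaining lines, and the use of Python's `csv` module are all hypothetical and not taken from the extension.

```python
import csv


def is_consistent(lines, sep):
    """True if every line has the same number of fields (> 1) with `sep`."""
    counts = [len(row) for row in csv.reader(lines, delimiter=sep)]
    return bool(counts) and counts[0] > 1 and len(set(counts)) == 1


def detect_with_retry(lines, candidates=(",", ";", "\t", "|")):
    """Return (separator, skipped_first_line), or (None, False) if nothing fits."""
    # Step 1: look for a consistent separator over all lines.
    for sep in candidates:
        if is_consistent(lines, sep):
            return sep, False
    # Step 1.5: retry without the first line, which may be a metadata row.
    rest = lines[1:]
    if len(rest) >= 2:  # assumption: need at least two lines to compare
        for sep in candidates:
            if is_consistent(rest, sep):
                return sep, True
    return None, False


# Shortened, made-up sample shaped like the export in this issue.
sample = [
    "Machine Vulnerabilities Export,28 Feb 2023",
    "Severity,Age,Related Software,Description",
    'Medium,14,a;b;c,"Vendors: Mozilla, Suse."',
    'High,14,d;e,"Vendors: Mozilla, Ubuntu."',
]
print(detect_with_retry(sample))
```

Returning a flag alongside the separator lets the caller know that the first line was treated as metadata, which might be useful when deciding which row to show as the header.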

mechatroner avatar Mar 02 '23 23:03 mechatroner

Thank you for this detailed explanation! Your proposed solution seems suitable.

serkonda7 avatar Mar 03 '23 07:03 serkonda7