node-csv Issues with ignore_last

trafficstars

Describe the bug

ignore_last_delimiters: true seems to have some issues:

If you don't use relax_quotes: true then it fails with CsvError$1: Invalid Closing Quote: got "," at line 4 instead of delimiter, record delimiter, trimable character (if activated) or comment
If you use trim: true or rtrim: true then it fails with CsvError$1: Invalid Closing Quote: found non trimable byte after quote at line 4
If you avoid the two issues above then the final string ends up with extra quotes and the final delimiter, e.g. "'123123-12312312", (the double quotes are part of the string)

To Reproduce

const fs = require('fs');
const csv = require('csv');

const INPUT = `
Date, Type, Description, Value, Balance, Account Name, Account Number

11/02/2022,BAC,"'FOO",10.00,33432.80,"'REF","'123123-12312312",
`;

const stream = csv.parse(INPUT, {
    skip_empty_lines: true,
    ltrim: true,
    rtrim: true,
    columns: true,
    relax_quotes: true,
    ignore_last_delimiters: true,
});

Additional context

node version: 16.14.0
csv version: 6.0.5

Mar 22 '22 22:03 SystemParadox

Looking at your input, it seems like a combinaison of skip_empty_lines and relax_column_count will do, see my sample code:

const {parse} = require('csv/sync');

const input = `
Date, Type, Description, Value, Balance, Account Name, Account Number

11/02/2022,BAC,"'FOO",10.00,33432.80,"'REF","'123123-12312312",
`;

const output = parse(input, {
  skip_empty_lines: true,
  relax_column_count: true
});

Mar 23 '22 08:03 wdavidw

Thanks, that's a better workaround as it avoids issue (3).

Mar 23 '22 13:03 SystemParadox

Wait, I don't think this should be closed, it's just a workaround. As documented, ignore_last_delimiters is the correct solution and it's still bugged in various ways.

Mar 23 '22 23:03 SystemParadox

I confirm that we don't handle yet quotes with ignore_last_delimiters and it generates an unexpected error. Looking at how to implement this, there are divergent ways to handle it, does a,"b",c,"d" leads to ["a","b,c,d"] or ["a","b\",c,\"d"]. I am not yet confident which one is the best, here are my thoughts:

First one is already handled by relax_column_count, that what it is made for. Agreed that it generates multiple fields and not one, ["a","b","c","d"] instead of ["a","b,c,d"] but this is easily handled post-parsing.
The ignore_last_delimiters is here to support SSA, no idea how/if SSA support quotes but if it does, we shall probably handle quotes the same way. However, I haven't find a tool to edit/generate SSA files yet.

Apr 05 '22 07:04 wdavidw

The confusion/complication here seems to be trying to use ignore_last_delimiters to support this SSA subtitles case. If I say "ignore last delimiters", I expect them to be ignored, not included as part of the last column.

I would strongly suggest that this behaviour is moved to a separate option, or as a string option to ignore_last_delimiters like this:

ignore_last_delimiters: false (default behaviour)
ignore_last_delimiters: true (ignore any trailing delimiters beyond the expected column count)
ignore_last_delimiters: 'include-in-last-column'

Additionally, I should point out that relax_column_count is not desirable for this case - I want a strict column count as regards to data, but I just want to ignore any trailing delimiters. If there is data past the trailing delimiter then it's not a trailing delimiter it's an extra column and I'm expecting an invalid column count error.

Apr 05 '22 17:04 SystemParadox

Thinking about this a bit further, I would suggest deprecating ignore_last_delimiters and creating these two options:

ignore_trailing_delimiters (to strip off trailing delimiters)
merge_extra_columns (to support the SSA use case)

Apr 05 '22 17:04 SystemParadox

Speaking about ignore_trailing_delimiters:

what you are suggesting is a,"b",c,"d" yields to ["a","b,c,d"], right ?
If so, I am concerned about how to handle a,"b""c","d,e". The first field contains escaped quotes, the second one contains a field delimiter.

Apr 05 '22 20:04 wdavidw

No that's exactly what ignore_trailing_delimiters shouldn't do. Trailing delimeters are trailing delimiters only. a,"b",c,"d" doesn't have any trailing delimiters because all delimiters are followed by other characters. Trailing delimiters means a,b,c,d, or a,b,c,d,,,,,.

This option should never affect the parsing/escaping of quotes and it should never cause data to be ignored. It only affects how many empty string values are produced by trailing commas at the end of a row.

a,"b""c","d,e" is ['a','b"c','d,e'] and should always be so unless you've changed the quote characters or disabled escaped quotes. In my opinion there shouldn't be any other options that change this behaviour. It's confusing, it's ambiguous and I sure hope nobody ever needs or wants this - even the SSA parsing can be done in a much simpler and safer way than this.

To be clear, this is what ignore_trailing_delimiters should do:

Example 1

// input
A,B,C,D
a,b,c,d,
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }

Example 2 (any number of extra trailing delimiters are ignored)

// input
A,B,C,D
a,b,c,d,,,,,,,
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }

Example 3 (ignore trailing delimiters in the column definition)

// input
A,B,C,D,
a,b,c,d
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }

Example 4 (too many columns should still be an error)

// input
A,B,C,D
a,b,c,d,e
// output
Error: Row 1: Expected 4 columns, got 5

Example 5 (keep trailing delimiters up to the expected number of columns)

// input
A,B,C,D
a,b,c,
// output
{ A: 'a', B: 'b', C: 'c', D: '' }

Example 6 (too few columns should still be an error)

// input
A,B,C,D
a,b,c
// output
Error: Row 1: Expected 4 columns, got 3

Apr 05 '22 20:04 SystemParadox

OK, this is much more clear.

Apr 06 '22 07:04 wdavidw

node-csv
node-csv copied to clipboard

Issues with ignore_last_delimiters

Example 1

Example 2 (any number of extra trailing delimiters are ignored)

Example 3 (ignore trailing delimiters in the column definition)

Example 4 (too many columns should still be an error)

Example 5 (keep trailing delimiters up to the expected number of columns)

Example 6 (too few columns should still be an error)

node-csv node-csv copied to clipboard

Issues with ignore_last_delimiters

Example 1

Example 2 (any number of extra trailing delimiters are ignored)

Example 3 (ignore trailing delimiters in the column definition)

Example 4 (too many columns should still be an error)

Example 5 (keep trailing delimiters up to the expected number of columns)

Example 6 (too few columns should still be an error)

node-csv
node-csv copied to clipboard