node-csv
node-csv copied to clipboard
Issues with ignore_last_delimiters
Describe the bug
ignore_last_delimiters: true seems to have some issues:
- If you don't use
relax_quotes: truethen it fails withCsvError$1: Invalid Closing Quote: got "," at line 4 instead of delimiter, record delimiter, trimable character (if activated) or comment - If you use
trim: trueorrtrim: truethen it fails withCsvError$1: Invalid Closing Quote: found non trimable byte after quote at line 4 - If you avoid the two issues above then the final string ends up with extra quotes and the final delimiter, e.g.
"'123123-12312312",(the double quotes are part of the string)
To Reproduce
const fs = require('fs');
const csv = require('csv');
const INPUT = `
Date, Type, Description, Value, Balance, Account Name, Account Number
11/02/2022,BAC,"'FOO",10.00,33432.80,"'REF","'123123-12312312",
`;
const stream = csv.parse(INPUT, {
skip_empty_lines: true,
ltrim: true,
rtrim: true,
columns: true,
relax_quotes: true,
ignore_last_delimiters: true,
});
Additional context
- node version: 16.14.0
- csv version: 6.0.5
Looking at your input, it seems like a combinaison of skip_empty_lines and relax_column_count will do, see my sample code:
const {parse} = require('csv/sync');
const input = `
Date, Type, Description, Value, Balance, Account Name, Account Number
11/02/2022,BAC,"'FOO",10.00,33432.80,"'REF","'123123-12312312",
`;
const output = parse(input, {
skip_empty_lines: true,
relax_column_count: true
});
Thanks, that's a better workaround as it avoids issue (3).
Wait, I don't think this should be closed, it's just a workaround. As documented, ignore_last_delimiters is the correct solution and it's still bugged in various ways.
I confirm that we don't handle yet quotes with ignore_last_delimiters and it generates an unexpected error. Looking at how to implement this, there are divergent ways to handle it, does a,"b",c,"d" leads to ["a","b,c,d"] or ["a","b\",c,\"d"]. I am not yet confident which one is the best, here are my thoughts:
- First one is already handled by
relax_column_count, that what it is made for. Agreed that it generates multiple fields and not one,["a","b","c","d"]instead of["a","b,c,d"]but this is easily handled post-parsing. - The
ignore_last_delimitersis here to support SSA, no idea how/if SSA support quotes but if it does, we shall probably handle quotes the same way. However, I haven't find a tool to edit/generate SSA files yet.
The confusion/complication here seems to be trying to use ignore_last_delimiters to support this SSA subtitles case. If I say "ignore last delimiters", I expect them to be ignored, not included as part of the last column.
I would strongly suggest that this behaviour is moved to a separate option, or as a string option to ignore_last_delimiters like this:
ignore_last_delimiters: false(default behaviour)ignore_last_delimiters: true(ignore any trailing delimiters beyond the expected column count)ignore_last_delimiters: 'include-in-last-column'
Additionally, I should point out that relax_column_count is not desirable for this case - I want a strict column count as regards to data, but I just want to ignore any trailing delimiters. If there is data past the trailing delimiter then it's not a trailing delimiter it's an extra column and I'm expecting an invalid column count error.
Thinking about this a bit further, I would suggest deprecating ignore_last_delimiters and creating these two options:
ignore_trailing_delimiters(to strip off trailing delimiters)merge_extra_columns(to support the SSA use case)
Speaking about ignore_trailing_delimiters:
- what you are suggesting is
a,"b",c,"d"yields to["a","b,c,d"], right ? - If so, I am concerned about how to handle
a,"b""c","d,e". The first field contains escaped quotes, the second one contains a field delimiter.
No that's exactly what ignore_trailing_delimiters shouldn't do. Trailing delimeters are trailing delimiters only. a,"b",c,"d" doesn't have any trailing delimiters because all delimiters are followed by other characters. Trailing delimiters means a,b,c,d, or a,b,c,d,,,,,.
This option should never affect the parsing/escaping of quotes and it should never cause data to be ignored. It only affects how many empty string values are produced by trailing commas at the end of a row.
a,"b""c","d,e" is ['a','b"c','d,e'] and should always be so unless you've changed the quote characters or disabled escaped quotes. In my opinion there shouldn't be any other options that change this behaviour. It's confusing, it's ambiguous and I sure hope nobody ever needs or wants this - even the SSA parsing can be done in a much simpler and safer way than this.
To be clear, this is what ignore_trailing_delimiters should do:
Example 1
// input
A,B,C,D
a,b,c,d,
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }
Example 2 (any number of extra trailing delimiters are ignored)
// input
A,B,C,D
a,b,c,d,,,,,,,
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }
Example 3 (ignore trailing delimiters in the column definition)
// input
A,B,C,D,
a,b,c,d
// output
{ A: 'a', B: 'b', C: 'c', D: 'd' }
Example 4 (too many columns should still be an error)
// input
A,B,C,D
a,b,c,d,e
// output
Error: Row 1: Expected 4 columns, got 5
Example 5 (keep trailing delimiters up to the expected number of columns)
// input
A,B,C,D
a,b,c,
// output
{ A: 'a', B: 'b', C: 'c', D: '' }
Example 6 (too few columns should still be an error)
// input
A,B,C,D
a,b,c
// output
Error: Row 1: Expected 4 columns, got 3
OK, this is much more clear.