
CSV_INVALID_CLOSING_QUOTE in large datasets

Open sehrgut opened this issue 5 years ago • 8 comments

I'm using node-csv to pluck data from large CSV files. On extremely large datasets (>500MB), I'm getting the error below. I'm fairly sure it's related to incorrect handling of a buffer boundary, because the location of the error shifts if I change the preceding data. For instance, this particular occurrence disappeared when I shifted the 100,000-line window by one line (using head to pipe gunzip output to node).

It doesn't seem to be strictly size-based, since I have been able to parse up to 250MB before seeing this error, yet I can reproduce it with subsets as small as 30MB that exhibit the behaviour. That's why I think it's likely due to a buffer boundary falling within a quoted field.

All data in one of these subsets is parseable if divided into small batches, so I'm certain it's not a syntax error.

Unfortunately, the dataset consists of logs that can't be shared for security reasons, and I haven't had a chance to construct a random dataset exhibiting this behaviour yet.
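For what it's worth, the kind of synthetic repro I have in mind would look roughly like this (made-up data, not from the real logs; assumes csv-parse v5's named export, while older releases export the parse function directly): split a quoted field across two writes so a buffer boundary falls inside the quotes, and check that the parser still reassembles it.

import { parse } from 'csv-parse';

const parser = parse();
parser.on('readable', () => {
  let record;
  while ((record = parser.read()) !== null) console.log(record);
});
parser.on('error', (err) => console.error(err.message));

// The quoted value is deliberately split mid-field, simulating a buffer
// boundary landing inside a quoted column.
parser.write('a,"hello, ""quo');
parser.write('ted"" world",b\n');
parser.end();

I haven't confirmed that a case this small actually triggers it; if the boundary handling were at fault, though, this is where I'd expect the error to surface.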

events.js:292
      throw er; // Unhandled 'error' event
      ^

CsvError: Invalid Closing Quote: got " " at line 90262 instead of delimiter, row delimiter, trimable character (if activated) or comment
    at Parser.__parse (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/csv-parse/lib/index.js:529:17)
    at Parser._transform (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/csv-parse/lib/index.js:403:22)
    at Parser.Transform._read (_stream_transform.js:191:10)
    at Parser.Transform._write (_stream_transform.js:179:12)
    at doWrite (_stream_writable.js:403:12)
    at writeOrBuffer (_stream_writable.js:387:5)
    at Parser.Writable.write (_stream_writable.js:318:11)
    at /Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:640:33
    at Stream.s._send (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:1560:9)
    at Stream.write (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:1661:18)
Emitted 'error' event on Stream instance at:
    at Stream._send (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:998:18)
    at push (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:1526:19)
    at /Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:2212:13
    at Stream.s._send (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:1560:9)
    at Stream.write (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:1658:18)
    at Stream._send (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:984:26)
    at push (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:1526:19)
    at /Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:2212:13
    at Stream.s._send (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:1560:9)
    at Stream.write (/Users/USERNAME/Documents/src/tools/mobile-logspam/node_modules/highland/lib/index.js:1658:18) {
  code: 'CSV_INVALID_CLOSING_QUOTE',
  column: 'log',
  empty_lines: 0,
  header: false,
  index: 6,
  invalid_field_length: 0,
  quoting: false,
  lines: 90262,
  records: 90260
}

sehrgut avatar Jun 23 '20 20:06 sehrgut

It could be due to an index boundary, but we will need to reproduce it. I personally use and test the library with multi-gigabyte files, and we have never had a single report in the last 10 years about index boundaries being corrupted. We can't fix a problem we cannot reproduce: unless you provide us with a sample reproducing the issue, we won't be able to validate it, let alone fix it. Can you confirm that your log files don't contain any quote characters?

wdavidw avatar Jun 24 '20 10:06 wdavidw

I'll see about sanitizing the dataset for internal privacy and try building a repro.


sehrgut avatar Jun 24 '20 17:06 sehrgut

I'm experiencing a similar problem. The CSV file size is 3.5 GB+. My observation so far: the issue seems to happen when:

  • the parallel setting is set to any value other than 1
  • the CSV content is sparse, i.e. the columns of each row contain arbitrary values

The same ETL pipeline (CSV -> Transform -> DB) works fine with large CSV files (3 GB+) when the data in each row is "similar".

Sometimes CSV_RECORD_DONT_MATCH_COLUMNS_LENGTH is thrown for no apparent reason.

Here's the code example:

  // parse, transform and from() presumably come from 'csv-parse',
  // 'stream-transform' and 'pg-copy-streams'; delimiter, parallelism,
  // connection, TABLE_NAME and fileReadableStream are defined in the
  // surrounding (omitted) code.
  const parser = parse({ delimiter, columns: true, trim: true });
  const transformer = transform(
    { parallel: parallelism },
    (
      { id: referenceId, ...rest }: any,
      callback: (err?: null | Error, record?: any) => void,
    ) =>
      setImmediate(() => {
        try {
          // Build the line handed to COPY FROM STDIN.
          callback(
            undefined,
            `... transformed string ...`
          );
        } catch (e) {
          callback(e);
        }
      }),
  );
  // Stream the transformed rows straight into Postgres via COPY.
  const dbStream = connection.query(
    from(`COPY "${TABLE_NAME}" FROM STDIN DELIMITER '${delimiter}' CSV;`),
  );

  return new Promise((resolve, reject) => {
    // Tear the pipeline down and reject on the first error from any stage.
    const endStreaming = (e: Error) => {
      parser.end();
      transformer.end();
      dbStream.end();
      reject(e);
    };

    fileReadableStream.on('error', endStreaming);
    parser.on('error', endStreaming);
    transformer.on('error', endStreaming);
    dbStream.on('error', endStreaming).on('finish', resolve);

    fileReadableStream
      .pipe(parser)
      .pipe(transformer)
      .pipe(dbStream);
  });

moliinyk avatar Sep 08 '20 11:09 moliinyk

In the end, I managed to resolve the issue on my side.

The first step was to add quote: '':

const parser = parse({ delimiter, columns: true, trim: true, quote: '' });
...

The second step was to modify the data itself. Here is an approximate example:

Before:

id;example;column;names
1;"{ ""values"": [""a"", ""b"", ""c""] }";foo;bar
...

After:

id;example;column;names
1;{"values": ["a", "b", "c"] };foo;bar
...

In my example, I've been using a semicolon (;) as the delimiter.

I had ~500 quotation marks in a single row and 1.7M rows in the CSV file.

1;"{ ""values"": ... is also correct but as I mentioned earlier, I managed to consistently parse up to ~190Mb of data before CSV_INVALID_CLOSING_QUOTE was thrown.

Note: the issue exists in the latest version of the csv-parse library.

moliinyk avatar Sep 09 '20 10:09 moliinyk

Do you have any idea what the issue is?

wdavidw avatar Sep 09 '20 16:09 wdavidw

Do you have any idea what the issue is?

Unfortunately, I have no idea as to why it might be happening.

I'm not familiar with the codebase; however, I'm trying to read the code occasionally. As soon as I get closer to the root cause, I'll report back.

moliinyk avatar Sep 14 '20 12:09 moliinyk

UPDATE: I'm uploading this file from a web form and parsing it as a read stream. Something in that process seems to be causing the error: if I parse the file directly from the filesystem, it works. So this is potentially not related to the above.
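For reference, the direct-from-filesystem variant that works is essentially this (the path is a placeholder):

import { createReadStream } from 'fs';
import { parse } from 'csv-parse';

const parser = parse({
  delimiter: ',',
  columns: false,
  skip_empty_lines: true,
  trim: true,
});
parser.on('readable', () => {
  let record;
  while ((record = parser.read()) !== null) {
    // ... handle the record ...
  }
});
parser.on('error', (err) => console.error(err));

createReadStream('./tubs.csv').pipe(parser);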

I'm getting a similar error with a much smaller file (will attach). I'm getting:

Invalid Closing Quote: got "C" at line 358 instead of delimiter, record delimiter, trimable character (if activated) or comment

If I remove that line, I get the same error. If I remove the line before or after it, I get the same error (with the same line number, 358). This makes me think the problem is a missing quote much earlier in the file, and that csv-parse has some kind of "stop reading a column if it's longer than x characters" limit that just happens to trip at that point in the file.

LibreOffice has no problem reading the file.

Calling with:

const parser = parse({
  delimiter: ',',
  columns: false,
  skip_empty_lines: true,
  trim: true,
});

Full error:

CsvError: Invalid Closing Quote: got "C" at line 358 instead of delimiter, record delimiter, trimable character (if activated) or comment
    at Parser.__parse (file:///path/to/node_modules/csv-parse/lib/index.js:707:17)
    at Parser._transform (file:///path/to/node_modules/csv-parse/lib/index.js:576:22)
    at Parser.Transform._write (node:internal/streams/transform:184:23)
    at writeOrBuffer (node:internal/streams/writable:389:12)
    at _write (node:internal/streams/writable:330:10)
    at Parser.Writable.write (node:internal/streams/writable:334:10)
    at ReadStream.ondata (node:internal/streams/readable:754:22)
    at ReadStream.emit (node:events:390:28)
    at ReadStream.emit (node:domain:475:12)
    at addChunk (node:internal/streams/readable:315:12) {
  code: 'CSV_INVALID_CLOSING_QUOTE',
  bytes: 65476,
  comment_lines: 0,
  empty_lines: 0,
  invalid_field_length: 0,
  lines: 358,
  records: 99,
  columns: false,
  error: undefined,
  header: false,
  index: 9,
  raw: undefined,
  column: 9,
  quoting: false
}

Here's the file: tubs.csv

A few things to note about the file:

  • It escapes quotes with "".
  • Some column values span multiple lines (an illustrative row combining both traits follows below).
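For illustration only (this is not the actual file content), a record combining both traits looks like:

id,name,notes
12,"Acme ""Deluxe"" Tub","first line of the notes field
second line, still the same quoted value"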

cullylarson avatar Apr 02 '22 00:04 cullylarson

@cullylarson I imported your code and it was working; not much I can do from there.

wdavidw avatar Apr 04 '22 07:04 wdavidw