Keeping track of byte count / character count?
Thank you for the amazing lib! Until now, I didn't appreciate the complexity of CSV parsing and you guys did a great job with it.
I have a project that requires parsing a CSV from a remote location (S3) in a short-lived lambda function. For large files, this means iterating through multiple lambda invocations, each one starting where the previous one left off. AWS's S3 SDK lets you start reading from a specific byte offset instead of from the beginning of the file every time. I've figured out the mechanics for that part, but I wanted to provide this info for context.
So I will need to be able to keep track of how many bytes have been read on each row so that I can know where to continue when I invoke the next lambda function.
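For reference, the ranged read itself looks roughly like this. This is a sketch only: it assumes the v3 @aws-sdk/client-s3 client, and the bucket, key, and offset names are mine.

```ts
// Sketch: resume an S3 read from a byte offset via a Range header and pipe it
// into fast-csv (assumes @aws-sdk/client-s3 v3; names below are illustrative).
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { Readable } from 'stream';
import * as csv from 'fast-csv';

const s3 = new S3Client({});

async function parseFromOffset(bucket: string, key: string, startByte: number) {
  const { Body } = await s3.send(
    new GetObjectCommand({
      Bucket: bucket,
      Key: key,
      // Read from startByte through the end of the object
      Range: `bytes=${startByte}-`,
    })
  );

  (Body as Readable)
    .pipe(csv.parse({ headers: true }))
    .on('data', row => console.log(row))
    .on('end', (rowCount: number) => console.log(`Parsed ${rowCount} rows`));
}
```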
First attempt: take the row provided by fast-csv's .on('data') callback and try to calculate the number of bytes it represents. The problem is that fast-csv strips out characters that still count toward the byte total (empty rows, line endings, quote and escape characters, etc.), so any count reconstructed from the parsed row comes up short.
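Roughly what that first attempt looked like (a sketch; sample.csv and the +1 fudge factor are illustrative, and this is exactly the part that undercounts):

```ts
// Sketch of the insufficient first attempt: re-serialize each parsed row and
// sum its byte length. Undercounts because quotes, escapes, line endings, and
// empty rows have already been stripped by the parser.
import { createReadStream } from 'fs';
import * as csv from 'fast-csv';

let approxBytes = 0;

createReadStream('sample.csv')
  .pipe(csv.parse({ headers: true }))
  .on('data', (row: Record<string, string>) => {
    // +1 for the newline; delimiters between fields are restored by join(',')
    approxBytes += Buffer.byteLength(Object.values(row).join(',')) + 1;
  })
  .on('end', () => console.log(`Approximate bytes consumed: ${approxBytes}`));
```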
Since this approach didn't work, I thought that it would be pretty easy to just write my own CSV parser and keep track of the bytes myself.
This actually worked really well at first. I successfully processed super large files this way.
That is, until I encountered the complexities of CSVs that have stray line endings inside fields. And on top of that, those fields are escaped with quotation marks.
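For anyone else hitting this, a record like the one below is what breaks naive newline-based counting: the second physical line is still part of the first record, so you can't just split on line endings and tally bytes per line.

```
id,comment
1,"first line
second line of the same field"
2,plain
```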
There is a ton of logic in packages/parse/src/parser that handles these special cases. I would strongly prefer to utilize fast-csv's existing implementation rather than try to rebuild it myself.
I'm guessing counting bytes/characters is not a feature you want to add since it is pretty unusual. But I wanted to ask if you had advice for creating a fork that supports byte/character counting. Where in packages/parse/src/parser would be the most sensible place to count bytes (or characters) and what would be the best way to expose it to the .on('data') pipe closure?
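For context, the closest I've gotten in userland is a byte-counting Transform in front of csv.parse (sketch below; the class and variable names are mine). It only tells me how many bytes the source has pushed into the parser, not the byte offset of the row currently being emitted, since the parser buffers ahead internally, which is exactly why a hook inside the parser would be nicer.

```ts
// Userland sketch: count bytes flowing into the parser with a Transform.
// Tracks total bytes read from the source, NOT the offset of the row
// currently being emitted.
import { createReadStream } from 'fs';
import { Transform, TransformCallback } from 'stream';
import * as csv from 'fast-csv';

class ByteCounter extends Transform {
  bytesSeen = 0;

  _transform(chunk: Buffer, _enc: BufferEncoding, cb: TransformCallback) {
    this.bytesSeen += chunk.length;
    cb(null, chunk);
  }
}

const counter = new ByteCounter();

createReadStream('sample.csv')
  .pipe(counter)
  .pipe(csv.parse({ headers: true }))
  .on('data', row => {
    // counter.bytesSeen is only an upper bound on this row's ending offset
    console.log(counter.bytesSeen, row);
  });
```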
Thanks in advance for any advice! If the solution is sensible, I can make a PR to see if other users would find value in a byte/character counter.
I would also like this feature. I have set up a row counter that the output pipes into before the final destination, and I skip ahead with skipRows, but that is O(N) in the number of rows. Skipping by byte offset would be constant time, which makes a big difference for my ~20m-row files.
I could implement this in userland as an intermediate step before fast-csv, but this requires a lot of weird glue, since skipping ahead in the files also means skipping the headers of the CSV files. So I'd have to have a separate step to get the headers and pass them in to fast-csv in a second iteration.
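The glue I mean looks roughly like this (a sketch; the naive header split and the assumption that the offset lands exactly on a row boundary are the weak points):

```ts
// The "weird glue" version possible today: read the header row in a first pass,
// then start a second stream at a byte offset and feed the headers in manually.
// resumeFromBytes must land exactly on a row boundary for this to work.
import { createReadStream } from 'fs';
import { createInterface } from 'readline';
import * as csv from 'fast-csv';

async function readHeaderLine(filename: string): Promise<string[]> {
  const rl = createInterface({ input: createReadStream(filename) });
  for await (const line of rl) {
    rl.close();
    return line.split(','); // naive split; breaks on quoted or escaped headers
  }
  return [];
}

async function resume(filename: string, resumeFromBytes: number) {
  const headers = await readHeaderLine(filename);
  createReadStream(filename, { start: resumeFromBytes })
    .pipe(csv.parse({ headers })) // headers array means the first row is data
    .on('data', row => console.log(row));
}
```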
So ideally, fast-csv would both emit a byte count and provide a convenient way to skip ahead while still picking up the CSV header, such as shown below.
```ts
import { createReadStream } from 'fs';
import * as csv from 'fast-csv';

const resumeFromBytes = 1000;
const filename = 'sample.csv';

createReadStream(filename, { start: resumeFromBytes })
  .pipe(
    csv.parse({
      // readHeaders would be a new helper that reads only the header row
      headers: await csv.readHeaders(createReadStream(filename)),
    })
  )
  /**
   * __metadata__ would include the byte offset used to resume from above
   */
  .on('data', row => console.log(row.__metadata__));
```