
Request for a Reliable Transformer to Parse Large CSV Files in Browser Environment

Open master-maintenance1-peer-connect opened this issue 1 year ago • 5 comments

Summary

Please provide an implementation of a Transformer that can parse large CSV files in a browser environment.

Motivation

I tried to implement such a CsvParseTransformer myself. However, to handle large files that contain long values with embedded line breaks in a single column, which put pressure on the browser's memory, I had to use internal APIs that csv-parse does not export, as shown below. This raises concerns about breakage in future releases. In addition, I was unable to apply an AbortController to the CsvParseTransformer and could not implement purging of its internal buffer, and I found no csv-parse-supported way to interrupt the CsvTransformAPI via an AbortController.

import { transform as CsvTransformAPI } from '../../node_modules/csv-parse/lib/api/index.js'
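
What does seem expressible with standard Web APIs is refusing further chunks once an AbortSignal fires; what cannot be done from the outside is purging whatever csv-parse has already buffered. A minimal sketch, as a generic wrapper (the AbortableTransformer name is hypothetical, not part of csv-parse):

// Hypothetical wrapper: stops pushing chunks once the signal is aborted.
// It does NOT purge whatever the wrapped parser has already buffered internally.
export class AbortableTransformer<I, O> implements Transformer<I, O> {
    constructor(private inner: Transformer<I, O>, private signal: AbortSignal) { }
    transform(chunk: I, controller: TransformStreamDefaultController<O>) {
        if (this.signal.aborted) {
            // Surface the abort reason downstream instead of parsing further.
            controller.error(this.signal.reason ?? new DOMException("Aborted", "AbortError"));
            return;
        }
        return this.inner.transform?.(chunk, controller);
    }
    flush(controller: TransformStreamDefaultController<O>) {
        if (this.signal.aborted) return;
        return this.inner.flush?.(controller);
    }
}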

Alternative

Therefore, I would like you to provide a Transformer that is guaranteed to work continuously.

**CsvParseTransformer Draft**

import { transform as CsvTransformAPI } from '../../node_modules/csv-parse/lib/api/index.js'
import { CsvError } from '../../node_modules/csv-parse/lib/api/CsvError.js';
import type { Options as csvParseOptions, Info as csvParseInfo } from 'csv-parse';

export type ICsvBufferTransformData = { info: csvParseInfo, record: string[] };
export class CsvParseTransformer implements Transformer<Uint8Array, ICsvBufferTransformData>{
    csvTransform: { // @TODO: relies directly on the internal contract of csv-parse/lib/api transform.parse, which is risky
        parse: (nextBuf: Uint8Array | undefined, isEnd: boolean,
            pushFunc: (data: ICsvBufferTransformData) => void,
            closeFunc: () => void) => CsvError | Error | undefined
    };
    constructor(csvParseOption: csvParseOptions) {
        this.csvTransform = CsvTransformAPI(csvParseOption);
    }
    transform(chunk: Uint8Array, controller: TransformStreamDefaultController<ICsvBufferTransformData>) {
        // console.log("at CsvParseTransformer.transform", chunk)
        try {
            const err = this.csvTransform.parse(chunk, false, (data: ICsvBufferTransformData) => {
                // console.log("at CsvParseTransformer.transform parsed:", data)
                controller.enqueue(data);
            }, () => controller.terminate());
            if (err) {
                // console.error("ERROR at CsvParseTransformer transform continue csvTransform.parse", err);
                controller.error(err);
            }
        } catch (err) {
            console.error("ERROR at CsvParseTransformer transform continue csvTransform.parse catch", err);
            controller.error(err);
        }
    }
    flush(controller: TransformStreamDefaultController<ICsvBufferTransformData>) {
        try {
            const err = this.csvTransform.parse(undefined, true, (data: ICsvBufferTransformData) => {
                controller.enqueue(data)
            }, () => controller.terminate());
            if (err) {
                // console.error("ERROR at CsvParseTransformer transform last csvTransform.parse", err);
                controller.error(err);
            }
        } catch (err) {
            console.error("ERROR at CsvParseTransformer transform last csvTransform.parse catch", err);
            controller.error(err);
        }
    }
}
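
Before the full example further down, a stripped-down sketch of how the draft above is meant to be wired up (the logCsvRecords name, the option values, and the logging sink are illustrative only):

// Minimal pipeline: File -> bytes -> CsvParseTransformer -> console.
export async function logCsvRecords(file: File) {
    await file.stream()
        .pipeThrough(new TransformStream<Uint8Array, ICsvBufferTransformData>(
            new CsvParseTransformer({
                bom: true,
                relax_column_count: true,
                info: true
            })))
        .pipeTo(new WritableStream<ICsvBufferTransformData>({
            write(data) {
                // info.records is the running record count reported by csv-parse.
                console.log(data.info.records, data.record);
            }
        }));
}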

**Example usage of CsvParseTransformer**: it is convenient to be able to pipe data (from an <input type=file>) straight to a RESTful API.

export function csvFile_upload2_kintone(file: File,
    csvUpDownTaskConfig1: ICSVUpDownTaskConfig, csvHeaders: string[],
    on_record: (data: { info: csvParseInfo, record: any }, context) => { info: csvParseInfo, record: any },
    on_write: (recordCount: number) => void) {
    const readableStrategy_countQueuing = Kintone_API_records_limit;
    const writableStrategy_highWaterMark = Math.max((csvHeaders?.length || 10) * 255 * readableStrategy_countQueuing, CSV_File_buffer_limit_size);

    const abortController = new AbortController();
    const kintneSvWriter = new kintoneStreamFactory_uploadRecords(csvUpDownTaskConfig1, abortController, on_write);
    const csvFileStream = file.stream();
    return csvFileStream.pipeThrough<ICsvBufferTransformData>(
        new TransformStream(new CsvParseTransformer(({
            autoParseDate: false,
            delimiter: Kintone_SEPARATOR,
            encoding: "utf-8",
            bom: csvUpDownTaskConfig1.csvFileWithBOM,
            escape: csvUpDownTaskConfig1.csvStringEscape,
            trim: true,
            record_delimiter: Kintone_LINE_BREAK, //"\r\n",
            relax_column_count: true,
            relax_quotes: true,
            skip_empty_lines: true,
            max_record_size: writableStrategy_highWaterMark,
            on_record: on_record,
            from: 2,
            columns: false,
            info: true
        }))
            ,
            new ByteLengthQueuingStrategy({ highWaterMark: writableStrategy_highWaterMark }),
            new CountQueuingStrategy({ highWaterMark: readableStrategy_countQueuing })
        ))
        .pipeThrough<KintoneRecordForParameter>(new TransformStream(new csv2KintoneRecordsTransform(csvUpDownTaskConfig1)))
        .pipeThrough<KintoneRecordForParameter[]>(new TransformStream(new CsvRecordBufferingTransformer(Kintone_API_records_limit)))
        .pipeTo(kintneSvWriter.getWriter(), { signal: abortController.signal })
        .finally(() => {
            try {
                csvFileStream.getReader().cancel();
                console.log("at csvFile_upload2_kintone csvFileStream.getReader().cancel() sucess")
            } catch (_) { }
        })
}
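
For completeness, a sketch of the <input type=file> side that feeds this pipeline (the element id, the wiring function name, and the callbacks are placeholders, not from the real project):

// Hypothetical wiring: pick up the File from an <input type=file> and start the upload.
export function wireCsvUpload(csvUpDownTaskConfig1: ICSVUpDownTaskConfig, csvHeaders: string[]) {
    const fileInput = document.querySelector<HTMLInputElement>('#csv-file-input'); // placeholder id
    fileInput?.addEventListener('change', () => {
        const file = fileInput?.files?.[0];
        if (!file) return;
        csvFile_upload2_kintone(file, csvUpDownTaskConfig1, csvHeaders,
            (data) => data,                                     // on_record: pass records through unchanged
            (count) => console.log(`${count} records written`)  // on_write: report upload progress
        );
    });
}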

I think it would be a good idea to pass a max_buffer_size configuration parameter to the parser constructor so that we can better control how much memory the parser is allowed to use. I'm having a similar problem parsing CSV files in a worker on the edge, where there are hard memory limits.

PabloReszczynski avatar Nov 06 '23 15:11 PabloReszczynski

What do you mean by max_buffer_size? I don't find any reference to this parameter in the Node.js stream API. Any option passed to the parser is also passed to the underlying stream.

wdavidw avatar Nov 06 '23 18:11 wdavidw
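
For illustration, a minimal Node-side sketch of what "also passed to the underlying stream" would look like, if that comment is read literally (highWaterMark is a standard stream option rather than a csv-parse one, and the file name is made up):

import { parse } from 'csv-parse';
import { createReadStream } from 'node:fs';

// CSV options and stream options live in the same object; per the comment above,
// the options are also handed to the underlying Transform stream, so stream-level
// settings such as highWaterMark can be set here to bound internal buffering.
const parser = parse({
    bom: true,
    highWaterMark: 64 * 1024
});

createReadStream('large.csv').pipe(parser); // illustrative input file
parser.on('readable', () => {
    let record;
    while ((record = parser.read()) !== null) {
        console.log(record.length); // number of columns in this record
    }
});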

Hi,

I'd like to add a +1 to this as we're encountering a similar issue. We're trying to parse large input files in the browser via the ESM streaming API without excessive memory pressure, and we're hitting issues.

ermi-ltd avatar Apr 24 '24 14:04 ermi-ltd

Would you be able to share a reproducible script in JS (no TS)?

wdavidw avatar Apr 24 '24 15:04 wdavidw

@wdavidw - I'll work on something when I'm back at my desk for ya.

ermi-ltd avatar Apr 24 '24 15:04 ermi-ltd