node-csv
Request for a Reliable Transformer to Parse Large CSV Files in Browser Environment
Summary
Please provide an implementation of a Transformer that can parse large CSV files in a browser environment.
Motivation
I tried to implement such a CsvParseTransformer (draft below). However, to handle large files that contain long values with line breaks inside a single column, and that put pressure on the browser's memory, I had to use internal APIs that csv-parse does not export, as shown below. This raises concerns about whether the code will keep working in future releases. In addition, I was unable to apply an AbortController to the CsvParseTransformer, so I could not purge its internal buffer, and I could not implement a csv-parse-compliant way of interrupting the CsvTransformAPI with an AbortController.
```ts
import { transform as CsvTransformAPI } from '../../node_modules/csv-parse/lib/api/index.js'
```
Alternative
Therefore, I would like you to provide a Transformer that is guaranteed to keep working in future releases, without relying on internal APIs.
**CsvParseTransformer Draft**
```ts
import { transform as CsvTransformAPI } from '../../node_modules/csv-parse/lib/api/index.js';
import { CsvError } from '../../node_modules/csv-parse/lib/api/CsvError.js';
import type { Options as csvParseOptions, Info as csvParseInfo } from 'csv-parse';

export type ICsvBufferTransformData = { info: csvParseInfo, record: string[] };

export class CsvParseTransformer implements Transformer<Uint8Array, ICsvBufferTransformData> {
  csvTransform: { // @TODO: uses the internal behaviour of csv-parse/lib/api transform.parse directly, which is risky
    parse: (nextBuf: Uint8Array | undefined, isEnd: boolean,
      pushFunc: (data: ICsvBufferTransformData) => void,
      closeFunc: () => void) => CsvError | Error | undefined
  };

  constructor(csvParseOption: csvParseOptions) {
    this.csvTransform = CsvTransformAPI(csvParseOption);
  }

  transform(chunk: Uint8Array, controller: TransformStreamDefaultController<ICsvBufferTransformData>) {
    // console.log("at CsvParseTransformer.transform", chunk)
    try {
      const err = this.csvTransform.parse(chunk, false, (data: ICsvBufferTransformData) => {
        // console.log("at CsvParseTransformer.transform parsed:", data)
        controller.enqueue(data);
      }, () => controller.terminate());
      if (err) {
        // console.error("ERROR at CsvParseTransformer transform continue csvTransform.parse", err);
        controller.error(err);
      }
    } catch (err) {
      console.error("ERROR at CsvParseTransformer transform continue csvTransform.parse catch", err);
      controller.error(err);
    }
  }

  flush(controller: TransformStreamDefaultController<ICsvBufferTransformData>) {
    try {
      const err = this.csvTransform.parse(undefined, true, (data: ICsvBufferTransformData) => {
        controller.enqueue(data);
      }, () => controller.terminate());
      if (err) {
        // console.error("ERROR at CsvParseTransformer transform last csvTransform.parse", err);
        controller.error(err);
      }
    } catch (err) {
      console.error("ERROR at CsvParseTransformer transform last csvTransform.parse catch", err);
      controller.error(err);
    }
  }
}
```
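Regarding the AbortController concern, the best I could sketch is a check outside csv-parse itself. Below is a minimal sketch, assuming an AbortSignal is handed to the transformer; the `AbortableCsvParseTransformer` subclass and the signal check are my own assumptions, not csv-parse APIs, and they only stop feeding the parser rather than purging its internal buffer:

```ts
// Minimal sketch: builds on the CsvParseTransformer draft above.
// "AbortableCsvParseTransformer" is a hypothetical name, not part of csv-parse.
export class AbortableCsvParseTransformer extends CsvParseTransformer {
  constructor(csvParseOption: csvParseOptions, private readonly signal: AbortSignal) {
    super(csvParseOption);
  }

  transform(chunk: Uint8Array, controller: TransformStreamDefaultController<ICsvBufferTransformData>) {
    if (this.signal.aborted) {
      // Fail the stream; pipeThrough()/pipeTo() then tear down the rest of the pipeline.
      // Note: this does not purge csv-parse's internal buffer, it only stops feeding it.
      controller.error(this.signal.reason ?? new DOMException("Aborted", "AbortError"));
      return;
    }
    super.transform(chunk, controller);
  }
}
```

It could then be wrapped as `new TransformStream(new AbortableCsvParseTransformer(options, abortController.signal))`, with the same AbortController also passed to pipeTo() as in the example below.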
**Example usage of CsvParseTransformer.** It is convenient to be able to transfer data (from an `<input type=file>`) to a RESTful API using pipelining.
```ts
export function csvFile_upload2_kintone(file: File,
  csvUpDownTaskConfig1: ICSVUpDownTaskConfig, csvHeaders: string[],
  on_record: (data: { info: csvParseInfo, record: any }, context) => { info: csvParseInfo, record: any },
  on_write: (recordCount: number) => void) {

  const readableStrategy_countQueuing = Kintone_API_records_limit;
  const writableStrategy_highWaterMark = Math.max((csvHeaders?.length || 10) * 255 * readableStrategy_countQueuing, CSV_File_buffer_limit_size);
  const abortController = new AbortController();
  const kintoneSvWriter = new kintoneStreamFactory_uploadRecords(csvUpDownTaskConfig1, abortController, on_write);
  const csvFileStream = file.stream();

  return csvFileStream.pipeThrough<ICsvBufferTransformData>(
    new TransformStream(
      new CsvParseTransformer({
        autoParseDate: false,
        delimiter: Kintone_SEPARATOR,
        encoding: "utf-8",
        bom: csvUpDownTaskConfig1.csvFileWithBOM,
        escape: csvUpDownTaskConfig1.csvStringEscape,
        trim: true,
        record_delimiter: Kintone_LINE_BREAK, // "\r\n"
        relax_column_count: true,
        relax_quotes: true,
        skip_empty_lines: true,
        max_record_size: writableStrategy_highWaterMark,
        on_record: on_record,
        from: 2,
        columns: false,
        info: true
      }),
      new ByteLengthQueuingStrategy({ highWaterMark: writableStrategy_highWaterMark }),
      new CountQueuingStrategy({ highWaterMark: readableStrategy_countQueuing })
    ))
    .pipeThrough<KintoneRecordForParameter>(new TransformStream(new csv2KintoneRecordsTransform(csvUpDownTaskConfig1)))
    .pipeThrough<KintoneRecordForParameter[]>(new TransformStream(new CsvRecordBufferingTransformer(Kintone_API_records_limit)))
    .pipeTo(kintoneSvWriter.getWriter(), { signal: abortController.signal })
    .finally(() => {
      try {
        csvFileStream.getReader().cancel();
        console.log("at csvFile_upload2_kintone csvFileStream.getReader().cancel() success");
      } catch (_) { }
    });
}
```
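For context, a minimal sketch of how this function might be driven from a file picker; the element id, the declared config/header values, and the callbacks are illustrative placeholders, not part of the function above:

```ts
// Hypothetical wiring of csvFile_upload2_kintone to an <input type="file"> element.
declare const csvUpDownTaskConfig1: ICSVUpDownTaskConfig;  // prepared elsewhere in the app
declare const csvHeaders: string[];                        // expected CSV header names

const fileInput = document.querySelector<HTMLInputElement>('#csv-file-input');
fileInput?.addEventListener('change', async () => {
  const file = fileInput.files?.[0];
  if (!file) return;
  try {
    await csvFile_upload2_kintone(
      file,
      csvUpDownTaskConfig1,
      csvHeaders,
      (data, _context) => data,                                        // on_record: pass records through
      (recordCount) => console.log(`${recordCount} records written`)   // on_write: report progress
    );
    console.log("upload finished");
  } catch (err) {
    console.error("upload failed or was aborted", err);
  }
});
```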
I think it would be a good idea to pass a `max_buffer_size` configuration parameter to the parser constructor so that we can better control how much memory the parser is allowed to use.
I'm having a similar problem parsing CSV files in a worker on edge, where there are hard memory limits.
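For illustration, a hypothetical sketch of what the proposal could look like; `max_buffer_size` is not an existing csv-parse option, and the only memory-related knobs used in the draft above are csv-parse's `max_record_size` and the Web Streams queuing strategies:

```ts
// Hypothetical: the shape of the requested option, reusing the CsvParseTransformer from the issue above.
// "max_buffer_size" does NOT exist in csv-parse today; it is shown commented out.
const parser = new CsvParseTransformer({
  max_record_size: 1_000_000,    // existing option: cap the size of a single record
  // max_buffer_size: 8_000_000, // proposed: cap the parser's total internal buffer
  info: true
});

// What can be bounded today is back-pressure around the TransformStream:
const csvStream = new TransformStream(
  parser,
  new ByteLengthQueuingStrategy({ highWaterMark: 8 * 1024 * 1024 }),  // bytes queued before the parser
  new CountQueuingStrategy({ highWaterMark: 100 })                    // parsed records queued after it
);
```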
What do you mean by `max_buffer_size`? I don't find any reference to this parameter in the Node.js stream API. Any option passed to the parser is also passed to the underlying stream.
Hi,
I'd like to add a +1 to this as we're encountering a similar issue. We're trying to parse large input files via the ESM streaming API without excessive memory pressure in the browser and hitting issues.
Would you be able to share a reproducible script in JS (no TS)?
@wdavidw - I'll work on something when I'm back at my desk for ya.