
chunkSize does not have any effect, defaults to around 65KB

Open apmcodes opened this issue 6 years ago • 7 comments

I'm trying to set chunkSize to 50KB, but no matter what I set, it seems to read chunks of around 65KB. I have tried all 3 settings below individually, but none of them has any effect on the chunk size (the number of lines read from the CSV on each chunk callback remains the same).

options.chunkSize = 40000

Papa.RemoteChunkSize = 40000;

Papa.LocalChunkSize = 40000;

Even after setting options.chunkSize = null, Papa parses in multiple chunks

Please help ...

apmcodes avatar Jan 12 '19 00:01 apmcodes

duh ...

apmcodes avatar Jan 13 '19 21:01 apmcodes

Hi, can you show your configuration?

minified papaparse version : 4.6.0

After some tests, I've found that setting worker: true reads the local file with the default chunk size (which is 10MiB), even when Papa.LocalChunkSize is set to another value.

For that, I used the "large file" (~49MiB) provided in the demo and the following configuration (from the documentation):

Papa.parse(file, {
	delimiter: "",	// auto-detect
	newline: "",	// auto-detect
	quoteChar: '"',
	escapeChar: '"',
	header: false,
	transformHeader: undefined,
	dynamicTyping: false,
	preview: 0,
	encoding: "",
	worker: false,
	comments: false,
	step: undefined,
	complete: parseComplete,
	error: undefined,
	download: false,
	skipEmptyLines: false,
	chunk: chunkComplete,
	fastMode: undefined,
	beforeFirstChunk: undefined,
	withCredentials: undefined,
	transform: undefined,
	delimitersToGuess: [',', '\t', '|', ';', Papa.RECORD_SEP, Papa.UNIT_SEP]
});

var nbChunks = 0;

function parseComplete(results, file)
{
	console.info("parseComplete");
	console.log(nbChunks);
	nbChunks = 0;
}

function chunkComplete(results, parser)
{
	nbChunks++;
}

Let's play with that while altering worker and Papa.LocalChunkSize

worker   Papa.LocalChunkSize    nbChunks
false    default (10*2**20)     5   ✔️
true     default (10*2**20)     5   ✔️
false    2**20                  48  ✔️
true     2**20                  5   ❌

As a workaround, I set worker: false and passed a function as chunk. It seems to work so far. @apmcodes, hope that helps.

Serrulien avatar Feb 09 '19 18:02 Serrulien

forgot to say that when you set worker to false, it won't launch any workers of course

Serrulien avatar Feb 09 '19 19:02 Serrulien

I checked the old issues. Workers do use the given chunk size with the chunkSize configuration property (undocumented). Avoid using Papa.LocalChunkSize with workers.

Serrulien avatar Feb 09 '19 20:02 Serrulien
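
For reference, the workaround from the comments above (pass the chunk size through the chunkSize config property instead of Papa.LocalChunkSize when worker is true) might look like the sketch below. buildChunkedConfig is a hypothetical helper name, not a PapaParse API, and the Papa.parse call is left as a comment because it needs a browser File object.

```javascript
// Hypothetical helper (not part of PapaParse): builds a config that carries
// the chunk size in the config itself, so it does not depend on the global
// Papa.LocalChunkSize being visible to the worker.
function buildChunkedConfig({ useWorker, chunkBytes, onChunk, onComplete }) {
  return {
    worker: useWorker,
    chunkSize: chunkBytes, // per-parse chunk size, in bytes
    chunk: onChunk,
    complete: onComplete,
  };
}

let nbChunks = 0;
const config = buildChunkedConfig({
  useWorker: true,
  chunkBytes: 2 ** 20, // 1 MiB, as in the table above
  onChunk: () => { nbChunks++; },
  onComplete: () => { console.log(nbChunks); },
});

// In the browser, with `file` taken from an <input type="file"> (assumption):
// Papa.parse(file, config);
```

With the ~49MiB demo file, one would expect the chunk count to land near the 48 seen for worker: false, though that has not been verified here.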

@Serrulien Thank you very much for the detailed explanation. Sorry for the late reply.

Please note: I'm using PapaParse in an Express app, with multer middleware handling the multipart file upload.

Since I'm using a cloud service (S3) as the remote file location and reading it through the aws-s3 SDK streaming API, chunkSize does not seem to have any effect (not sure whether streaming is what causes this).

The chunk size received seems to hover around 15KB (~300 rows with a few columns).

NOTE: Even when streaming the CSV file from the browser directly (no cloud storage) to PapaParse in the Express app, I observed the same chunkSize behaviour.

Config
    header: false,
    skipEmptyLines: true,
    chunk: this.importDB.bind(this),
    beforeFirstChunk: this.importModel.bind(this),
    complete: this.importFinish.bind(this, this.cb),
    error: this.importError.bind(this),
    encoding: "utf8",
    preview: 0,
    chunkSize: 40000
    // chunkSize : 1024*1024*10,    // No effect

Info fetched from PapaParse cursor object

results count 687
receivedSize 47657

apmcodes avatar May 16 '19 20:05 apmcodes
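
If the chunk sizes are dictated by the incoming stream, as the comment above suggests, one possible workaround is to regroup rows on the application side before importing them. The sketch below is only an illustration under that assumption: createRowBatcher is a hypothetical helper, not a PapaParse feature, and the 687-row chunks are simulated using the count reported above.

```javascript
// Hypothetical helper (not part of PapaParse): collects rows from successive
// chunk callbacks and hands them on in batches of a fixed row count.
function createRowBatcher(batchRows, onBatch) {
  let pending = [];
  return {
    // Call from the `chunk` callback with results.data.
    push(rows) {
      pending = pending.concat(rows);
      while (pending.length >= batchRows) {
        onBatch(pending.slice(0, batchRows));
        pending = pending.slice(batchRows);
      }
    },
    // Call from the `complete` callback to emit whatever is left over.
    flush() {
      if (pending.length > 0) {
        onBatch(pending);
        pending = [];
      }
    },
  };
}

const batchSizes = [];
const batcher = createRowBatcher(1000, (rows) => batchSizes.push(rows.length));

// Simulate three chunk callbacks of 687 rows each.
for (let i = 0; i < 3; i++) {
  batcher.push(Array.from({ length: 687 }, (_, r) => ["row", String(i * 687 + r)]));
}
batcher.flush();

console.log(batchSizes); // [ 1000, 1000, 61 ]
```

In a real config, push would be called from chunk and flush from complete, so the import code sees fixed-size batches regardless of how the stream splits the data.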

Any updates on this?

akash-rajput avatar Dec 02 '19 07:12 akash-rajput

I had this issue when using fs.createReadStream to read the file. It appears that the chunks come from the stream's own internal buffer, whose size is set by highWaterMark (64 KiB by default for fs.createReadStream, which would match the ~65KB chunks reported above). So it's not PapaParse's fault.

If this is your issue, you can pass options to fs.createReadStream to let it buffer more.
Something like this snippet should get you started...

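const fs = require("fs");
// Assumes Papa and dataPath are already defined. Raises the chunk size tenfold and lets the stream buffer as much per read.
Papa.LocalChunkSize = Papa.LocalChunkSize * 10;
const file = fs.createReadStream(dataPath, { highWaterMark: Papa.LocalChunkSize });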

WarrenWilkinson avatar Feb 28 '22 17:02 WarrenWilkinson
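
To check the effect of highWaterMark independently of PapaParse, a stdlib-only sketch like the one below can help. It writes a temporary CSV, reads it back with the default options and with a larger highWaterMark, and records the size of each "data" chunk. The file name, the 256 KiB value, and the chunkSizes helper are illustrative assumptions.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Read a file as a stream and collect the byte length of every "data" chunk.
function chunkSizes(filePath, highWaterMark) {
  return new Promise((resolve, reject) => {
    const sizes = [];
    const options = highWaterMark ? { highWaterMark } : {};
    fs.createReadStream(filePath, options)
      .on("data", (chunk) => sizes.push(chunk.length))
      .on("error", reject)
      .on("end", () => resolve(sizes));
  });
}

async function main() {
  const tmp = path.join(os.tmpdir(), `chunk-demo-${process.pid}.csv`);
  fs.writeFileSync(tmp, "a,b,c\n".repeat(100000)); // 600,000 bytes
  try {
    const defaults = await chunkSizes(tmp);            // default highWaterMark
    const larger = await chunkSizes(tmp, 256 * 1024);  // 256 KiB per read
    return { defaults, larger };
  } finally {
    fs.unlinkSync(tmp);
  }
}

const result = main();
result.then(({ defaults, larger }) => {
  console.log("default:", defaults.length, "chunks, largest", Math.max(...defaults), "bytes");
  console.log("256 KiB:", larger.length, "chunks, largest", Math.max(...larger), "bytes");
});
```

When such a stream is passed to Papa.parse, the chunk callback appears to follow these stream chunks rather than chunkSize, which is consistent with the behaviour reported in this thread.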