
Help Required: What is the best way of writing 10000 records at once?

Open NaveenShanmugam opened this issue 6 years ago • 3 comments

NaveenShanmugam, Jun 12 '19 13:06

I'm trying to accomplish a similar thing. Does anyone have an idea?

Thanks.

mytototo, Jul 30 '19 11:07

I think the best strategy would be to create a ParquetWriter and then repeatedly call appendRow on it.

The writer has a setting called rowGroupSize that controls how many rows are buffered in memory before a disk flush is performed. See https://github.com/ironSource/parquetjs#buffering--row-group-size and https://github.com/ironSource/parquetjs/blob/master/lib/writer.js#L96

The best value for the row group size depends on your input data, and it may be best to choose it experimentally. Too small a row group size reduces compression efficiency and increases the file size, while a larger value increases the writer's peak memory usage. I would start with something like 8192 and see how it goes.

asmuth, Jul 30 '19 12:07

In any case, the flush seems to be performed by the ParquetWriter but not by the WriteStream. The parquet file on the local disk stays at 1 KB until the stream closes (I think this is due to the Node.js write stream behaviour). Has anybody found a workaround?

dani-pisca, Jul 16 '21 14:07