parquetjs
Help Required: What is the best way of writing 10000 records at once?
I'm trying to accomplish a similar thing. Does anyone have an idea?
Thanks.
I think the best strategy would be to create a ParquetWriter and then repeatedly call appendRow on it.
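Roughly like this (just a sketch; the schema and the `records` array are placeholders for your own data, while `ParquetWriter.openFile`, `appendRow` and `close` are the calls documented in the parquetjs README):

```js
const parquet = require('parquetjs');

// Placeholder schema -- adjust the columns to match your own records.
const schema = new parquet.ParquetSchema({
  id:   { type: 'INT64' },
  name: { type: 'UTF8' }
});

async function writeRecords(records) {
  const writer = await parquet.ParquetWriter.openFile(schema, 'records.parquet');

  // Append each of the 10000 records; the writer buffers rows in memory
  // and flushes a row group to disk whenever the buffer fills up.
  for (const record of records) {
    await writer.appendRow(record);
  }

  // close() flushes any remaining buffered rows and writes the file footer.
  await writer.close();
}
```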
The writer has a setting called rowGroupSize that controls how many rows are buffered in memory before a disk flush is performed. See https://github.com/ironSource/parquetjs#buffering--row-group-size and https://github.com/ironSource/parquetjs/blob/master/lib/writer.js#L96
The best value for the row group size depends on your input data, and it's probably best to choose it experimentally. Too small a row group size reduces compression efficiency and increases the file size, while a larger value increases the writer's peak memory usage. I would start with something like 8192 and see how it goes.
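Using the writer from the sketch above, for example (this snippet mirrors the "Buffering & Row Group Size" section of the README linked above):

```js
const writer = await parquet.ParquetWriter.openFile(schema, 'records.parquet');
writer.setRowGroupSize(8192); // rows buffered in memory before each row group is flushed
```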
Anyway, the flush seems to be performed by the ParquetWriter but not by the underlying WriteStream. The Parquet file on local disk stays at 1 KB until the stream closes (I think this is because of the Node.js write stream behaviour). Has anybody found a workaround?