
Stream compatibility and Parquet writing

Open · Hedwig8 opened this issue 3 years ago · 1 comment

Hello, I found your library very useful for converting between different data formats! I'm specifically using it for CSV->Parquet/Avro and SQL->Parquet/Avro conversions.

Nevertheless, I encountered some "issues" and figured it would be easiest to just ask:

  • All my data lives in remote locations (currently either a SQL server or Azure blob storage), and I don't know beforehand how much data I'm dealing with. Because of this, I want to use streams so I can read/write in chunks without loading everything into memory. Do the readers/writers actually stream for chunked processing, or do they load everything into memory before writing it all at once? (See the sketch after this list.)
  • I'm also finding it odd that the resulting Parquet files can be twice as big as the corresponding CSV file; it is even worse for Avro. I doubt it has anything to do with the library, but I thought bringing it up would not hurt. The original CSV is 150 MB.
  • I don't know whether it is because I'm running my application in debug mode, or just Visual Studio things, but I'm noticing that my application never releases any memory after the read/write task. Since you have IDisposable on every class, could this be Parquet.NET's fault?
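For context on the first bullet, here is a minimal sketch of what a chunked, stream-to-stream conversion might look like. The container/blob names are hypothetical, `BlobClient.OpenRead()` is from Azure.Storage.Blobs v12, and the Stream-based ChoETL constructor overloads are assumed from the library's documented examples, not verified here:

```csharp
using System.IO;
using Azure.Storage.Blobs;
using ChoETL;

// Hypothetical connection string and blob names, for illustration only.
var container = new BlobContainerClient("<connection-string>", "data");

using (Stream src = container.GetBlobClient("input.csv").OpenRead())
using (Stream dest = File.Create("output.parquet"))
using (var csv = new ChoCSVReader(src).WithFirstLineHeader())
using (var parquet = new ChoParquetWriter(dest))
{
    // Records are pulled lazily from the CSV reader and handed to the
    // Parquet writer one at a time, so the full dataset is never
    // materialized in memory at once.
    parquet.Write(csv);
}
```

Wrapping everything in `using` blocks also guarantees `Dispose()` runs, which is relevant to the third bullet: memory that lingers afterwards in debug mode may simply be the GC deferring collection rather than an actual leak.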

Thanks in advance, and I hope you keep up the good work!

Hedwig8 avatar Nov 02 '22 11:11 Hedwig8

  1. ChoETL is stream-based; it does not load everything into memory.
  2. Parquet file size - you can enable compression to reduce it, as shown in the sketch below: .Configure(c => c.CompressionMethod = Parquet.CompressionMethod.Gzip)
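For point 2, a short sketch of where that `Configure` call sits in a write pipeline. The file name and record shape are illustrative; only the `CompressionMethod` setting comes from the answer above:

```csharp
using ChoETL;

using (var parquet = new ChoParquetWriter("output.parquet")
    .Configure(c => c.CompressionMethod = Parquet.CompressionMethod.Gzip))
{
    // Gzip-compressed column chunks typically shrink the output well
    // below the uncompressed size, at some CPU cost on write.
    parquet.Write(new[] {
        new { Id = 1, Name = "alpha" },
        new { Id = 2, Name = "beta" }
    });
}
```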

Hope it helps.

Cinchoo avatar Nov 04 '22 13:11 Cinchoo