
Stream compatibility and Parquet writing

Open · Hedwig8 opened this issue 3 years ago · 1 comment

Hello, I found your library very useful for converting between different data formats! I'm specifically using it for CSV->Parquet/Avro and SQL->Parquet/Avro conversions.

Nevertheless, I encountered some "issues" and figured it would be easiest to just ask:

  • All my data lives in remote locations (currently either a SQL server or Azure blob storage), and I don't know beforehand how much data I'm dealing with. Because of this, I want to use streams so I can read/write in chunks without loading everything into memory. Do the readers/writers actually stream for chunked processing, or do they load everything into memory before writing it all at once? (See the sketch after this list.)
  • I'm also finding it odd that the resulting Parquet files can be twice as big as the corresponding CSV file; it is even worse for Avro. I doubt it has anything to do with the library, but I thought bringing it up would not hurt. The original CSV is 150 MB.
  • I don't know whether it is because I'm running my application in debug mode, or just Visual Studio things, but I'm noticing that my application never releases any memory after the read/write task. Since you have IDisposable on every class, could this be Parquet.NET's fault?
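For context on the first bullet, here is a minimal sketch of what a chunked, stream-to-stream conversion might look like. The container/blob names are hypothetical, `BlobClient.OpenRead()` is from Azure.Storage.Blobs v12, and the Stream-based ChoETL constructor overloads are assumed from the library's documented examples, not verified here:

```csharp
using System.IO;
using Azure.Storage.Blobs;
using ChoETL;

// Hypothetical connection string and blob names, for illustration only.
var container = new BlobContainerClient("<connection-string>", "data");

using (Stream src = container.GetBlobClient("input.csv").OpenRead())
using (Stream dest = File.Create("output.parquet"))
using (var csv = new ChoCSVReader(src).WithFirstLineHeader())
using (var parquet = new ChoParquetWriter(dest))
{
    // Records are pulled lazily from the CSV reader and handed to the
    // Parquet writer one at a time, so the full dataset is never
    // materialized in memory at once.
    parquet.Write(csv);
}
```

Wrapping everything in `using` blocks also guarantees `Dispose()` runs, which is relevant to the third bullet: memory that lingers afterwards in debug mode may simply be the GC deferring collection rather than an actual leak.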

Thanks in advance, and I hope you keep up the good work!

Hedwig8 avatar Nov 02 '22 11:11 Hedwig8

  1. ChoETL is stream-based; it does not load everything into memory.
  2. Parquet file size - you can enable compression to reduce it, as shown in the sketch below: .Configure(c => c.CompressionMethod = Parquet.CompressionMethod.Gzip)
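For point 2, a short sketch of where that `Configure` call sits in a write pipeline. The file name and record shape are illustrative; only the `CompressionMethod` setting comes from the answer above:

```csharp
using ChoETL;

using (var parquet = new ChoParquetWriter("output.parquet")
    .Configure(c => c.CompressionMethod = Parquet.CompressionMethod.Gzip))
{
    // Gzip-compressed column chunks typically shrink the output well
    // below the uncompressed size, at some CPU cost on write.
    parquet.Write(new[] {
        new { Id = 1, Name = "alpha" },
        new { Id = 2, Name = "beta" }
    });
}
```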

Hope it helps.

Cinchoo avatar Nov 04 '22 13:11 Cinchoo