ChoETL
Stream compatibility and Parquet writing
Hello, I found your library to be very useful for converting between different data formats! I'm specifically using it for CSV->Parquet/Avro and SQL->Parquet/Avro conversions.
Nevertheless, I encountered some "issues" and I believe it is easier to ask:
- I have all my data in remote locations (right now either a SQL Server or Azure Blob Storage) and I don't know beforehand how much data I'm dealing with. Because of this, I want to use streams to enable reading/writing in chunks without loading everything into memory. Do the readers/writers use the streams for chunked processing, or do they load everything into memory before writing it all at once?
- I'm also finding it odd that the resulting Parquet files can be twice as big as the corresponding CSV file. It is even worse for Avro. I doubt it has anything to do with the library, but I thought bringing it up would not hurt. The original CSV is 150 MB.
- I don't know if it is because I'm running my application in debug mode, or just Visual Studio things, but I'm noticing that my application never releases any memory after the read/write task. Since you implement IDisposable in every class, could this be Parquet.NET's fault?
Thanks in advance, and I hope you keep up the good work!
- ChoETL is stream based; it does not load everything into memory. Records flow from the reader to the writer as they are consumed (see the sketch after these notes).
- Parquet file size - you can enable compression on the writer to reduce the size:
.Configure(c => c.CompressionMethod = Parquet.CompressionMethod.Gzip)
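Here is a minimal sketch of how that Configure call fits into a full CSV -> Parquet conversion, assuming the file-based ChoCSVReader / ChoParquetWriter constructors and the ChoETL + ChoETL.Parquet NuGet packages; the file names are placeholders:

```csharp
using ChoETL;

// Sketch: records are pulled from the CSV reader and handed to the Parquet
// writer as an enumerable, so the full file is not materialized in memory.
using (var r = new ChoCSVReader("input.csv").WithFirstLineHeader())
using (var w = new ChoParquetWriter("output.parquet")
    .Configure(c => c.CompressionMethod = Parquet.CompressionMethod.Gzip))
{
    w.Write(r);
}
```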
Hope it helps.