Add Parquet support for importing and exporting data to/from DataFrame.
Currently, the only supported way to import or export data is CSV. It would be very useful to add support for other commonly used data formats or databases.
What other data formats are you thinking of? If we can get specific formats people want it helps with prioritizing things.
@michaelcfanning I think that Parquet would be useful too, not sure about other formats.
I think you meant @michaelgsharp but, since I'm here, I agree Parquet would be useful. :)
@michaelcfanning ah sorry my bad :)
I'll go ahead and change this title to be adding parquet support then.
Glad you could join us, other Michael :)
Hi, I'm one of the maintainers of ParquetSharp, and we've just released the first version of ParquetSharp.DataFrame which supports reading Parquet into DataFrames and exporting DataFrames to Parquet. The API isn't yet stable but functionality should be fairly complete so we'd love to get some feedback.
Microsoft may still want to provide built-in Parquet support in Microsoft.Data.Analysis, so this probably doesn't fix this issue, but hopefully it is useful to anyone following along. If our work could help with providing that built-in Parquet support, we'd be happy to help out where we can.
@adamreeve thanks for pointing this out!
We are currently in the process of creating/prioritizing our dataprep plan so we will definitely keep this in mind.
I've been out of the loop here for a while, so this may not be as accurate anymore: FWIW, at one point the main thought here was to create a separate DataFrame.IO library that could incrementally add support for the different data formats in common use. Just something to consider.
I like the idea of a separate IO library. Performant access to a format such as Parquet, which would let users move data between .NET and Python as they see fit, would be a good thing. CSVs aren't particularly good at preserving types or handling large data sets.
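To illustrate the point about CSV and types, here is a minimal sketch using only the Python standard library. The sample rows and column names are made up for illustration. It writes a few typed values to CSV and reads them back, showing that everything comes back as a string because CSV carries no schema.

```python
import csv
import io
from datetime import date

# Hypothetical sample rows with an int, a float, a date and a missing value.
rows = [
    {"id": 1, "price": 9.5, "day": date(2021, 3, 1), "note": None},
    {"id": 2, "price": 12.0, "day": date(2021, 3, 2), "note": "ok"},
]

# Write the rows to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "price", "day", "note"])
writer.writeheader()
writer.writerows(rows)

# Read them back: CSV stores no type information, so every value is a string.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))

original_types = {k: type(v).__name__ for k, v in rows[0].items()}
restored_types = {k: type(v).__name__ for k, v in round_tripped[0].items()}
print(original_types)
print(restored_types)
```

After the round trip the reader has to guess or re-declare every column type, and the missing value is indistinguishable from an empty string. A format with an embedded schema, such as Parquet, avoids that guesswork.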
DataFrame only seems to support primitive data types, unlike parquet which supports arrays, structs, maps, lists etc.
Therefore I can't see this working unless DataFrame adds support for complex types?
@aloneguid hm doesn't it support complex types like DataFrame in Spark does? In any case, it's still possible to make it import/export files that contain primitive types only.
My concern is the limited data interop between different languages. The de facto standard is currently pandas, which supports both Parquet and Feather. To gain traction, any library or language positioning itself in this space needs to be able to exchange large data sets with that de facto standard; otherwise it will find it very hard to grow its usage.
I am hitting exactly this issue: for a lot of things C# is clearly the better language choice, but because it cannot handle a viable data interop format that Python or R can also use (CSV is not it), it is a non-starter. The result is that yet another company will decrease its use of the language.
As @andrei-faber mentions - you don't need to support every type a format can accept first up, but supporting primitive types fully would be a start that could be built upon.
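The "primitive types first" idea could be sketched roughly as follows. This is only an illustration: the schema dictionary, the type names and the `split_columns` helper are all hypothetical and do not correspond to any real DataFrame or Parquet library API.

```python
# Hypothetical set of logical type names treated as "primitive".
PRIMITIVE_TYPES = {"bool", "int32", "int64", "float", "double", "string"}


def split_columns(schema):
    """Split a {column name: type name} mapping into the columns an
    importer could load now and the columns it would have to skip."""
    supported = [name for name, kind in schema.items() if kind in PRIMITIVE_TYPES]
    skipped = [name for name, kind in schema.items() if kind not in PRIMITIVE_TYPES]
    return supported, skipped


# Hypothetical file schema mixing primitive and complex columns.
example_schema = {
    "id": "int64",
    "name": "string",
    "scores": "list<double>",
    "address": "struct",
}

supported, skipped = split_columns(example_schema)
print("import:", supported)
print("skip for now:", skipped)
```

An importer built this way could report the skipped columns instead of failing outright, and complex type support could be added later without changing the primitive path.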
> @aloneguid hm doesn't it support complex types like DataFrame in Spark does? In any case, it's still possible to make it import/export files that contain primitive types only.
I couldn't find it. There is #6088, which requests simple array support, but that's all. Looking at pandas, even struct support is a bit awkward. I'm not an expert, though, so maybe someone can clarify why only basic types are supported and whether that is sufficient?