
Add Parquet support for importing and exporting data to/from DataFrame.

Open andrei-faber opened this issue 4 years ago • 8 comments

Currently, the only supported way to import or export data is the CSV format; it would be very useful to add support for other commonly used data formats or databases.

andrei-faber • Oct 13 '21 15:10

What other data formats are you thinking of? If we can get the specific formats people want, it helps with prioritizing things.

michaelgsharp • Oct 14 '21 21:10

@michaelcfanning I think that Parquet would be useful too, not sure about other formats.

andrei-faber • Oct 14 '21 21:10

I think you meant @michaelgsharp but, since I'm here, I agree Parquet would be useful. :)

michaelcfanning • Oct 14 '21 22:10

@michaelcfanning ah sorry my bad :)

andrei-faber • Oct 14 '21 22:10

I'll go ahead and change the title of this issue to adding Parquet support, then.

Glad you could join us, other Michael :)

michaelgsharp • Oct 15 '21 20:10

Hi, I'm one of the maintainers of ParquetSharp, and we've just released the first version of ParquetSharp.DataFrame, which supports reading Parquet into DataFrames and exporting DataFrames to Parquet. The API isn't yet stable, but functionality should be fairly complete, so we'd love to get some feedback.

Microsoft may still want to provide built-in Parquet support in Microsoft.Data.Analysis, so this probably doesn't fix this issue, but it is hopefully useful to anyone following it. If our work could be useful for providing that built-in Parquet support, we'd be happy to help out where we can.

adamreeve • Jan 12 '22 22:01

@adamreeve thanks for pointing this out!

We are currently in the process of creating and prioritizing our data prep plan, so we will definitely keep this in mind.

michaelgsharp • Jan 18 '22 18:01

I've been out of the loop here for a while, so this may no longer be accurate: FWIW, at one point the main thought here was to create a separate DataFrame.IO library that could incrementally add support for the different data formats in common use. Just something to consider.

pgovind • Jan 28 '22 06:01

I like the idea of the separate IO library and believe that performant access to a format such as Parquet, which would allow interop between .NET and Python as users see fit, would be a good thing. CSVs aren't particularly good at preserving types or handling large data sets.
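To illustrate the point about types, here is a minimal sketch in Python, using only the standard `csv` module. Python is used only because the thread names it as the interop target; the sample rows and the `csv_round_trip` helper are made up for illustration and are not part of any library discussed here.

```python
import csv
import io

# Hypothetical sample rows with int, float, and bool columns.
rows = [
    {"id": 1, "price": 9.5, "in_stock": True},
    {"id": 2, "price": 12.0, "in_stock": False},
]


def csv_round_trip(records):
    """Write records to CSV text in memory, then read them back."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


restored = csv_round_trip(rows)

# CSV stores no schema, so every value comes back as a string and the
# consumer has to guess or be told the original column types.
for column, value in restored[0].items():
    print(column, repr(value), type(value).__name__)
```

A format that stores a schema alongside the data, as Parquet does, avoids this guessing step on the reading side.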

totalgit74 • Nov 18 '22 01:11

DataFrame only seems to support primitive data types, unlike Parquet, which supports arrays, structs, maps, lists, etc.

Therefore, I can't see this working unless DataFrame adds support for complex types.

aloneguid • Apr 17 '23 10:04

@aloneguid hm doesn't it support complex types like DataFrame in Spark does? In any case, it's still possible to make it import/export files that contain primitive types only.

andrei-faber • Apr 18 '23 18:04

My current concern is the limited data interop between different languages. The de facto standard is currently pandas, which supports both Parquet and Feather. To gain better traction, any libraries or languages positioning themselves in this space need to be able to exchange large data sets with that de facto standard; otherwise they will find it very hard to grow their usage.

I am hitting this issue where, for a lot of things, C# is clearly the better language choice, but because it cannot handle a viable data interop format (CSV is not it) that Python or R can use, it is a non-starter. This leads to yet another company decreasing its use of the language.

As @andrei-faber mentions, you don't need to support every type a format can accept up front, but fully supporting primitive types would be a start that could be built upon.

totalgit74 • Apr 18 '23 22:04

> @aloneguid hm doesn't it support complex types like DataFrame in Spark does? In any case, it's still possible to make it import/export files that contain primitive types only.

I couldn't find it. There is #6088, which requests simple array support, but that's all. Looking at pandas, even struct support is a bit awkward. I'm not an expert, though; maybe someone can clarify why only basic types are supported and whether that is sufficient?

aloneguid • Apr 19 '23 10:04