
Possible improvements to the parquet format implementation

roll opened this issue 3 years ago • 2 comments

Overview

From @ewheeler

One thing to note-- fastparquet and pyarrow libraries have some parquet handling differences with pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet https://github.com/pandas-dev/pandas/issues/42968#issuecomment-965318185

In the words of a pandas contributor: "Summary: it's a mess :)"

Pandas defaults to pyarrow and falls back to fastparquet -- but its parquet methods also take an 'engine' param so users can explicitly choose.
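For reference, this is roughly what that choice looks like at the pandas level (file names here are just placeholders):

```python
import pandas as pd

# pandas lets callers pick the parquet backend explicitly; the default
# engine="auto" tries pyarrow first and falls back to fastparquet.
df = pd.read_parquet("data.parquet", engine="pyarrow")
df.to_parquet("out.parquet", engine="fastparquet")
```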

Each has its own pros/cons, different dependencies, and different compatibilities with other python packages. For an older version of python (3.4, I think? or maybe it was with a specific numpy version?), it was impossible to have both pyarrow and fastparquet installed in the same python environment. I'd guess that most folks and projects tend to use one or the other, depending on needs and other dependencies.

Not suggesting one versus the other, but wanted to flag this since supporting only one engine might limit how frictionless-py can be used with existing projects -- so it's worth emphasizing in the docs and/or considering an 'engine' param a la pandas.


@kindly

My only concern with this is that you are limiting yourself to data that can be loaded "in memory" into a pandas dataframe. That might be acceptable to begin with. However, ideally it should scale beyond that: parquet is mostly considered a "big data" format, and the stated aims of frictionless are:

  * Low memory consumption for data of any size
  * Reasonable performance on big data

Reading seems not too bad in terms of memory consumption, since you only read a single row group (which is essentially a set of rows) at a time; however, there is no guarantee the row group will fit in memory.
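To illustrate the reading side, here is a rough, untested pyarrow sketch of row-group-at-a-time access (the file name is a placeholder):

```python
import pyarrow.parquet as pq

# Read one row group at a time instead of the whole file; note that
# each individual row group still has to fit in memory.
pf = pq.ParquetFile("data.parquet")
for i in range(pf.num_row_groups):
    table = pf.read_row_group(i)
    for row in table.to_pylist():  # one dict per row, similar to a row stream
        print(row)
```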

As for writing, I think the whole input is put into a dataframe before writing. The simplest solution to this is to put many smaller parquet files in a directory, as most libraries (including pandas) support treating those files as one large file. So you can chunk up the input into smaller dataframes and then write them out to separate files in that directory. The other option would be to write some library that takes a stream and uses a lower-level parquet writer, but I could not find a decent python one.
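For illustration, something along these lines is what I have in mind (an untested sketch; the helper name and paths are made up):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

def write_chunks_to_dir(row_chunks, out_dir):
    # Write each chunk of rows (a list of dicts) as its own parquet file;
    # readers like pandas/pyarrow can then treat the directory as one dataset.
    os.makedirs(out_dir, exist_ok=True)
    for i, chunk in enumerate(row_chunks):
        table = pa.Table.from_pylist(chunk)
        pq.write_table(table, os.path.join(out_dir, f"part-{i:05d}.parquet"))

# e.g. pandas can read the directory back as a single dataframe:
# df = pd.read_parquet("out_dir/")
```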

For datapackage_convert I used the rust parquet library, which could convert from the arrow memory representation in batches. The arrow library has a CSV reader built in, which made it easy, but I am not sure how easy it would be with the dict streams that you use here.
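In Python, the rough equivalent of that batch-wise approach would be something like this (untested sketch; file names are placeholders):

```python
import pyarrow.csv as pa_csv
import pyarrow.parquet as pq

# Stream the CSV as record batches and append each batch to a single
# parquet file, so only one batch is held in memory at a time.
reader = pa_csv.open_csv("input.csv")
writer = None
for batch in reader:
    if writer is None:
        writer = pq.ParquetWriter("output.parquet", batch.schema)
    writer.write_batch(batch)
if writer is not None:
    writer.close()
```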

roll · Jul 22 '22 08:07

Hi @roll,

Thanks for tackling this and taking the time to think it through!

Couple quick thoughts:

  1. One advantage of going with pyarrow over fastparquet is that it should make it quite easy to support other Arrow file formats such as feather/IPC (a small sketch follows this list). There is also a lot of momentum behind pyarrow right now, and the API is quite clean and well thought out, imo.

  2. Regarding the aims of frictionless vs. big data:

    1. this is perhaps obvious, but worth mentioning for context: anything considered 'big data' today is likely to be considered small-ish 10 years from now;
    2. if it's possible to support columnar data approaches without adding significant complexity, why not? If it does significantly complicate things with the frictionless architecture (you would know better), then it's definitely fair to debate the usefulness of adding the support vs. increasing complexity in the codebase.
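To illustrate point 1, writing the same Arrow table as parquet or feather/IPC is essentially the same amount of code with pyarrow (untested sketch; file names are placeholders):

```python
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# The same in-memory Arrow table can be written to either format.
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
pq.write_table(table, "data.parquet")
feather.write_feather(table, "data.feather")
```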

Lastly, this is old and probably needs to be updated & re-evaluated (I'll try and take a stab at it in the coming weeks, if possible), but I did some benchmarking a while back comparing different file formats & compression standards for i/o:

https://github.com/khughitt/benchmark-compression

Even back then, the performance of feather/parquet for both speed and storage was quite impressive, which is why I mostly work with those formats and consider them for larger data applications, even when columnar access is not critical.

khughitt · Jul 22 '22 14:07

Thanks @khughitt!

And great benchmark BTW :+1:

roll · Jul 25 '22 06:07