koalas

save and load parquet with MultiIndex (row) index and columns

Open: ikravets opened this issue 4 years ago • 1 comment

I'm experimenting with Koalas. My pandas DataFrames use a MultiIndex for both rows and columns. Such DataFrames can be saved to and loaded from Parquet files using PyArrow, and Koalas can convert them to and from pandas successfully. However, Koalas cannot save or load them directly to or from Parquet. Having to go through pandas just to load or store the data limits the data size to what fits in memory on a single machine, which kind of defeats the purpose of using Koalas.

PyArrow stores the information needed to reconstruct a MultiIndex in the Parquet file metadata. It would be nice if Koalas used the same approach for better compatibility, perhaps even reusing the PyArrow library. Pointers to PyArrow implementation:

Right now Koalas supports MultiIndex save/load for rows, but it requires specifying the index_col parameter on every to_parquet()/read_parquet() call, which is less convenient than the PyArrow approach, where the index is restored from the file metadata automatically.

ikravets, May 10 '20 12:05

FYI: the read path was resolved in #1695.

ueshin, Aug 04 '20 18:08