Incorrect roundtrip of index names on filtered dataframe

Open philippjfr opened this issue 3 years ago • 6 comments

When saving a filtered dataframe to parquet using pandas and fastparquet, the index names are round-tripped incorrectly:

import pandas as pd

df = pd._testing.makeMixedDataFrame()

filtered_df = df[df.A>=1]

filtered_df.to_parquet('test.parq', engine='fastparquet')

loaded_df = pd.read_parquet('test.parq')

print(filtered_df.index.names)
print(loaded_df.index.names)
FrozenList([None])
FrozenList(['index'])

Versions

fastparquet 0.7.2 pandas 1.3.2

philippjfr avatar Jan 19 '22 12:01 philippjfr

Hello, I believe this is expected behavior for fastparquet. When you filter, I am guessing that the index of the dataframe is no longer a range index, which means fastparquet stores it as a regular column. When an unnamed index is reset into a column, it is given the default name 'index' (by pandas, actually). This is what you see.
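The two steps described above can be checked with pandas alone. This is a minimal sketch on made-up data (not the frame from the report), and it only illustrates the pandas side of the explanation; it does not call fastparquet or show its internals.

```python
import pandas as pd

# Illustrative data: the mask below keeps rows 1, 3 and 4.
df = pd.DataFrame({"A": [0.0, 1.0, 0.0, 1.0, 1.0]})
filtered_df = df[df.A >= 1]

# The unfiltered frame has a RangeIndex; the filtered one carries explicit labels.
print(type(df.index).__name__)
print(isinstance(filtered_df.index, pd.RangeIndex))

# The filtered index is still unnamed at this point.
print(filtered_df.index.names)

# Turning an unnamed index into a column makes pandas name that column "index".
as_columns = filtered_df.reset_index()
print(list(as_columns.columns))
```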

Is the index itself correct, i.e. is it what you expect? (Just to check there is no bug at that level.)

You can get the behavior you are expecting by passing write_index=False to fastparquet. I believe that pandas forwards extra parameters to fastparquet, so this should work:

filtered_df.to_parquet('test.parq', engine='fastparquet', write_index=False)

Bests

yohplala avatar Jan 19 '22 13:01 yohplala

Thanks @yohplala, I can see that reasoning and your technical explanation makes sense. However, I still disagree that this is expected. I would expect a DataFrame to round-trip exactly as is, i.e. it should pass pd.testing.assert_frame_equal(original_df, loaded_df). If you switch to engine='pyarrow' it behaves as expected.
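To make the round-trip criterion concrete without writing any file, the sketch below simulates the reported result: `loaded_df` is a stand-in built with `rename_axis`, not an actual fastparquet read. It shows that `assert_frame_equal` rejects the pair by default and accepts it only when names are ignored.

```python
import pandas as pd

original_df = pd.DataFrame({"A": [1.0, 2.0]}, index=[1, 3])

# Stand-in for the frame read back in the report: same data, index named "index".
loaded_df = original_df.rename_axis("index")

# By default assert_frame_equal also compares index names, so this pair fails.
try:
    pd.testing.assert_frame_equal(original_df, loaded_df)
    names_match = True
except AssertionError:
    names_match = False
print(names_match)

# With check_names=False the comparison passes, since only the name differs.
pd.testing.assert_frame_equal(original_df, loaded_df, check_names=False)
```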

philippjfr avatar Jan 19 '22 14:01 philippjfr

Indeed, I think we can call this a bug. Parquet requires that the column being saved has a real str name, but we also save pandas metadata, in which we can give the actual final name of the index. Either we are not writing the metadata, or we are not applying it correctly. We can check by doing the roundtrip pyarrow/fastparquet and fastparquet/pyarrow.
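As a rough sketch of what "applying the metadata" could look like: the pandas metadata spec stores, for each column, both a logical `name` (which may be null) and the `field_name` used in the file. The helper `restore_index_names` and the metadata dict below are hypothetical and written only for illustration; they are not fastparquet code, and the dict is not confirmed to match what fastparquet actually writes.

```python
import pandas as pd


def restore_index_names(df, pandas_metadata):
    """Hypothetical helper: map stored field names back to logical index names."""
    logical_names = {
        col["field_name"]: col["name"] for col in pandas_metadata["columns"]
    }
    # index_columns may also hold RangeIndex descriptors (dicts); skip those here.
    restored = [
        logical_names.get(field, field)
        for field in pandas_metadata["index_columns"]
        if isinstance(field, str)
    ]
    out = df.copy()
    if len(restored) == out.index.nlevels:
        out.index.names = restored
    return out


# Assumed metadata for an unnamed index stored under the field name "index".
metadata = {
    "index_columns": ["index"],
    "columns": [
        {"name": "A", "field_name": "A"},
        {"name": None, "field_name": "index"},
    ],
}

# Stand-in for a frame loaded with the index named after its storage column.
loaded_df = pd.DataFrame({"A": [1.0, 2.0]}, index=pd.Index([1, 3], name="index"))
restored_df = restore_index_names(loaded_df, metadata)
print(restored_df.index.names)
```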

This behaviour has been around a long time, I think, and there are tests in dask which use both engines and explicitly ignore the name of the index, if it was None. Fixing this might break those tests! Personally, I think "index" is a fine name for an index :)

martindurant avatar Jan 19 '22 14:01 martindurant

@yohplala , you are probably in a good place to ensure None roundtrips, if you have any interest. I can fix any tests that this causes to fail in Dask. I have the feeling the issue isn't high priority.

martindurant avatar Jan 24 '22 15:01 martindurant

Hi @martindurant, to be honest, I have no need for this, and I am only able to code in my spare time, a few hours per week. So this will be a very low priority for me. That said, be assured I am a proactive supporter of fastparquet, and I would propose to leave this ticket open. In the short term, I am prioritizing private developments, which I think I should be able to finish within the next two months. After those, I was thinking of dealing with some fastparquet tickets. I don't know if I will deal with this one first, but let's keep it in the stack.

yohplala avatar Jan 24 '22 21:01 yohplala

No rush! I might do it myself also, but I have a similar problem with finding time :)

martindurant avatar Jan 24 '22 22:01 martindurant