
Index=False is not working in to_parquet command

Open yelizkilinc opened this issue 4 years ago • 6 comments

import pandas as pd
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df.to_parquet('df.parquet', compression=None, index=False)
pd.read_parquet('df.parquet')

I am running this code and the output still shows an index, even though I set index=False.

yelizkilinc avatar Mar 23 '21 14:03 yelizkilinc

Thanks for the report @yelizkilinc. I believe this is working as intended - index=False only tells to_parquet not to write the dataframe's index to the file. But when the parquet file is read back, the resulting dataframe still needs an index - since none was written, the default index (0 to len(df) - 1) is created. The effect of index=False can be seen with a dataframe that does not have a default index:

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]}, index=[2, 3])
df.to_parquet('df.parquet', index=False)
print(pd.read_parquet('df.parquet'))

gives

   col1  col2
0     1     3
1     2     4

mzeitlin11 avatar Mar 23 '21 15:03 mzeitlin11

Yes, you are right. But I want to get rid of the default index. When writing to CSV or other formats, I can drop the index by setting index=False, as far as I know, and I was expecting the same when writing to a parquet file. I do not want to see the index in the parquet file. Do you know any way to solve this?

yelizkilinc avatar Mar 24 '21 09:03 yelizkilinc

How do you know that the index is still in the parquet file? What I was saying above is that the index is not written to the parquet file; the default index is just being added by the read_parquet call.

mzeitlin11 avatar Mar 24 '21 13:03 mzeitlin11

I think so. Reading it back is the only way I can see the contents of the parquet file. How can I verify that?

The main problem is that even though the dataframe has only two column names (col1, col2), just like in the example, I see col1, col1, col2 (col1 twice) on Redshift after running the AWS crawler. I thought it may be because of the default index: there is no column name for the default index, so something goes wrong.

yelizkilinc avatar Mar 24 '21 13:03 yelizkilinc

It doesn't look like the index is being written; we can verify by loading the file with pyarrow directly:

index=True:

import pandas as pd
import pyarrow.parquet

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df.to_parquet('df.parquet', index=True)
print(pyarrow.parquet.read_table('df.parquet'))

gives

pyarrow.Table
col1: int64
col2: int64
__index_level_0__: int64

index=False:

import pandas as pd
import pyarrow.parquet

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df.to_parquet('df.parquet', index=False)
print(pyarrow.parquet.read_table('df.parquet'))

gives

pyarrow.Table
col1: int64
col2: int64

mzeitlin11 avatar Mar 24 '21 17:03 mzeitlin11

The index is definitely still being written to the parquet file when index=False is specified, and this is breaking my copy into Redshift. Any idea how to mitigate this issue?

Syrus avatar Apr 22 '22 04:04 Syrus