index=False is not working in the to_parquet command
import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df.to_parquet('df.parquet', compression=None, index=False)
pd.read_parquet('df.parquet')
When I run this code, the index still shows up even though I set index=False.
Thanks for the report @yelizkilinc. I believe this is working as intended - index=False just specifies not to write the dataframe index. But when the parquet file is read back, the resulting dataframe still needs an index - since no index was written, the default index (0 to len(df)) is used. The effect of index=False can be seen with a dataframe that does not have a default index:
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]}, index=[2, 3])
df.to_parquet('df.parquet', index=False)
print(pd.read_parquet('df.parquet'))
gives
col1 col2
0 1 3
1 2 4
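For contrast, index=True round-trips the custom index (a minimal sketch, assuming the pyarrow engine is installed):

import pandas as pd

# Same dataframe with a non-default index
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]}, index=[2, 3])

# index=True materializes the index in the file, so read_parquet restores it
df.to_parquet('df.parquet', index=True)
print(pd.read_parquet('df.parquet'))
#    col1  col2
# 2     1     3
# 3     2     4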
Yes, you are right. But I want to get rid of the default index. When writing to CSV and other formats, I can drop the index by setting index=False, and I was expecting the same when writing to a parquet file. I do not want to see the index in the parquet file. Do you know any way to solve this?
Do you know that the index is actually in the parquet file? What I was saying above is that the index is not written to the parquet file; the default index is just being added by the read_parquet call.
I think so, yes. Reading it back is the only way I have to inspect the contents of the parquet file. How can I verify it?
The main problem is that even though the dataframe has only two columns (col1, col2), just like in the example, I see col1, col1, col2 (col1 appears twice) in Redshift after running the AWS crawler. I thought it might be caused by the default index: there is no column name for the default index, so something goes wrong.
Loading the file with pyarrow directly, it doesn't look like the index is being written:
index=True:
import pandas as pd
import pyarrow.parquet

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df.to_parquet('df.parquet', index=True)
print(pyarrow.parquet.read_table('df.parquet'))
pyarrow.Table
col1: int64
col2: int64
__index_level_0__: int64
index=False:
import pandas as pd
import pyarrow.parquet

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df.to_parquet('df.parquet', index=False)
print(pyarrow.parquet.read_table('df.parquet'))
pyarrow.Table
col1: int64
col2: int64
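As a lighter-weight check that avoids loading the data, pyarrow can read just the file's schema (a sketch, assuming pyarrow.parquet.read_schema is available in your pyarrow version):

import pyarrow.parquet as pq

# Reads only the schema from the file footer, not the row data
schema = pq.read_schema('df.parquet')
print(schema.names)  # ['col1', 'col2'] when the file was written with index=False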
The index is definitely still being written to the parquet file when index=False is specified, and this is breaking my COPY into Redshift. Any idea how to mitigate this issue?
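If the index really is ending up in your file, two workarounds that should avoid writing it (a sketch, assuming the pyarrow engine; adjust the file path to your setup):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

# Option 1: reset to a default RangeIndex, then skip writing it
df.reset_index(drop=True).to_parquet('df.parquet', index=False)

# Option 2: go through pyarrow directly and drop the index at conversion time
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, 'df.parquet')

# Either way, the file schema should contain only col1 and col2
print(pq.read_schema('df.parquet').names)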