BUG: Inconsistent handling of named column indices in to_csv() and read_csv().
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Issue Description
Apparently, it is not possible to read in a .csv table with a named column index AND a named row index.
How to reproduce:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 3),
                  columns=["A", "B", "C"]).round(1)
df.index.name = "Count"
df.columns.name = "Name"
# INCONSISTENCY 1:
# .to_csv() only saves the row index name; it ignores the column index name.
df.to_csv("out.csv")
# The work-around: convert the column index to a MultiIndex
df.columns = pd.MultiIndex.from_arrays([df.columns])
# Now, .to_csv() stores both column and index names.
df.to_csv("out.csv")
This is what the file looks like:
Name,A,B,C
Count,,,
0,1.8,0.4,1.0
1,2.2,1.9,-1.0
2,1.0,-0.2,-0.1
3,0.4,0.1,1.5
4,0.8,0.1,0.4
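For comparison, the first df.to_csv("out.csv") call above (before the MultiIndex work-around) produces, as far as I can tell, a file where the column index name "Name" is simply dropped:
Count,A,B,C
0,1.8,0.4,1.0
1,2.2,1.9,-1.0
...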
However, when reading the file back in, I found no way to preserve both the row and column index names.
df_new = pd.read_csv("out.csv")
print(df_new)
# ==> The index-name line ("Count,,,") is read as a data row
df_new = pd.read_csv("out.csv", header=[0])
print(df_new)
# ==> Same as above: the index-name line becomes a data row
df_new = pd.read_csv("out.csv", header=[0, 1])
print(df_new)
# ==> Creates a two-level multi-index column header
df_new = pd.read_csv("out.csv", header=[0, 1], index_col=[0])
print(df_new)
# ==> Creates a two-level multi-index column header
df_new = pd.read_csv("out.csv", header=[1], index_col=[0])
print(df_new)
# ==> Correctly creates the row index, but loses the column info
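The closest I got is manual post-processing. A work-around sketch, which assumes the exact two-line layout of out.csv shown above:
# Skip the index-name line while parsing, then read the two name cells from
# the first column of the first two lines and reattach them by hand.
df_new = pd.read_csv("out.csv", header=0, skiprows=[1], index_col=0)
with open("out.csv") as f:
    columns_name = f.readline().split(",")[0].strip()  # "Name"
    index_name = f.readline().split(",")[0].strip()    # "Count"
df_new.columns.name = columns_name
df_new.index.name = index_name
print(df_new)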
I think this behavior is inconsistent, as switching to a multi-index with two or more levels works as expected:
# INCONSISTENCY 2:
# A multi-index column header with >1 levels is handled correctly
df_mh = df.copy()
l1 = list("ABC")
l2 = list("abc")
df_mh.columns = pd.MultiIndex.from_arrays([l1, l2])
df_mh.columns.names = ["Name1", "Name2"]
df_mh.to_csv("out_mh.csv")
df_new = pd.read_csv("out_mh.csv", header=[0,1], index_col=[0])
This reads in the CSV correctly.
The following SO posts are related:
- https://stackoverflow.com/questions/35047842
- https://stackoverflow.com/questions/67499614
Expected Behavior
In summary, I identified the following inconsistencies:
- For a simple pd.Index, the row index name is preserved by to_csv(), but the column index name is not. This is unexpected and could at least be documented.
- For the corner case of a pd.MultiIndex with one level, pd.read_csv() fails to read the multi-level header back in.
I know it's a corner case. However, given that the column index can have a name, and that the row and column indices are represented by the same kind of object, it is somewhat unexpected that these inconsistencies occur.
Installed Versions
INSTALLED VERSIONS
commit : fd3f57170aa1af588ba877e8e28c158a20a4886d
python : 3.10.5.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:56 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.UTF-8
pandas : 2.2.0
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 24.0
Cython : 0.29.30
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.9.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Thanks for the report. Is there a format we could use that would be backwards compatible and allow for storing the column index name? If not, I do not think we should make a backwards incompatible change to support it. But we can document that it will get ignored.
As an aside, the CSV format is a good way to store simple tabular data in readable plaintext, but it is not a good format for round-tripping. If you are looking for round-tripping, I would suggest parquet.
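For instance, reusing the df from the example above, something like this should round-trip both names (assuming the pyarrow engine is available; I have not checked fastparquet):
# Parquet stores the pandas metadata, so both the row and column index names
# should survive the round trip.
df.to_parquet("out.parquet")
df_back = pd.read_parquet("out.parquet")
print(df_back.index.name, df_back.columns.name)  # should print: Count Name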
Thank you for the answer. Yes, I totally agree that CSV is a limited format and that Parquet, or perhaps HDF5, is more suitable. Still, when working with less technically skilled people, CSV is hard to avoid.
I don't quite get the rationale for the following behavior:
# Two-level header
df1 = pd.read_csv("data.csv", header=[0,1], index_col=[0])
# One-level header
df2 = pd.read_csv("data.csv", header=[0], index_col=[0])
df1 is read as expected, with and without named column levels, and also if the row index is not named. Note that depending on the presence of a row index name, read_csv() consumes either two or three lines before reading the data rows, so read_csv() decides correctly, based on some internal logic, which case applies. For this special logic to be triggered, one has to provide both header=[0,1] and index_col=[0]; in this sense, the special reading logic is explicitly requested.
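To illustrate, a small experiment one could run, reusing df_mh from above (the file names named.csv and unnamed.csv are just placeholders, and the line counts in the comments reflect the behavior described here):
# Write the same MultiIndex-column frame once with and once without a row
# index name, then read both back with identical arguments.
df_named = df_mh.copy()
df_unnamed = df_mh.copy()
df_unnamed.index.name = None
df_named.to_csv("named.csv")      # three lines before the data
df_unnamed.to_csv("unnamed.csv")  # two lines before the data
for path in ("named.csv", "unnamed.csv"):
    back = pd.read_csv(path, header=[0, 1], index_col=[0])
    print(path, back.index.name, list(back.columns))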
However, in the case of a single-level column index, pandas cannot correctly distinguish between the case with and without index name, that is, a CSV with or without an extra second line reserved for the row index name. Why is this ability lost?
The backwards compatibility requirement could be circumvented by using an extra flag. However, read_csv() already has a lot of parameters, so introducing yet another argument would hurt usability. In that sense, I'm also not so fond of adding an extra parameter.
However, in the case of a single-level column index, pandas cannot correctly distinguish between the case with and without index name, that is, a CSV with or without an extra second line reserved for the row index name. Why is this ability lost?
I do not know the reason, I think investigation here is necessary. If there is a way to make this more consistent and it's backwards compatible, then I'd likely be positive on the change.
In that sense, I'm also not so fond of adding an extra parameter.
Agreed.
I looked into this for a bit and the internal logic behind
df_new = pd.read_csv("out_mh.csv", header=[0,1], index_col=[0])
that infers a third header row has at least four more minor issues related to input sanitizing in addition to being inconsistent with the non-MultiIndex case:
1. Adding a trailing comma to csv line 2 while using engine="python" does not lead to a ParserError("Header rows must have an equal number of columns.") and the csv gets parsed as before. Technically this is also a header row, but one can argue about the specialness of this row...
2. Doing the above with the c engine leads to a mangled index, with the MultiIndex being interpreted as tuples: Index([0, ('A', 'a'), ('B', 'b'), ('C', 'c')], dtype='object')
3. Doing the above with the pyarrow engine leads to a TypeError: an integer is required
4. Removing all values from line 2 leads to only the two-level MultiIndex being created when using the c and python engines, the empty inferred header row being ignored. Under pyarrow, this leads to error 3.
One of the potentially backwards compatible options for providing the index name would be to have an additional header row that has one more column than the other header rows. It would start with the index name and otherwise only contain delimiters. One could then explicitly include this additional row in the header list (in our MultiIndex example header=[0,1,2]). The extraneous column would distinguish this type of header row from one that is supposed to contain an additional MultiIndex level with generated default column names.
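For the MultiIndex example above, my reading of this proposal is a file along the following lines, which one would then read back with header=[0,1,2] and index_col=[0] (note the extra trailing delimiter in the Count row):
Name1,A,B,C
Name2,a,b,c
Count,,,,
0,1.8,0.4,1.0
1,2.2,1.9,-1.0
...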
Potentially backwards compatible since this kind of index name row is currently either
- illegal when explicitly providing this index name row (exception from issue 1.), or
- leads to a bug (2.) or an error (3.) when relying on the inferring logic with a MultiIndex
Though this solution seems hackish to me.
I would at least look into fixing the inferring logic inconsistencies 1.-4.