pandas icon indicating copy to clipboard operation
pandas copied to clipboard

BUG: Inconsistent handling of named column indices in to_csv() and read_csv().

Open normanius opened this issue 1 year ago • 1 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of pandas.
  • [X] I have confirmed this bug exists on the main branch of pandas.

Issue Description

Apparently, it is not possible to read in a .csv table with named column index AND named row index.

How to reproduce:

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), 
                  columns=["A", "B", "C"]).round(1)
df.index.name = "Count"
df.columns.name = "Name"

# INCONSISTENCY 1: 
# .to_csv() only saves the row index name, it ignores the column index name.
df.to_csv("out.csv")

# The work-around: convert to column index to MultiIndex
df.columns = pd.MultiIndex.from_arrays([df.columns])

# Now, .to_csv() stores both column and index names.
df.to_csv("out.csv")

This is how the file looks like:

Name,A,B,C
Count,,,
0,1.8,0.4,1.0
1,2.2,1.9,-1.0
2,1.0,-0.2,-0.1
3,0.4,0.1,1.5
4,0.8,0.1,0.4

However, when reading in, I found no way to preserve both the index name of row and column indices.

df_new = pd.read_csv("out.csv")
print(df_new)
# ==> Line 1 is a row instead of column header

df_new = pd.read_csv("out.csv", header=[0])
print(df_new)
# ==> Line 1 is a row instead of column header

df_new = pd.read_csv("out.csv", header=[0, 1])
print(df_new)
# ==> Creates a two-level multi-index column header

df_new = pd.read_csv("out.csv", header=[0, 1], index_col=[0])
print(df_new)
# ==> Creates a two-level multi-index column header

df_new = pd.read_csv("out.csv", header=[1], index_col=[0])
print(df_new)
# ==> Correctly create the row index, but loses the column info

df_new = pd.read_csv("out.csv", header=[1], index_col=[0])
print(df_new)
# ==> Correctly create the row index, but loses the column info

I think this behavior is inconsistent, as switching to a multi-index with two or more levels works as expected:

# INCONSISTENCY 2:
# A multi-index column header with >1 levels is handled correctly
df_mh = df.copy()
l1 = list("ABC")
l2 = list("abc")
df_mh.columns = pd.MultiIndex.from_arrays([l1, l2])
df_mh.columns.names = ["Name1", "Name2"]
df_mh.to_csv("out_mh.csv")
df_new = pd.read_csv("out_mh.csv", header=[0,1], index_col=[0])

This reads in the CSV correctly.

The following SO posts are related:

  • https://stackoverflow.com/questions/35047842
  • https://stackoverflow.com/questions/67499614

Expected Behavior

In summary, I identified the following inconsistencies:

  1. For a simple pd.Index, the row index name is preserved using to_csv(). However, the column index name is not preserved. This is unexpected, and at least could be documented
  2. For the corner case with a pd.MultiIndex using one level, pd.read_csv() fails to read the multi-header back in.

I know, it's a corner case. However, the fact that the column index can have a name, and the fact that column and header are both represented using the same object, it is somewhat unexpected that those inconsistencies occur.

Installed Versions

INSTALLED VERSIONS

commit : fd3f57170aa1af588ba877e8e28c158a20a4886d python : 3.10.5.final.0 python-bits : 64 OS : Darwin OS-release : 21.6.0 Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:56 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T6000 machine : arm64 processor : arm byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : None.UTF-8

pandas : 2.2.0 numpy : 1.23.1 pytz : 2022.1 dateutil : 2.8.2 setuptools : 58.1.0 pip : 24.0 Cython : 0.29.30 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.2 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.14.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : None gcsfs : None matplotlib : 3.7.1 numba : None numexpr : 2.9.0 odfpy : None openpyxl : None pandas_gbq : None pyarrow : 15.0.0 pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None

normanius avatar Feb 16 '24 13:02 normanius

Thanks for the report. Is there a format we could use that would be backwards compatible and allow for storing the column index name? If not, I do not think we should make a backwards incompatible change to support it. But we can document that it will get ignored.

As an aside, the CSV format is a good way to store simple tabular data in readable plaintext, but it is not a good format for round-tripping. If you are looking for round-tripping, I would suggest parquet.

rhshadrach avatar Feb 16 '24 21:02 rhshadrach

Thank you for the answer. Yes, I totally agree with your point that CSV is a limited format and that Parquet and perhaps HDF5 are more suitable. It still occurs to me that when working with technically less skilled persons, CSV is hard to avoid.

I don't exactly get the rational for the following behavior:

# Two-level header
df1 = pd.read_csv("data.csv", header=[0,1], index_col=[0])
# One-level header
df2 = pd.read_csv("data.csv", header=[0], index_col=[0])

df1 is read as expected, with and without named column-levels, and also if the row index is not named. Note that depending on the presence of a row index name, read_csv() consumes either two or three lines before reading in the row data. So read_csv() can choose correctly based on some inner logic which case applies. For this special logic to be triggered, one has to provide both the arguments header=[0,1] and index_col=[0]. In this sense, the special reading logic is explicitly requested.

However, in the case of a single-level column index, pandas cannot correctly distinguish between the case with and without index name, that is, a CSV with or without an extra second line reserved for the row index name. Why is this ability lost?

The backwards compatibility requirement could be circumvented by using an extra flag. However, read_csv() already has a lot of parameters, so that introducing another argument affects the usability. In that sense, I'm also not so fond of adding an extra parameter.

normanius avatar Feb 22 '24 00:02 normanius

However, in the case of a single-level column index, pandas cannot correctly distinguish between the case with and without index name, that is, a CSV with or without an extra second line reserved for the row index name. Why is this ability lost?

I do not know the reason, I think investigation here is necessary. If there is a way to make this more consistent and it's backwards compatible, then I'd likely be positive on the change.

In that sense, I'm also not so fond of adding an extra parameter.

Agreed.

rhshadrach avatar Feb 23 '24 19:02 rhshadrach

I looked into this for a bit and the internal logic behind df_new = pd.read_csv("out_mh.csv", header=[0,1], index_col=[0]) that infers a third header row has at least four more minor issues related to input sanitizing in addition to being inconsistent with the non-MultiIndex case:

  1. Adding a trailing comma to csv line 2 while using engine=python does not lead to a ParserError("Header rows must have an equal number of columns.") and the csv gets parsed as before. Technically this is also a header, but one can argue the specialness of this row...
  2. Doing the above for the c engine leads to a mangled index, with the MultiIndex being interpreted as tuples Index([0, ('A', 'a'), ('B', 'b'), ('C', 'c')], dtype='object')
  3. Doing the above for the pyarrow engine leads to a TypeError: an integer is required
  4. Removing all values from line 2 leads to only the MultiIndex with two levels being created when using the c and python engine, the empty inferred header row being ignored. Under pyarrow, this leads to exception 3.

One of the potentially backwards compatible options for providing the index name would be to have an additional header row that has one more column than the other header rows. It would start with the index name and otherwise only contain delimiters. One could then explicitly include this additional row in the header list (in our MultiIndex example header=[0,1,2]). The extraneous column would distinguish this type of header row from one that is supposed to contain an additional MultiIndex level with generated default column names.

Potentially backwards compatible since this kind of index name row is currently either

  • illegal when explicitly providing this index name row (exception from issue 1.), or
  • leads to a bug (2.) or an error (3.) when depending on the inferring logic when using a MultiIndex

Though this solution seems hackish to me.

I would at least look into fixing the inferring logic inconsistencies 1.-4.

dontgoto avatar Feb 24 '24 01:02 dontgoto