altair `TypeError` if DataFrame contains duplicated column name (in some cases)

Open saiwing-yeung opened this issue 2 years ago • 3 comments

This is an edge case, but the error message makes it hard to identify the underlying issue. If you have a Pandas DataFrame with duplicated column names, and the duplicated columns are not of integer type, you get an exception when trying to plot something. MWE:

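import io

import altair as alt
import pandas as pd

df = pd.read_csv(io.StringIO("""
a, b, c, d
0, 1, 2, 2022-01-01
2, 3, 4, 2022-01-01
"""))
# Rename so that two columns share the name 'c'
df.columns = ['a', 'b', 'c', 'c']
# The duplicated columns are not used in the encoding, yet this raises
alt.Chart(df).mark_point().encode(x='a', y='b')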

results in

TypeError: to_list_if_array() got an unexpected keyword argument 'convert_dtype'

Note that:

- the duplicated columns are not used in plotting.
- if both duplicated columns are of type integer, you just get a warning. With most other types (including floats), an exception is raised.
- besides explicitly renaming the columns like this, another scenario where you can accidentally generate duplicated column names is calling toPandas() after joining two PySpark DataFrames.
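As a stopgap until the error message improves, the duplicated names can be detected before the DataFrame reaches Altair. This is only a minimal sketch using pandas' `Index.duplicated()`; the helper name `duplicated_columns` is hypothetical and not part of pandas or Altair.

```python
import pandas as pd


def duplicated_columns(df: pd.DataFrame) -> list:
    """Hypothetical helper: return column labels that occur more than once."""
    cols = df.columns
    # Index.duplicated() marks every occurrence after the first one
    return list(cols[cols.duplicated()].unique())


df = pd.DataFrame([[0, 1, 2, "2022-01-01"], [2, 3, 4, "2022-01-01"]])
df.columns = ["a", "b", "c", "c"]

dupes = duplicated_columns(df)
if dupes:
    print(f"Duplicated column names: {dupes}")
    # One possible workaround: keep only the first column for each label
    df_unique = df.loc[:, ~df.columns.duplicated()]
```

Whether dropping the later duplicates is the right fix depends on the data; after a join, renaming the columns may be the safer choice.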

Using altair 4.2.0

Nov 16 '22 19:11 saiwing-yeung

My sense is that this issue should be handled by pandas and that it should not be possible to create a dataframe where two columns have the same name. Have you raised this on their issue tracker?

Feb 06 '23 00:02 joelostblom

I think it would be nice to raise a more informative error. It came up for me too.

In my case, I had a big dataframe that caused some encoding errors if I dumped the whole dataframe in. So I made a list of the subset of columns that I wanted to send to Altair. However, if this list is long or generated by complex logic, it is easy to mistakenly include one column name twice.

A cartoon of my workflow is below. In reality I kept about 10 of 300 columns when sending the data to Altair, and my list of ~10 had a duplicate:

import altair as alt
import pandas as pd

df = pd.DataFrame({'created_at':['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05']})
df["y"] = range(1, len(df)+1, 1)

keep_cols = ['created_at', 'created_at', 'y']  # <-- the logic that created this list led to a duplicate column name.

alt.Chart(df[keep_cols]).mark_circle(size=60).encode(
    x="created_at:T",
    y="y",
)
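One way to guard against this on the user side is to de-duplicate the column list before indexing. The sketch below assumes the same `keep_cols` list as above and relies on `dict.fromkeys` preserving insertion order; the name `unique_cols` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"created_at": ["2022-01-01", "2022-01-02", "2022-01-03"]})
df["y"] = range(1, len(df) + 1)

keep_cols = ["created_at", "created_at", "y"]

# dict keys are unique and keep insertion order, so this drops repeats
# while preserving the original column order
unique_cols = list(dict.fromkeys(keep_cols))

subset = df[unique_cols]
print(subset.columns.is_unique)
```

The resulting `subset` has unique column names, so it can be passed to `alt.Chart` in place of `df[keep_cols]`.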

May 04 '23 17:05 JanetMatsen

Maybe we could introduce something like df.flags.allows_duplicate_labels = False (docs) in sanitize_dataframe, which is where this error is raised. But I wonder why this isn't the default in pandas already, so I opened https://github.com/pandas-dev/pandas/issues/53217
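For reference, here is a rough sketch of what that flag does on the pandas side, assuming a pandas version that supports `DataFrame.flags` and `DataFrame.set_flags`. It only illustrates the pandas behavior and is not how Altair currently handles the DataFrame.

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

df = pd.DataFrame([[0, 1, 2, 3]], columns=["a", "b", "c", "c"])

raised = False
try:
    # Disallowing duplicate labels on a frame that already has them raises
    df.set_flags(allows_duplicate_labels=False)
except DuplicateLabelError as err:
    raised = True
    print(f"pandas rejected the frame: {err}")
```

An error like this names the duplicated labels directly, which would be more informative than the `convert_dtype` TypeError.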

May 13 '23 20:05 joelostblom
