`TypeError` if DataFrame contains duplicated column name (in some cases)

Open saiwing-yeung opened this issue 2 years ago • 3 comments

This is an edge case, but the error message makes it difficult to identify the underlying issue. If you have a pandas DataFrame with duplicated column names, and the duplicated columns are not of integer type, you get an exception when trying to plot anything. MWE:

import io

import altair as alt
import pandas as pd

df = pd.read_csv(io.StringIO("""
a, b, c, d
0, 1, 2, 2022-01-01
2, 3, 4, 2022-01-01
"""))
# Rename so that the last two columns share the name 'c'
df.columns = ['a', 'b', 'c', 'c']
alt.Chart(df).mark_point().encode(x='a', y='b')

results in

TypeError: to_list_if_array() got an unexpected keyword argument 'convert_dtype'

Note that

  • the duplicated columns are not used in plotting.
  • if both duplicated columns are of type integer, then you would just get a warning. But with most other types (including floats) it would generate an exception.
  • besides explicitly renaming the columns like this, another scenario where you might accidentally generate duplicated column names is calling toPandas() after joining two PySpark DataFrames.
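
Until the error message improves, one way to spot the problem is to check the column index before handing the DataFrame to Altair. The following is a minimal sketch using only pandas; the variable names (`dup_mask`, `duplicated_names`, `deduped`) are illustrative and not part of Altair, and dropping the later occurrence of a duplicated column is an assumption that may not suit every dataset.

```python
import pandas as pd

# Same shape as the MWE above: the last two columns share the name 'c'
df = pd.DataFrame([[0, 1, 2, "2022-01-01"], [2, 3, 4, "2022-01-01"]])
df.columns = ["a", "b", "c", "c"]

# Boolean mask that is True for the second and later occurrences of a name
dup_mask = df.columns.duplicated()

# Names that occur more than once
duplicated_names = df.columns[dup_mask].unique().tolist()

# Keep only the first occurrence of each column name
deduped = df.loc[:, ~dup_mask]
```

Here `duplicated_names` lists the offending names, and `deduped` has unique column names that can be passed to `alt.Chart`.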

Using altair 4.2.0

saiwing-yeung avatar Nov 16 '22 19:11 saiwing-yeung

My sense is that this issue should be handled by pandas and that it should not be possible to create a dataframe where two columns have the same name. Have you raised this on their issue tracker?

joelostblom avatar Feb 06 '23 00:02 joelostblom

I think it would be nice to raise a more informative error. It came up for me too.

In my case, I had a big dataframe that causes some encoding errors if I dump the whole dataframe in. So I made a list of the subset of columns that I wanted to send to Altair. However, if this list is long or generated by complex logic, then it is easy to mistakenly include one column name twice.

A cartoon of my workflow is roughly as follows, except that I kept about 10 of 300 columns when sending the data to Altair, and my list of ~10 contained a duplicate:

import altair as alt
import pandas as pd

df = pd.DataFrame({'created_at':['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05']})
df["y"] = range(1, len(df)+1, 1)

keep_cols = ['created_at', 'created_at', 'y']  # <-- the logic that created this list led to a duplicate column name.

alt.Chart(df[keep_cols]).mark_circle(size=60).encode(
    x="created_at:T",
    y="y",
)
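
A sketch of what a more informative check could look like on the user side, assuming only pandas. The helper `check_unique_columns` is hypothetical (it is not an Altair or pandas function), and `dict.fromkeys` is used here simply as one order-preserving way to drop repeated names from a column list.

```python
import pandas as pd


def check_unique_columns(df):
    """Hypothetical helper: raise a readable error if column names repeat."""
    dupes = df.columns[df.columns.duplicated()].unique().tolist()
    if dupes:
        raise ValueError(f"DataFrame has duplicated column names: {dupes}")
    return df


df = pd.DataFrame({"created_at": ["2022-01-01", "2022-01-02"], "y": [1, 2]})
keep_cols = ["created_at", "created_at", "y"]

# Drop repeated names from the selection list while preserving order
unique_cols = list(dict.fromkeys(keep_cols))
```

Calling `check_unique_columns(df[keep_cols])` raises a `ValueError` that names the duplicated column, while `check_unique_columns(df[unique_cols])` returns the DataFrame unchanged.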

JanetMatsen avatar May 04 '23 17:05 JanetMatsen

Maybe we could introduce something like df.flags.allows_duplicate_labels = False (docs) in sanitize_dataframe, which is where this error is raised. I wonder why this isn't the default in pandas already, so I opened https://github.com/pandas-dev/pandas/issues/53217
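
For illustration, here is a minimal sketch of how that flag behaves on a DataFrame that already has duplicated column names. It assumes pandas 1.2 or later, where `DataFrame.flags` and `pandas.errors.DuplicateLabelError` are available; it shows the pandas behaviour only and is not how Altair currently handles the input.

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

# The last two columns share the name 'c'
df = pd.DataFrame([[0, 1, 2, 3]], columns=["a", "b", "c", "c"])

try:
    # Disallowing duplicate labels validates the existing axes immediately
    df.flags.allows_duplicate_labels = False
    message = None
except DuplicateLabelError as exc:
    # pandas reports the duplicated labels in the exception message
    message = str(exc)
```

The resulting `DuplicateLabelError` points at duplicated labels directly, which is easier to act on than the `TypeError` reported above.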

joelostblom avatar May 13 '23 20:05 joelostblom