`TypeError` if DataFrame contains duplicated column name (in some cases)
This is kind of an edge case, but the error message makes it difficult to identify the underlying issue. If you have a pandas DataFrame with duplicated column names, and the duplicated columns are not of integer type, you get an exception when trying to plot anything. MWE:
import io
import altair as alt
import pandas as pd
df = pd.read_csv(io.StringIO("""
a, b, c, d
0, 1, 2, 2022-01-01
2, 3, 4, 2022-01-01
"""))
df.columns = ['a', 'b', 'c', 'c']
alt.Chart(df).mark_point().encode(x='a', y='b')
results in
TypeError: to_list_if_array() got an unexpected keyword argument 'convert_dtype'
Note that
- the duplicated columns are not used in plotting.
- if both duplicated columns are of type integer, then you would just get a warning. But with most other types (including floats) it would generate an exception.
- besides explicitly renaming the columns like this, another scenario where you'd accidentally generate duplicated column names is calling `toPandas()` after joining two PySpark DataFrames.
Using altair 4.2.0
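One way to confirm that duplicated column names are the cause, before handing the frame to Altair, is to inspect `df.columns.duplicated()`. The following is a minimal sketch using only pandas. The sample frame mirrors the MWE, and keeping the first occurrence of each name is an assumption that may not suit every dataset.

```python
import pandas as pd

# Small frame with a duplicated column name, mirroring the MWE above.
df = pd.DataFrame([[0, 1, 2, "2022-01-01"], [2, 3, 4, "2022-01-01"]])
df.columns = ["a", "b", "c", "c"]

# Boolean mask marking every repeat of an already-seen column name.
dup_mask = df.columns.duplicated()
duplicated_names = df.columns[dup_mask].unique().tolist()

# Assumption: keeping the first occurrence of each name is acceptable.
deduped = df.loc[:, ~dup_mask]
```

If `duplicated_names` is non-empty, renaming or dropping those columns before calling `alt.Chart` should avoid the confusing `TypeError`.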
My sense is that this issue should be handled by pandas and that it should not be possible to create a dataframe where two columns have the same name. Have you raised this on their issue tracker?
I think it would be nice to raise a more informative error. It came up for me too.
In my case, I had a big dataframe that causes some encoding errors if I dump the whole dataframe in. So I made a list of the subset of columns that I wanted to send to Altair. However, if this list is long or generated by complex logic, then it is easy to mistakenly include one column name twice.
A cartoon of my workflow is below. In the real case I kept about 10 of 300 columns when sending the data to Altair, and my list of ~10 contained a duplicate:
import altair as alt
import pandas as pd
df = pd.DataFrame({'created_at':['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05']})
df["y"] = range(1, len(df)+1, 1)
keep_cols = ['created_at', 'created_at', 'y'] # <-- the logic that created this list led to a duplicate column name.
alt.Chart(df[keep_cols]).mark_circle(size=60).encode(
x="created_at:T",
y="y",
)
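For a column list built by complex logic like this, one possible guard is to remove repeated names before subsetting. The sketch below assumes that keeping the first occurrence of each name is what you want; `unique_cols` and `subset` are illustrative names, not part of the original workflow.

```python
import pandas as pd

df = pd.DataFrame({"created_at": ["2022-01-01", "2022-01-02", "2022-01-03"]})
df["y"] = range(1, len(df) + 1)

# List produced by some upstream logic, accidentally containing a repeat.
keep_cols = ["created_at", "created_at", "y"]

# dict.fromkeys keeps the first occurrence of each name and preserves order.
unique_cols = list(dict.fromkeys(keep_cols))
subset = df[unique_cols]
```

Passing `subset` instead of `df[keep_cols]` to `alt.Chart` gives Altair a frame with unique column names.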
Maybe we could introduce something like `df.flags.allows_duplicate_labels = False` (docs) in `sanitize_dataframe`, which is where this error is raised. I wonder why this isn't the default in pandas already, so I opened https://github.com/pandas-dev/pandas/issues/53217
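To illustrate the flag mentioned above: in pandas versions that provide `DataFrame.flags` (1.2 and later, as far as I know), setting `allows_duplicate_labels = False` on a frame that already has duplicated labels raises `pandas.errors.DuplicateLabelError`. This is only a sketch of the pandas behaviour, not of how Altair does or would implement the check.

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

# Frame with a duplicated column name, as in the examples in this thread.
df = pd.DataFrame([[0, 1, 2, 3]], columns=["a", "b", "c", "c"])

try:
    # Setting the flag validates the existing labels immediately.
    df.flags.allows_duplicate_labels = False
    error_message = None
except DuplicateLabelError as exc:
    # The message names the offending labels, unlike the TypeError above.
    error_message = str(exc)
```

The resulting message points at the duplicated labels directly, which is the kind of information the current `TypeError` lacks.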