hvplot Handling of non-string DataFrame column names

The test suite is currently broken on master after the latest dev release of HoloViews. The breakage is due to https://github.com/holoviz/holoviews/pull/5354 that raises an error earlier than before if the column names of a DataFrame contain an integer.

It means that, in the current state, the following breaks while it used to work:

import numpy as np
import pandas as pd
import hvplot.pandas

df = pd.DataFrame(np.random.rand(10, 2))  # columns is [0, 1]
df.hvplot()

and raises:

DataError: pandas DataFrame column names used as dimensions must be strings not integers.

PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html

Pandas .plot

Pandas .plot is actually very flexible about the column names it accepts. These all work:

df.plot()
df.plot(y=1)

dft = pd.DataFrame(np.random.rand(10, 2), columns=[pd.Timestamp('2022/01/01'), pd.Timestamp('2023/01/01')])

dft.plot()
dft.plot(y=pd.Timestamp('2022/01/01'))

hvPlot, before

Things worked partially: plotting all the columns at once worked, but to select a given column you had to find out its string representation and pass that.

df.hvplot()  # Works

df.hvplot(y=1)  # Error!
df.hvplot(y='1')  # Works!

dft.hvplot()  # Works
dft.hvplot(y=pd.Timestamp('2022/01/01'))  # Error!
dft.hvplot(y='2022-01-01 00:00:00')  # Works!

hvPlot, now

All the examples above fail with a DataError. A DataError is raised at line L1226 (and also at L1224, but that one is caught) when instantiating the hv.Dataset, which is given self.source_data. If it were given self.data instead, no DataError would be raised, as the columns of self.data are converted to strings in the _transform_columnar_data method. I'm not sure why hv.Dataset is given self.source_data.

https://github.com/holoviz/hvplot/blob/6f6da2d2f39970bb3ea0731641c4038bc934ec27/hvplot/converter.py#L1210-L1226
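In the meantime, a user-side stopgap is to convert the column labels to strings before calling .hvplot. The sketch below uses only pandas. The name df_str is illustrative, and it is an assumption that this matches the conversion hvPlot applies internally.

```python
import pandas as pd

# Integer column labels, as pandas assigns them when no names are given.
df = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]])  # columns are [0, 1]

# Non-mutating rename: every column label is passed through str().
df_str = df.rename(columns=str)
print(list(df_str.columns))  # ['0', '1']

# The original frame keeps its integer labels.
print(list(df.columns))  # [0, 1]

# With the hvplot.pandas accessor loaded, one would then plot from df_str,
# e.g. df_str.hvplot(y='1') (not exercised in this sketch).
```

The drawback is the same as before the regression: the column has to be referenced by its string form rather than by its original label.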

Solutions

The current state is clearly a regression, compared to a previous state that already wasn't ideal.

HoloViews should:

1. [ ] not just check for integers before raising a DataError, as it seems that Pandas allows more than strings and integers as column names
2. maybe add support for non-string column names, if that's even technically possible

Until 2. happens, a solution should be found for hvPlot itself:

[ ] Untangle the usage of self.source_data and self.data I referred to above, to avoid the DataError when calling df.hvplot()
[ ] To allow users to reference the actual column name (e.g. df.hvplot(y=1)) it may be required to record a mapping of the original column names with their string representation.
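The mapping idea in the last item could look roughly like the sketch below. Both helpers (build_column_lookup and resolve_column) are hypothetical names, not existing hvPlot or HoloViews functions, and the sketch only handles scalar labels. It uses the standard library only; in practice the labels would come from df.columns.

```python
from datetime import datetime


def build_column_lookup(columns):
    """Map each original column label to its string representation.

    Hypothetical helper: raises if two labels share the same string form
    (e.g. 1 and '1'), since the mapping would then be ambiguous.
    """
    lookup = {}
    for col in columns:
        name = str(col)
        if name in lookup.values():
            raise ValueError(f"Column labels collide once stringified: {name!r}")
        lookup[col] = name
    return lookup


def resolve_column(label, lookup):
    """Return the string column name for a user-supplied label.

    Accepts either the original label (e.g. 1) or its string form ('1').
    """
    if label in lookup:
        return lookup[label]
    if label in lookup.values():
        return label
    raise KeyError(label)


# Integer labels, as in pd.DataFrame(np.random.rand(10, 2)).columns
lookup = build_column_lookup([0, 1])
print(resolve_column(1, lookup))    # '1'
print(resolve_column('1', lookup))  # '1'

# Datetime-like labels resolve through the same path.
ts = datetime(2022, 1, 1)
ts_lookup = build_column_lookup([ts])
print(resolve_column(ts, ts_lookup))  # '2022-01-01 00:00:00'
```

Raising on collisions is one possible design choice. Another would be to disambiguate the string names, at the cost of them no longer matching str(label).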

Any feedback on this, @jlstevens @philippjfr?

Note that it also broke the following but I expect the potential hvPlot fixes listed above to fix that too:

s = pd.Series(np.random.rand(10))
s.hvplot()

Sep 28 '22 08:09 maximlt

Does hvplot make any claims about how the .data of the elements it generates relates to the original source (e.g. when using df.hvplot)?

If no claims are made (which I believe is the case) then imho this should be fixed at the hvplot level by decoupling the input data from the data in the output (which surely can be done efficiently without copying?). As hvplot supports other data types with named dimensions (xarray) I would expect a consistent entrypoint to accessing dimension data where the necessary remapping could take place.

Sep 28 '22 09:09 jlstevens

I just noticed that we did not catch all the HoloViews FutureWarnings in #932.

import hvplot.pandas
import pandas as pd

pd.DataFrame([1, 2, 3, 4]).interactive()

/home/shh/Development/holoviz/repos/hvplot/hvplot/interactive.py:272: FutureWarning: Having a non-string as a column name in a DataFrame is deprecated and will not be supported in Holoviews version 1.16.
  ds = hv.Dataset(self._obj)

Dec 16 '22 14:12 hoxbro

Oh indeed! Thanks for reporting that!

Dec 19 '22 10:12 maximlt