Handling of non-string DataFrame column names
The test suite is currently broken on master after the latest dev release of HoloViews. The breakage is due to https://github.com/holoviz/holoviews/pull/5354 that raises an error earlier than before if the column names of a DataFrame contain an integer.
It means that, in the current state, the following breaks while it used to work:
import numpy as np; import pandas as pd
import hvplot.pandas
df = pd.DataFrame(np.random.rand(10, 2)) # columns is [0, 1]
df.hvplot()
and raises:
DataError: pandas DataFrame column names used as dimensions must be strings not integers.
PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html
Pandas .plot
Pandas .plot is actually very flexible about the column names it accepts; these all work:
df.plot()
df.plot(y=1)
dft = pd.DataFrame(np.random.rand(10, 2), columns=[pd.Timestamp('2022/01/01'), pd.Timestamp('2023/01/01')])
dft.plot()
dft.plot(y=pd.Timestamp('2022/01/01'))
hvPlot, before
Things worked partially: plotting all the columns at once worked, but to select a given column you had to find out its string representation.
df.hvplot() # Works
df.hvplot(y=1) # Error!
df.hvplot(y='1') # Works!
dft.hvplot() # Works
dft.hvplot(y=pd.Timestamp('2022/01/01')) # Error!
dft.hvplot(y='2022-01-01 00:00:00') # Works!
hvPlot, now
All the examples above fail with a DataError. The DataError is raised at line L1226 of the snippet linked below (also at L1224, but there it is caught) when instantiating the hv.Dataset, which is given self.source_data. If it were given self.data instead, no DataError would be raised, as the columns of self.data are converted to strings in the _transform_columnar_data method. I'm not sure why hv.Dataset is given self.source_data.
https://github.com/holoviz/hvplot/blob/6f6da2d2f39970bb3ea0731641c4038bc934ec27/hvplot/converter.py#L1210-L1226
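As a user-side stopgap, the columns can be converted to strings before plotting, which mirrors the string conversion described above for self.data. This is only a sketch: the helper name with_string_columns is made up for illustration, and the commented hvplot call assumes hvplot.pandas has been imported.

```python
import pandas as pd


def with_string_columns(df):
    """Return a new DataFrame whose column labels are str(label).

    Hypothetical helper, not part of hvPlot or HoloViews.
    """
    return df.rename(columns=str)


df = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]])  # columns are the integers 0 and 1
dft = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    columns=[pd.Timestamp('2022/01/01'), pd.Timestamp('2023/01/01')],
)

df_str = with_string_columns(df)
dft_str = with_string_columns(dft)

# The original frames are left untouched; only the returned frames are renamed.
# With hvplot.pandas imported, one would then call e.g. df_str.hvplot(y='1').
```

The cost is that columns must then be referenced by their string form, which is the same limitation as the earlier behaviour.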
Solutions
The current state is clearly a regression compared to a previous state that already wasn't ideal.
HoloViews should:
- [ ] not just check for integers before raising a DataError, as it seems that Pandas allows more than strings and integers as column names
- [ ] maybe add support for non-string column names, if that's even technically possible
Until 2. happens, a solution should be found for hvPlot itself:
- [ ] Untangle the usage of self.source_data and self.data I referred to above, to avoid the DataError when calling df.hvplot()
- [ ] To allow users to reference the actual column name (e.g. df.hvplot(y=1)) it may be required to record a mapping of the original column names with their string representation.
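The mapping idea in the last item could look roughly like the sketch below. The function names build_column_mapping and resolve_column are hypothetical and this is not how hvPlot currently works; it only illustrates recording original labels next to their string forms and resolving a user-supplied y against them. Pure Python is used so it runs without pandas; with a DataFrame one would pass df.columns.

```python
from datetime import datetime


def build_column_mapping(columns):
    """Map each original column label to its string representation.

    Raises ValueError when the string forms would be ambiguous, e.g. for
    the labels 1 and '1', which both become '1'.
    """
    mapping = {}
    seen = set()
    for label in columns:
        if label in mapping:
            raise ValueError(f"duplicate column label: {label!r}")
        as_str = str(label)
        if as_str in seen:
            raise ValueError(f"several labels share the string form {as_str!r}")
        seen.add(as_str)
        mapping[label] = as_str
    return mapping


def resolve_column(name, mapping):
    """Translate a user-supplied column reference into its string name."""
    if name in mapping:            # original label, e.g. y=1
        return mapping[name]
    if name in mapping.values():   # already the string form, e.g. y='1'
        return name
    raise KeyError(name)


mapping = build_column_mapping([0, 1])
y_from_int = resolve_column(1, mapping)
y_from_str = resolve_column('1', mapping)

ts_mapping = build_column_mapping([datetime(2022, 1, 1), datetime(2023, 1, 1)])
y_from_ts = resolve_column(datetime(2022, 1, 1), ts_mapping)
```

Accepting both the original label and its string form keeps calls like df.hvplot(y='1') working, while the collision check guards against labels that cannot be told apart once converted to strings.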
Any feedback on this, @jlstevens @philippjfr?
Note that it also broke the following but I expect the potential hvPlot fixes listed above to fix that too:
s = pd.Series(np.random.rand(10))
s.hvplot()
Does hvplot make any claims about how the .data of the elements it generates relates to the original source (e.g. when using df.hvplot)?
If no claims are made (which I believe is the case) then imho this should be fixed at the hvplot level by decoupling the input data from the data in the output (which surely can be done efficiently without copying?). As hvplot supports other data types with named dimensions (xarray) I would expect a consistent entrypoint to accessing dimension data where the necessary remapping could take place.
I just noticed that we did not get all the HoloViews FutureWarnings in #932.
import hvplot.pandas
import pandas as pd
pd.DataFrame([1, 2, 3, 4]).interactive()
/home/shh/Development/holoviz/repos/hvplot/hvplot/interactive.py:272: FutureWarning: Having a non-string as a column name in a DataFrame is deprecated and will not be supported in Holoviews version 1.16.
ds = hv.Dataset(self._obj)
Oh indeed! Thanks for reporting that!