.interactive dataframe slice selection by row index or position not possible

Open JanHomann opened this issue 1 year ago • 20 comments

Is your feature request related to a problem? Please describe.

I was under the assumption that the .interactive interface for dataframes should mimic the pandas interface. Yet not everything seems to work.

Given the code

import hvplot.pandas
import panel.widgets as pnw
from bokeh.sampledata import airports
df = airports.data
dfi = df.interactive()

p = pnw.IntSlider(start=10, end=40)
dfi[5:p]

I would like to see a slider and a dynamic dataframe representation showing index 5...p-1 (that dataframe has an integer index). Instead I get a static output without slider. I can make one in another cell, but when I move it, the dataframe output does not update. I need to execute the cell manually again to get the update.

This here doesn't work at all and gives an InvalidIndexError

dfi.loc[5:p,:]
# InvalidIndexError: IntSlider(end=40, start=10, value=10)

Those here give no slider and don't update when I create a slider in another cell:

dfi.loc[:p]

dfi.iloc[5:p]
dfi.iloc[slice(5,p)]
dfi.values[5:p]

This here gives a type error

dfi.loc[dfi.index[5:p]]
# TypeError: slice indices must be integers or None or have an __index__ method
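
The wording of that TypeError matches the message plain Python raises when a slice bound is neither an integer, None, nor an object with an `__index__` method. The sketch below reproduces it without pandas or Panel; `NoIndex` and `WithIndex` are hypothetical stand-ins for a widget, not real Panel classes.

```python
class NoIndex:
    """Hypothetical widget-like object that holds a value but has no __index__."""

    def __init__(self, value):
        self.value = value


class WithIndex(NoIndex):
    """Same object, but convertible to an integer through __index__."""

    def __index__(self):
        return self.value


items = list(range(20))

# A slice bound without __index__ is rejected by the list itself.
error = None
try:
    items[5:NoIndex(10)]
except TypeError as exc:
    error = str(exc)

# With __index__ the slice works, but the value is read only once, right here.
snapshot = items[5:WithIndex(10)]
```

Even if a widget defined `__index__`, the slice would be a one-time snapshot, which is why it would not by itself make the output follow the slider.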

A current workaround for dfi.loc[5:p,:] is:

dfi[(dfi.index>=5) & (dfi.index<p)]

For dfi.iloc[5:p] I have no reasonable workaround (except for resetting the index, selecting via above method and setting it again).

Neither dfi.values[5:p] nor dfi.loc[dfi.index[5:p]] works as a workaround, and I don't know what else to try. Literally the first example in the interactive tutorial (https://hvplot.holoviz.org/user_guide/Interactive.html) indexes by position in an xarray object with .isel, which I believe I have tried and which works. So it is surprising to me that there is no mechanism to select a positional range in dataframes.

JanHomann avatar Jul 07 '22 20:07 JanHomann

Here are my package versions:

hvplot                : 0.8.0
pandas                : 1.4.3
holoviews             : 1.14.9
bokeh                 : 2.4.3


Python version        : 3.9.13
IPython version       : 8.4.0
jupyter notebook      : 6.4.12
jupyterlab            : 3.4.3
    
OS                    : Darwin
Release               : 20.6.0
Browser               : Safari

JanHomann avatar Jul 07 '22 20:07 JanHomann

For .interactive to work with widgets embedded inside other objects passed into the dataframe methods, we have to do special work to create proxy objects that fetch current values from the widget before invoking the underlying object. See e.g. https://github.com/holoviz/holoviews/pull/5184#issuecomment-1019592627 , where we were discussing lists and dicts that get passed in. Here, we might need special support for a slice object?
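
As a rough illustration of the idea described here (not the actual holoviews or hvplot code), the sketch below resolves widgets nested inside lists, tuples, dicts, and slices just before the underlying object is indexed. `FakeWidget` and `resolve` are hypothetical names made up for this example.

```python
class FakeWidget:
    """Hypothetical stand-in for a Panel widget: it only holds a `value`."""

    def __init__(self, value):
        self.value = value


def resolve(obj):
    """Recursively replace FakeWidget instances with their current value."""
    if isinstance(obj, FakeWidget):
        return obj.value
    if isinstance(obj, slice):
        # The special case a slice needs: resolve start, stop and step separately.
        return slice(resolve(obj.start), resolve(obj.stop), resolve(obj.step))
    if isinstance(obj, (list, tuple)):
        return type(obj)(resolve(item) for item in obj)
    if isinstance(obj, dict):
        return {key: resolve(value) for key, value in obj.items()}
    return obj


p = FakeWidget(10)
rows = list(range(100))

key = slice(5, p)              # the object Python builds for rows[5:p]
first = rows[resolve(key)]     # evaluated with p.value == 10
p.value = 8                    # "move the slider"
second = rows[resolve(key)]    # re-evaluated with p.value == 8
nested = resolve([1, p, {"a": p}])
```

The stored `key` still contains the widget, so the same expression can be re-evaluated every time the value changes.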

jbednar avatar Jul 07 '22 20:07 jbednar

Yes, I suppose the mechanism by which this works in Pandas is that a slice object is passed to the dataframe method. Makes sense. Is there currently a workaround for positional indexing?
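
That matches how Python itself handles subscripts: `obj[5:p]` calls `obj.__getitem__(slice(5, p, None))`, and `obj[5:p, :]` passes a tuple of slices. A small probe shows this without pandas; `ShowKey` and `Placeholder` are hypothetical helper classes for this example only.

```python
class ShowKey:
    """Returns whatever key Python hands to __getitem__, unchanged."""

    def __getitem__(self, key):
        return key


class Placeholder:
    """Hypothetical stand-in for a widget; it has no integer behaviour."""


p = Placeholder()
probe = ShowKey()

plain = probe[5:p]         # a single slice object whose stop is the widget
two_axis = probe[5:p, :]   # a tuple: (slice(5, p, None), slice(None, None, None))
```

The widget ends up buried inside the slice (or inside a tuple of slices), so any code that only inspects top-level arguments would not see it.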

JanHomann avatar Jul 07 '22 21:07 JanHomann

Here is somewhat of a workaround. Unfortunately that only works when you start out with a dataframe, not with an interactive dataframe. So essentially if you know that you need positional indexing in your interactive dataframe, you have to do it right when creating it from the normal dataframe if you use this method. Passing an interactive dataframe into this mechanism (like with dfi for the df parameter) doesn't work.

end_ind = pnw.IntSlider(start=10, end=40)

def make_slice(df=df, end_ind=20):
    return df.iloc[5:end_ind]

dfi = hvplot.bind(make_slice, df=df, end_ind=end_ind).interactive(width=600)
dfi   # works as expected with dynamic slider
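
The pattern behind this workaround can be sketched in plain Python: wrap the slicing in a function and read the widget value each time the function is called. The `bind` below is a hypothetical toy written for this example, not hvplot.bind, and `FakeWidget` stands in for a real slider.

```python
class FakeWidget:
    """Hypothetical stand-in for a Panel widget: it only holds a `value`."""

    def __init__(self, value):
        self.value = value


def bind(func, **kwargs):
    """Return a zero-argument callable that reads widget values at call time."""

    def bound():
        resolved = {
            name: (arg.value if isinstance(arg, FakeWidget) else arg)
            for name, arg in kwargs.items()
        }
        return func(**resolved)

    return bound


def make_slice(rows, end_ind=20):
    # Plain positional slicing; end_ind is an ordinary int by the time we get here.
    return rows[5:end_ind]


end_ind = FakeWidget(10)
sliced = bind(make_slice, rows=list(range(100)), end_ind=end_ind)

first = sliced()       # uses end_ind.value == 10
end_ind.value = 7      # "move the slider"
second = sliced()      # uses end_ind.value == 7
```

Because the function body only ever sees resolved values, any pandas operation can be used inside it, including positional slicing.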

JanHomann avatar Jul 07 '22 22:07 JanHomann

Yep, that's a good workaround for now. I think having it "just work" is a good feature request. @Hoxbro, something you could add?

If anyone knows of any other special objects like this, it would be good to address those as well...

jbednar avatar Jul 07 '22 22:07 jbednar

I got it to work with slices, and will submit PRs in holoviews and hvplot today.

https://user-images.githubusercontent.com/19758978/177975269-75bdf24f-7fee-4f08-a17d-906b0a5cf9d6.mp4

hoxbro avatar Jul 08 '22 10:07 hoxbro

Wow, you guys are amazing. A solution in less than 24 hours?!

JanHomann avatar Jul 08 '22 19:07 JanHomann

Nice! @JanHomann, a really valuable contribution you (or other users) could make is to study the Pandas API and see if there are any other collections or special objects that would need similar treatment. I think we currently handle slices, dicts, and lists; not sure if there are special iterators or other objects that we should be looking out for...

jbednar avatar Jul 08 '22 19:07 jbednar

Thank you! Another thing I have noticed is that dataframe functions don't give back the correct type. For example:

p = pnw.IntSlider()
dfi.columns[p]

doesn't return a string (the column name). Instead it returns another interactive object, that cannot be used for example for indexing. So:

dfi[dfi.columns[p]]

doesn't work because it gets the wrong type.

JanHomann avatar Jul 08 '22 21:07 JanHomann

Here is another problem. The .query() method isn't working. Its behavior can be replicated with boolean indexing, but .query() is more concise and faster.

import hvplot.pandas
import panel.widgets as pnw
from bokeh.sampledata import antibiotics

df = antibiotics.data
dfi = df.interactive()

p = pnw.IntSlider(value=5, start=1, end=10)
dfi.query('penicillin < @p')              # TypeError  (should return all the rows where the column `penicillin` is < 5)

In this case the widget information needs to be embedded in a string object I suppose.
f-strings are another thing. This example doesn't work:

dfi.query(f'penicillin < {p}')            # TypeError 

f-string embedding would be great, because people do stuff like this for column indexing:
dfi[f'col_{p}']                           # doesn't work
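
One reason f-strings are hard to support is that Python formats them immediately, before the string ever reaches the dataframe method, so the result is an ordinary str with no link back to the widget. A plain-Python sketch, with `FakeSlider` as a hypothetical stand-in for pnw.IntSlider and `query_string` as a made-up helper:

```python
class FakeSlider:
    """Hypothetical stand-in for a widget with a value and a repr."""

    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"FakeSlider(value={self.value})"


p = FakeSlider(5)

# Formatted once, right here: the widget's repr is baked into the string.
eager = f"penicillin < {p}"

p.value = 7   # later slider changes cannot reach `eager`


def query_string():
    """Build the query from the widget's current value on every call."""
    return f"penicillin < {p.value}"


deferred = query_string()
```

Supporting this automatically would need some deferred form of the string; with the function-based workaround shown earlier in the thread, the f-string can simply be built inside the bound function.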

JanHomann avatar Jul 08 '22 21:07 JanHomann

And then there is this: Assignment fails.

[Screenshot: Screen Shot 2022-07-08 at 6 01 38 PM]

JanHomann avatar Jul 08 '22 22:07 JanHomann

Overall I totally love the holoviz stack. I think it's currently the most advanced and most user friendly stack for interactive plotting in python.

JanHomann avatar Jul 08 '22 22:07 JanHomann

Today I found some more cases where .interactive fails. The first case is lambda functions (and probably also normal functions) that some Pandas methods accept as arguments.

import numpy as np
import pandas as pd
import hvplot.pandas
import panel.widgets as pnw


df = pd.DataFrame(data=np.random.randn(10,3))
p = pnw.IntSlider(start=1, end=10, value=5)
dfi = df.interactive()

df.apply(lambda x: x*5)   # this works
dfi.apply(lambda x: x*p)  # this doesn't
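
In plain Python the difference is between multiplying by the widget object and multiplying by its value when the function is called. The sketch below uses a hypothetical `FakeSlider`; whether .interactive would re-run the pipeline when the slider moves is a separate question that this example does not answer.

```python
class FakeSlider:
    """Hypothetical stand-in for a widget: it only holds a `value`."""

    def __init__(self, value):
        self.value = value


p = FakeSlider(5)
column = [1.0, 2.0, 3.0]

# Multiplying by the widget itself fails, as in the report above.
error = None
try:
    broken = [x * p for x in column]
except TypeError as exc:
    error = str(exc)

# Reading p.value inside the function defers the lookup to call time.
scale = lambda x: x * p.value

before = [scale(x) for x in column]
p.value = 2
after = [scale(x) for x in column]
```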

JanHomann avatar Jul 09 '22 21:07 JanHomann

The second case is range objects, which can also be passed to some Pandas methods.

import numpy as np
import pandas as pd
import hvplot.pandas
import panel.widgets as pnw


df = pd.DataFrame(data=np.random.randn(10,3))
p = pnw.IntSlider(start=0, end=9, value=5)
dfi = df.interactive()

df.isin(range(1,3))        # this works (returns a boolean dataframe)
dfi.isin(range(1,p))       # this doesn't
# TypeError: 'IntSlider' object cannot be interpreted as an integer

I think many Pandas methods that natively support lists probably also support range objects.
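
Note that `range(1, p)` is evaluated by Python before `.isin` is called at all, so the error comes from `range` itself rather than from pandas or hvplot. A plain-Python sketch, with `FakeSlider` and `current_range` as hypothetical names:

```python
class FakeSlider:
    """Hypothetical stand-in for a widget: it only holds a `value`."""

    def __init__(self, value):
        self.value = value


p = FakeSlider(5)

# range() needs real integers (or objects with __index__) at construction time.
message = None
try:
    range(1, p)
except TypeError as exc:
    message = str(exc)


def current_range():
    """Build the range from the widget's current value on every call."""
    return range(1, p.value)


before = list(current_range())
p.value = 3
after = list(current_range())
```

So supporting ranges would need either a deferred range-like object or constructing the range inside a bound function.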

JanHomann avatar Jul 09 '22 21:07 JanHomann

Generator objects can also be passed to some Pandas methods (probably again many Pandas methods that take a list).

import numpy as np
import pandas as pd
import hvplot.pandas
import panel.widgets as pnw

df = pd.DataFrame(data=np.random.randn(20,3))
dfi = df.interactive()
p = pnw.IntSlider(start=1, end=4, value=2)
g = (n**2 for n in range(5))
g1 = (n**2 + p for n in range(5))               # panel widget IntSlider is embedded in a generator

df.loc[g,:]             # this works
dfi.loc[g1,:]           # this one doesn't:    
# TypeError: unsupported operand type(s) for +: 'int' and 'IntSlider'
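
Generators add a second difficulty beyond the widget arithmetic: they can be consumed only once, so even a generator that reads the widget correctly cannot be replayed when the slider moves. A plain-Python sketch, with `FakeSlider` and `squares_plus_offset` as hypothetical names:

```python
class FakeSlider:
    """Hypothetical stand-in for a widget: it only holds a `value`."""

    def __init__(self, value):
        self.value = value


p = FakeSlider(2)


def squares_plus_offset():
    """Return a fresh generator that reads the widget value while iterating."""
    return (n**2 + p.value for n in range(5))


g = squares_plus_offset()
first_pass = list(g)     # consumes the generator
second_pass = list(g)    # the same generator is now exhausted

p.value = 10
fresh = list(squares_plus_offset())   # a new generator sees the new value
```

A factory function that creates a new generator per evaluation avoids the exhaustion problem.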

JanHomann avatar Jul 09 '22 21:07 JanHomann

So there are:

  • f-strings
  • normal strings with @parameter for .query and .eval
  • range objects
  • (lambda) functions
  • generators

And then there is the problem of passing the output of an interactive method back into an interactive dataframe. That is a reasonable thing to do with dataframes, for example in the case of df[df.columns[p]].

Currently it seems the output of an interactive dataframe is another interactive dataframe, but many dataframe methods don't return a dataframe, but a string or a tuple or a Series, all of which then can be used as a parameter for another dataframe method, for example when doing filtering or grouping.
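
To illustrate why every result stays an interactive object, here is a toy lazy pipeline in plain Python. It records attribute and item access instead of performing it, and resolves nested lazy keys and widgets only when evaluated. `Lazy`, `FakeWidget`, `Table`, and `current` are all hypothetical names for this sketch and do not reflect how hvplot is actually implemented.

```python
class FakeWidget:
    """Hypothetical stand-in for a widget: it only holds a `value`."""

    def __init__(self, value):
        self.value = value


def current(obj):
    """Evaluate lazy expressions and widgets; pass everything else through."""
    if isinstance(obj, Lazy):
        return obj.evaluate()
    if isinstance(obj, FakeWidget):
        return obj.value
    return obj


class Lazy:
    """Records attribute and item access instead of performing it."""

    def __init__(self, root, ops=()):
        self._root = root
        self._ops = ops

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        return Lazy(self._root, self._ops + (("attr", name),))

    def __getitem__(self, key):
        return Lazy(self._root, self._ops + (("item", key),))

    def evaluate(self):
        obj = self._root
        for kind, arg in self._ops:
            obj = getattr(obj, arg) if kind == "attr" else obj[current(arg)]
        return obj


class Table:
    """Tiny stand-in for a DataFrame: named columns of values."""

    def __init__(self, data):
        self.data = data

    @property
    def columns(self):
        return list(self.data)

    def __getitem__(self, name):
        return self.data[name]


table = Table({"a": [1, 2], "b": [3, 4]})
p = FakeWidget(0)
lazy = Lazy(table)

name = lazy.columns[p]    # still a Lazy object, not a str
column = lazy[name]       # the nested Lazy key is resolved at evaluation time

first = column.evaluate()
p.value = 1
second = column.evaluate()
```

In this toy version, feeding one lazy result into another works only because `evaluate` explicitly resolves lazy keys; the real pipeline would need comparable handling.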

JanHomann avatar Jul 09 '22 22:07 JanHomann

Another thing I just checked that doesn't work is putting a slider inside a pd.Timestamp object, which Pandas also accepts as input for a computation.

from bokeh.sampledata import daylight
df = daylight.daylight_warsaw_2013
dfi = df.interactive()
p = pnw.IntSlider(start=1, end=28, value=10)

df[df.Date < pd.Timestamp(year=2013, month=6, day=5)]     # This works
dfi[dfi.Date < pd.Timestamp(year=2013, month=6, day=p)]   # this one doesn't
#TypeError: an integer is required (got type IntSlider)

Probably a slider in a pd.Timedelta object wouldn't work either, but I haven't checked that.

JanHomann avatar Jul 09 '22 22:07 JanHomann

A few Pandas methods can also work with a pd.Interval object as input. For example, if you have a pd.IntervalIndex, you can filter it with a pd.Interval object. Pandas has rather mediocre support for intervals, but pd.cut() and pd.qcut() (used for binning continuous data), for example, return an interval for each binned piece of data, which can then be processed further with .groupby() to get statistics on those bins. .groupby() can operate on intervals.

df = pd.DataFrame(data=np.random.randn(6,3), 
                  index=pd.IntervalIndex.from_tuples( [ (-1.5,1), (-1,-0.5), (-0.5,0), (0,0.5), (0.5,1), (1,1.5) ] ))
dfi = df.interactive()
p = pnw.IntSlider(start=1, end=3, value=2)

df.index.overlaps(pd.Interval(0.2,1))    # this one works (returns a 1d boolean array)
dfi.index.overlaps(pd.Interval(0.2,p))   # this one doesn't 
# ValueError: Only numeric, Timestamp and Timedelta endpoints are allowed when constructing an Interval.

If the index of df is a pd.IntervalIndex, then this here works for dataframes:

df.index.isin([5])    # returns a boolean array with elements true where 5 overlaps with the intervals in the index
dfi.index.isin([p])   # this actually works. returns a working slider and a boolean array in the same way as the static version
dfi.loc[p,:]          # this here works too, nice. It returns the row where the interval index overlaps with the value of the slider p.

This here fails.

df.index.contains(1)      # returns a 1d boolean array
dfi.index.contains(p)     # does not work
# AttributeError: 'Interactive' object has no attribute 'contains'

Plotting with an interval index also fails.

df.plot()       # works with an interval index
df.hvplot()     # type error  (this is a normal dataframe and not an interactive one)
# TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

JanHomann avatar Jul 09 '22 23:07 JanHomann

So in summary, currently unsupported as parameters are:

  • f-strings
  • embedding variables with @ for .query and .eval
  • range objects
  • (lambda) functions
  • generators
  • pd.Timestamp
  • pd.Timedelta
  • pd.Interval
  • item assignment
  • feeding the output of an interactive dataframe method into another interactive dataframe method

JanHomann avatar Jul 11 '22 05:07 JanHomann

Great work @JanHomann! I will look into what is possible to add to interactive.

hoxbro avatar Jul 11 '22 11:07 hoxbro

@Hoxbro The original problem seems solved now. So should this stay open? Or should the title be renamed, because we figured that there are other DataFrame parameter types that are currently not supported by interactive dataframes?

JanHomann avatar Jul 10 '23 20:07 JanHomann

Let's keep this open with a title rename.

We have begun rewriting .interactive to be more generic. When that is in place, we will revisit the suggestions made in this thread.

hoxbro avatar Jul 11 '23 07:07 hoxbro