accept duckdb for param.DataFrame

Open matipos2 opened this issue 10 months ago • 11 comments

Is your feature request related to a problem? Please describe.

I'm building an app with Panel. The data for the app comes from DuckDB. I'm implementing a DataStore class whose data attribute is declared as data = param.DataFrame(). The problem is that param.DataFrame accepts only pandas.DataFrame, whereas with DuckDB I have a DuckDBPyRelation (it could also be a DuckDBPyConnection).

Describe the solution you'd like

I would like it to work as it does in Panel and hvPlot, so that I can use the same supported source types with Param as in the other parts of the HoloViz ecosystem.

Describe alternatives you've considered

You can always convert the data object into pandas, but that's not always what you want. What if you have a huge data volume and don't want to load all of it into memory?

Additional context

It would be nice to have full consistency between Param, Panel, and hvPlot. If I can write anydf.hvplot.any_plot(..), why can't I write data = param.DataFrame(anydf)?

matipos2 avatar Feb 28 '25 17:02 matipos2

Yes, we should support Polars, Modin, Dask, and other DataFrame types here as well. The main thing tying it to Pandas is:

        from pandas import DataFrame as pdDFrame
        super().__init__(default=default, class_=pdDFrame, **params)

Not sure how best to extend this to handle multiple types. It also may or may not be important for any particular use case to be able to specify which class is expected; sometimes any DataFrame will work, and sometimes a user will want to ensure that only the type they know how to handle will be accepted. These are solvable issues, but someone would need to work on them a bit.

Maybe the easiest way to deal with both issues is to accept a class_ argument to the DataFrame constructor, so that we import Pandas only if a user hasn't explicitly specified a class. And then users could specify a single DataFrame class or multiple classes.
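A minimal sketch of that idea, written as plain Python rather than against Param itself. The `DataFrameCheck` class and the stand-in classes are hypothetical names used only for illustration; the point is that pandas is imported lazily, and only when no class was supplied.

```python
import importlib


class DataFrameCheck:
    """Hypothetical stand-in for a DataFrame Parameter; not Param's real API."""

    def __init__(self, class_=None):
        # A single class, a tuple of classes, or None for the pandas default.
        self._class = class_

    def accepted(self):
        if self._class is None:
            # Lazy import: pandas is only needed when no class was given.
            self._class = importlib.import_module("pandas").DataFrame
        return self._class

    def validate(self, value):
        accepted = self.accepted()
        if not isinstance(value, accepted):
            classes = accepted if isinstance(accepted, tuple) else (accepted,)
            names = ", ".join(cls.__name__ for cls in classes)
            raise ValueError(
                f"expected an instance of {names}, got {type(value).__name__}"
            )
        return value


# Stand-in classes so the sketch runs without any dataframe library installed.
class FakeRelation:
    pass


class FakeLazyFrame:
    pass


check = DataFrameCheck(class_=(FakeRelation, FakeLazyFrame))
relation = FakeRelation()
assert check.validate(relation) is relation
```

With this shape, a user who passes their own class never triggers the pandas import, while omitting `class_` keeps the current pandas-only behavior.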

Alternatively, we could switch to duck typing, and look for the presence of various standard DataFrame methods, rather than checking the class. Anyone have thoughts on that?
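For comparison, a duck-typing check could be sketched as below. The attribute names in `REQUIRED_ATTRS` are an assumption chosen for illustration, and `looks_like_dataframe` and `FakeTable` are hypothetical names; deciding which attributes count as "standard" is exactly the open question.

```python
# Attribute names assumed here for illustration; the right set is an open question.
REQUIRED_ATTRS = ("columns", "shape")


def looks_like_dataframe(obj, required=REQUIRED_ATTRS):
    """Return True if obj exposes every attribute name in `required`."""
    return all(hasattr(obj, name) for name in required)


class FakeTable:
    """Stand-in object exposing the assumed attributes."""

    columns = ["a", "b"]
    shape = (3, 2)


assert looks_like_dataframe(FakeTable())
assert not looks_like_dataframe([1, 2, 3])
```

The trade-off is that this accepts any object that happens to expose those names, so it cannot guarantee that only a type the user knows how to handle gets through.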

jbednar avatar Feb 28 '25 19:02 jbednar

Use Narwhals. I don't know exactly how, but it's made for solving these issues.

MarcSkovMadsen avatar Mar 01 '25 17:03 MarcSkovMadsen

cc https://github.com/holoviz/param/pull/975

maximlt avatar Mar 03 '25 15:03 maximlt

Narwhals is great, but I'm not sure it's that helpful here. Param doesn't have any dependencies outside the standard library, and adding a global dependency on Narwhals wouldn't be appropriate. Even requiring Narwhals just if this one Parameter is used would be odd for Pandas; most Pandas users aren't going to want to add a dependency on Narwhals if they don't use it. There could be some argument for using Narwhals for everything other than Pandas. But still Narwhals seems like a very heavy duty solution for this issue, since it focuses on recreating the API for non-Polars dataframes to match the Polars API, while here we are not actually using any of the DataFrame API, just validating the class. So I don't think we have the problem that Narwhals is addressing, and don't think we can add a dependency on it in any case.

@philippjfr's PR #975 is closer, but it hard-codes support for only a couple of different types, and it seems like we could address this in a more future-proof way that doesn't enumerate the possible values within the Parameter type. Focusing on the user of the DataFrame Parameter (i.e. the person who instantiates it into one of their classes):

  1. They could pass a list of string names of dataframe libraries to allow (and with each one, optionally, the dataframe type name(s) to accept, if not DataFrame).
  2. They could specify the dataframe name only (no library), which would accept anything called DataFrame.
  3. They could pass in the actual types accepted when constructing the Parameter.

At the Param level, any of these would work. But e.g. option 3 probably wouldn't work well for libraries like Panel that might want to add support for some DataFrame type but not import it by default to avoid adding a dependency on it.

In any case there are now so many DataFrame libraries that I'd like a solution where we don't need to re-release Param whenever a new one needs to be added, and instead can just let the user configure it how they like.
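As a rough sketch of option 1, validation by string names can be done without importing any dataframe library, by inspecting the module and class name that the value's type reports. The `"library:TypeName"` spec syntax and the `parse_spec` and `matches` helpers are hypothetical, invented here for illustration.

```python
def parse_spec(spec):
    """Split 'library' or 'library:Name1,Name2' into (library, type_names)."""
    library, _, names = spec.partition(":")
    type_names = tuple(names.split(",")) if names else ("DataFrame",)
    return library, type_names


def matches(value, specs):
    """Return True if any class in the value's MRO matches one of the specs."""
    parsed = [parse_spec(spec) for spec in specs]
    for cls in type(value).__mro__:
        # Compare only the top-level package, e.g. 'pkg' for 'pkg.core.frame'.
        top_level = cls.__module__.split(".")[0]
        for library, type_names in parsed:
            if top_level == library and cls.__name__ in type_names:
                return True
    return False


# Stand-in types that only mimic a library's module path; nothing is imported.
FakeFrame = type("DataFrame", (), {"__module__": "fakelib.core.frame"})
FakeRelation = type("Relation", (), {"__module__": "fakedb"})

assert matches(FakeFrame(), ["fakelib"])
assert matches(FakeRelation(), ["fakedb:Relation"])
assert not matches(FakeRelation(), ["fakedb"])
```

A caveat with this approach is that it depends on whatever module name a class reports, and compiled classes may report a private module name, so the matching rule would likely need tuning per library.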

jbednar avatar Mar 04 '25 05:03 jbednar

Narwhals provides validation for a broad and growing list of dataframe libraries. I see the proposed alternative, implementing our own validation, as introducing our own "standard", with its pros (independence) and cons (nobody knows it, and it requires work).

Downstream libraries (like Panel and HoloViz) would be able to utilize the types and APIs provided by Narwhals.

Bokeh will also start supporting Narwhals.

MarcSkovMadsen avatar Mar 05 '25 02:03 MarcSkovMadsen

Downstream libraries are in a different situation, in that they actually use the DataFrame involved. Param doesn't use it; it simply provides a container that can hold it. So all of the work done in Narwhals to provide a compatible API layer across these libraries is not relevant for Param. Ideally we can have good support for Narwhals if it's available, but I don't see it as solving any problem that Param has, even if it solves lots of problems for downstream libraries like Panel and HoloViews.

jbednar avatar Mar 07 '25 16:03 jbednar

Param will have to implement validation. I believe this will redo what Narwhals provides for a growing list of data frame libraries.

See https://github.com/panel-extensions/panel-graphic-walker/blob/main/src/panel_gwalker/_tabular_data.py.

MarcSkovMadsen avatar Mar 07 '25 18:03 MarcSkovMadsen

Hopefully @philippjfr can chime in to talk about downstream usage of this feature and what implications it might have for how that's best handled at the Param level.

jbednar avatar Mar 07 '25 20:03 jbednar

My feeling here is that we should continue to support pandas validation without requiring narwhals (because anything else is backward compatibility breaking), but for all other DataFrame libraries we will use narwhals for the validation part. I'd also suggest that we do not simply allow passing another DataFrame library to a DataFrame parameter by default, because again, that'd be backwards compatibility breaking. Precisely how users should be able to specify the allowable DataFrame libraries I'm not sure yet, but would love to hear suggestions.

philippjfr avatar Mar 07 '25 20:03 philippjfr

Agreed; a DataFrame Parameter should not accept anything but a Pandas DataFrame by default, because that has been the semantics so far.

jbednar avatar Mar 10 '25 14:03 jbednar

One idea to consider is whether it should be a Tabular parameter, to support tables in general (for example, DuckDB tables).

MarcSkovMadsen avatar Mar 11 '25 05:03 MarcSkovMadsen