skrub icon indicating copy to clipboard operation
skrub copied to clipboard

Allow storing preview value in skrub.var as a default value

Open Vincent-Maladiere opened this issue 5 months ago • 12 comments

During a test session of the Data Ops, @ogrisel and @glemaitre produced a Learner (currently, using .skb.get_pipeline) and were wondering how to obtain the list of keys or schema input for the learner.

A good proxy could be:

>>> pipeline.expr.skb.get_data()
{}

However, it is empty because we only return nodes bound with values, and we strip these values when calling skb.get_pipeline.

Therefore, we could add a method to the expression namespace and/or the Learner like get_data_schema(), which would return the list of keys. Ideally, we could also return some metadata like the type, and some optional metadata linked to the type (e.g., the columns and their dtypes for a dataframe).

Another idea that popped up is to set default values to variable, so that the learner can store them for later use. For instance, if we pass a directory path as a variable, it could be nice to store it as a default, passing it as an input to the Learner would become optional:

>>> skrub.var("path_directory", "my/path/dir", store_default=True).skb.get_pipeline().fit(environment={})

WDYT?

Vincent-Maladiere avatar Jul 08 '25 16:07 Vincent-Maladiere

I really like the first proposal (get_data_schema())

I also like the second, but I am wondering if setting a default value for what then becomes X may lead to misuse 🤔 I also think it could lead to problems when someone gets the learner and isn't aware of the default value, then tries to run the thing and everything crashes

I think I'd rather force the user to be explicit with the variables instead.

rcap107 avatar Jul 08 '25 17:07 rcap107

Interesting issue!

I would maybe vouch for "get_data_name" above "get_data_schema", as "schema" is a bit jargon. But it is a detail

GaelVaroquaux avatar Jul 08 '25 17:07 GaelVaroquaux

I also think it could lead to problems when someone gets the learner and isn't aware of the default value, then tries to run the thing and everything crashes

A user not aware of a default value would overwrite it, so the main concern is more about storing large dataframes inside the model. My guess is that users who are advanced enough to use this feature would not save large dataframes by mistake, but we can discuss this second idea in more depth.

Vincent-Maladiere avatar Jul 08 '25 18:07 Vincent-Maladiere

Thanks @Vincent-Maladiere those are both great ideas.

for the 1st one, I would go for something that returns just a list of names, or that returns a mapping from name to the var() objects created by the user. eg e.skb.get_var_names() (or get_data_names, or something else) -> ['products', 'baskets'] or get_vars() (method name also tbd) -> {'products': <the object returned by skrub.var('products')>, ...}. indeed the var() object contains all the possible metadata we could gather, and has all the usual .skb methods such as .skb.preview(), .skb.is_X() etc. I would not introduce a new type or datastructure to represent the metadata. also when there is no preview value the get_schema would have no way to infer a type etc. But apart from those details +1 it is clearly a great addition. We could also have a get_all_names() returning not only variables but all the names that allow setting a value in the environment, such as the names of choices or of nodes on which we called .skb.set_name().

for the 2nd one, I remember hesitating to put default values in variables. the reason I decided against it for the first version is that the concept of the preview value is already hard to grasp at first, because it doesn't really have equivalents in other tools. I thought adding a default value, which sounds quite similar (some kind of optional placeholder that is used when we don't provide an explicit value) but is actually something entirely different, would make that step even harder. even if we don't discuss it in the intro users still see it in the signature and reference doc. and because vars and their preview values must be introduced right away I thought reducing the confusion was more important than the convenience of default values.

that being said, I did hesitate and the example you provide is compelling. if we think of the expression/dataplan as some kind of tunable&fittable function, having default parameter values seems very natural and convenient. so maybe it should be added indeed.

Note that at the moment it is acheivable with a trick (which no one will think of but could be documented, if we are considering this is for advanced users): instead of giving a default to a variable, give a name to a literal value skrub.as_expr(value).skb.set_name(name). I guess as_expr will become as_data_op

>>> e = skrub.as_expr(10).skb.set_name('a')
>>> e.skb.eval()
10
>>> e.skb.eval({'a': 20})
20

jeromedockes avatar Jul 09 '25 20:07 jeromedockes

we could even have a .skb.make_docstring(summary) that collects all the variable names, descriptions and any relevant info and produces a text similar to the docstring you would write if the expression was a handwritten function or class 🙃

jeromedockes avatar Jul 09 '25 20:07 jeromedockes

about the name for get_var_names(), get_data_schema() etc whatever name we pick should be quite different from get_params() to avoid confusing it with the scikit-learn get_params

jeromedockes avatar Jul 09 '25 20:07 jeromedockes

Very nice (and fun) trick!

Vincent-Maladiere avatar Jul 10 '25 09:07 Vincent-Maladiere

I'm working on the first idea

Vincent-Maladiere avatar Jul 10 '25 13:07 Vincent-Maladiere

I'm working on the first idea

cool :) you can get the nodes easily with _evaluation.nodes:

>>> import skrub
>>> from skrub._expressions._evaluation import nodes, Var
>>> e = skrub.X() + skrub.y()
>>> {impl.name: n for n in nodes(e) if isinstance((impl := n._skrub_impl), Var)}
{'X': <Var 'X'>, 'y': <Var 'y'>}

jeromedockes avatar Jul 10 '25 14:07 jeromedockes

why close the issue? the first part (getting the variable names) has been implemented, but from the discussion there doesn't seem to be a conclusion on the second part (store_default in variables)?

jeromedockes avatar Nov 03 '25 09:11 jeromedockes

Same here, I closed this by mistake

rcap107 avatar Nov 03 '25 10:11 rcap107

renamed the issue as the part that was in the title has been implemented already

jeromedockes avatar Dec 06 '25 18:12 jeromedockes