
RFC: Naming of the functions

Open MarieSacksick opened this issue 5 months ago • 29 comments

Hello! During the sprint yesterday, I used skrub's data ops and data plan for the first time.
I was a bit puzzled at first, but after ~15-30 minutes it became clear how it works: a node is added at each apply, on top of the graph (or of another node).

Quickly, I made what is apparently a classic typo, forgetting the .skb for the second apply:
X.skb.apply_func(pd.DataFrame).apply(vectorizer, y=y)

The error rendered is:

RuntimeError: Evaluation of '.apply()' failed.

Which is not clear. Adding something like "you might have forgotten the .skb" would certainly help, if it's going to be a frequent error.
Yet, in my opinion, it's fixing the consequence but not the root cause.
@Vincent-Maladiere explained the idea of a namespace to me. I understand the idea, but it's not something I'm familiar with.
As a data-scientist-before-being-software-engineer, here are my pain points with having to do .skb:

  • it's something really weird and unnatural,
  • I don't fully understand what I am doing; am I applying a method to transform my object into a skrub object? Isn't it supposed to be one already?
  • I don't think I can remember that easily. Even with a lot of skrub practice, I think that I will forget it often (definitely more than 30% of the time).

A solution I would like to suggest is to prefix the functions with skb_ directly in their names. The above code would become:
X.skb_apply_func(pd.DataFrame).skb_apply(vectorizer, y=y)

This has the following benefit in my opinion:

  • it's clear
  • there is autocomplete to see all the functions available
  • less error when coding
  • the prefix indicates that it's part of the same kind of functions to be used in a given context (= no benefit lost compared to .skb)

TL;DR

I'm not going to list more bullet points comparing the .skb solution with skb_, because I'm pretty sure you have explored them already; mostly what I want to say is that, as a data-scientist-before-being-software-engineer, I prefer the latter.

MarieSacksick avatar Jul 10 '25 08:07 MarieSacksick

Thanks @MarieSacksick for opening the issue!

For my part, I can say that I had the same experience of constantly forgetting the .skb even after spending quite a lot of time working with the namespace. I can definitely see this being a common difficulty for people who are starting out with the DataOps.

rcap107 avatar Jul 10 '25 08:07 rcap107

I read this with interest (what matters to me is the feeling of users).

I'm curious: do y'all use the ".dt" accessor of pandas Series, or ".str"? I find them really powerful and useful.
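
(For reference, a minimal sketch of what those accessors do in plain pandas:)

import pandas as pd

names = pd.Series(["Alice", "Bob"])
dates = pd.to_datetime(pd.Series(["2025-07-10", "2025-07-11"]))

names.str.lower()  # string methods grouped behind the .str accessor
dates.dt.year      # datetime methods grouped behind the .dt accessor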

GaelVaroquaux avatar Jul 10 '25 10:07 GaelVaroquaux

I have never used them.
If I had to do what they do in their examples, I think I would use an apply and the datetime library.

MarieSacksick avatar Jul 10 '25 11:07 MarieSacksick

I do use both .dt and .str when I need them, but I think what makes the difference for me is that they are applied to columns, within dataframes.

Example with polars (because I don't remember pandas)

df.with_columns(pl.col("a").str.lower())

So here I am doing something "to the column"

In the DataOps case instead, what I have in mind is a dataframe that is being modified, so I am just expecting to have access to the functions directly from the object

df.apply(whatever)
# rather than
df.skb.apply(whatever)

Another observation is that, when I'm working with columns, .str, .dt and .list are all separate namespaces that let you do different things on the same object, while with the DataOps there is only one namespace doing everything, so the extra step to access the functions feels a bit redundant and unnecessary.

rcap107 avatar Jul 10 '25 11:07 rcap107

Thanks @MarieSacksick, can you share the full example? The error message I get does suggest .skb might be missing:

>>> import numpy as np
>>> import pandas as pd
>>> import skrub
>>> X = skrub.X(np.eye(3))
>>> X.skb.apply_func(pd.DataFrame).apply(skrub.TableVectorizer())
Traceback (most recent call last):
    ...
TypeError: Calling `.apply()` with an estimator: `TableVectorizer()` failed with the error above. Did you mean `.skb.apply()`?

I agree we should wrap the line, because in browser-based environments that scroll the output, that part of the message gets hidden. Note the message above is a special case because pandas dataframes also have an apply method; the generic message is:

>>> import skrub
>>> skrub.X(0).apply(skrub.TableVectorizer())
Traceback (most recent call last):
    ...
RuntimeError: Evaluation of '.apply' failed.
You can see the full traceback above. The error message was:
AttributeError: 'int' object has no attribute 'apply'.
Did you mean '.skb.apply'?

This one is wrapped and easier to read.

The skrub methods are grouped in the .skb attribute because the data op wraps another Python object while still giving access to all the wrapped object's attributes. So if we have X = skrub.X(df), we can still do X.merge(...), X.iloc[10] or X['title'].str.strip().str.lower(). We do not want to flood the object's namespace with many skrub names, which is why the skrub functions are collected behind a single access point, .skb. We still have autocompletion: of the object's own attributes when completing on X., and of all the skrub functions when completing on X.skb., while avoiding mixing the two.
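
To make this concrete, here is a minimal sketch of that dual access (a hedged illustration, using only calls already shown in this thread):

import pandas as pd
import skrub

X = skrub.X(pd.DataFrame({"title": [" A ", " b "], "n": [1, 2]}))

# the wrapped DataFrame's own attributes are still reachable on X
titles = X["title"].str.strip().str.lower()

# skrub's own functions sit behind the single .skb access point
vectorized = X.skb.apply(skrub.TableVectorizer())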

Grouping related names in separate namespaces is common; it is one of the reasons we have modules (itertools.count and not itertools_count), and pandas, polars and other data science packages (including skore) expose core functionality behind similar "accessors", e.g. .iloc, .str, .list, .plot, etc. When wrapped in a skrub data op, there is one more such accessor, .skb, and all the other ones are still available.

I agree it is easy to forget skb. and type apply instead of skb.apply, but then it might also be possible to forget skb_ and type apply instead of skb_apply.

jeromedockes avatar Jul 10 '25 20:07 jeromedockes

In the DataOps case instead, what I have in mind is a dataframe that is being modified, so I am just expecting to have access to the functions directly from the object

Because we want to still give access to the wrapped object's attributes, they have to be separated. In your example, if we allowed df.apply instead of df.skb.apply, it would clash with pandas.DataFrame.apply. And there is not only one namespace that does everything, because the skrub namespace gets added alongside the wrapped object's namespace rather than replacing it.

jeromedockes avatar Jul 10 '25 20:07 jeromedockes

One thing that could be considered, if the attributes and methods added by skrub are accessed much more than those of the wrapped object, is to swap the roles: have the skrub names at the top level and group the other ones in something like raw or wrapped, e.g. df.cross_validate() and df.raw.groupby or df.wrapped.groupby. (A minor benefit is that it would allow the wrapped object to have its own skb attributes, but I can't think of any that do.) The reason it was done in this way was to treat the original object's methods as the 'first-class citizens' and the easiest to access, but if it turns out that we actually use the .skb ones a lot more, it might make sense to swap. Either way I wouldn't merge the two.

jeromedockes avatar Jul 10 '25 20:07 jeromedockes

In the DataOps case instead, what I have in mind is a dataframe that is being modified, so I am just expecting to have access to the functions directly from the object

Because we want to still give access to the wrapped object's attributes, they have to be separated. In your example, if we allowed df.apply instead of df.skb.apply, it would clash with pandas.DataFrame.apply. And there is not only one namespace that does everything, because the skrub namespace gets added alongside the wrapped object's namespace rather than replacing it.

Yes, I remember there would be collisions; in the example I should have written .skb_apply instead of .apply.

rcap107 avatar Jul 10 '25 20:07 rcap107

One thing that could be considered, if the attributes and methods added by skrub are accessed much more than those of the wrapped object, is to swap the roles: have the skrub names at the top level and group the other ones in something like raw or wrapped, e.g. df.cross_validate() and df.raw.groupby or df.wrapped.groupby. (A minor benefit is that it would allow the wrapped object to have its own skb attributes, but I can't think of any that do.) The reason it was done in this way was to treat the original object's methods as the 'first-class citizens' and the easiest to access, but if it turns out that we actually use the .skb ones a lot more, it might make sense to swap. Either way I wouldn't merge the two.

I am thinking this might be a better approach, but if we go with this solution we might want to use deferred functions more.

Personally, I tend to use apply more often than not, and usually I wrap complex dataframe functions in deferred functions anyway, so I don't use the actual dataframe functions directly very often.
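
To illustrate, a minimal sketch (add_ratio is a made-up example; it assumes skrub.deferred works as an ordinary decorator, as mentioned later in this thread):

import pandas as pd
import skrub

@skrub.deferred
def add_ratio(df):
    # inside the function, df is a plain DataFrame, so normal pandas code applies
    return df.assign(ratio=df["feature1"] / df["feature2"])

data = skrub.var("data", pd.DataFrame({"feature1": [1.0, 2.0], "feature2": [4.0, 5.0]}))
with_ratio = add_ratio(data)  # adds a lazy step to the plan instead of running now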

rcap107 avatar Jul 10 '25 20:07 rcap107

We do not want to flood the object's namespace with many skrub names

I guess @MarieSacksick doesn't share this POV, hence her feedback.

One thing that could be considered, if the attributes and methods added by skrub are accessed much more than those of the wrapped object, is to swap the roles

I think it would make things even worse. You would have to invoke this namespace several times for a simple groupby operation:

df.raw.groupby("col").raw.mean().raw.reset_index()

Instead, following the suggestion above looks like a path we should explore, especially if some people explicitly ask for it.

If we prefix all methods with skb_, we're guaranteed that they won't clash with original methods (skb_apply vs apply), and that these methods will be next to each other when listing them using dir, for example.

Vincent-Maladiere avatar Jul 11 '25 11:07 Vincent-Maladiere

I agree with you that raw is probably worse than skb, especially since it would not be the actual value but still a lazy object, so it would be super confusing after all.

I still think we should go with skb. and not skb_ -- separating names into namespaces whose members are accessed with . is done all the time. I cannot think of many examples in Python where namespaces are emulated by adding a prefix directly to the member names.

I'm not sure why it is better to have the skrub names and pandas names in the same dir output rather than in dir(op) and dir(op.skb) respectively.

@rcap107 I agree about the use of deferred; maybe we should use it more in examples rather than having many operations directly on the objects.

To go even further in that direction, we could also consider removing the dynamic attributes and operators entirely, i.e. we could not manipulate the data op as if it were a dataframe or numpy array; all the code would have to be in a deferred function or a scikit-learn estimator. I think it would make the code easier to reason about by distinguishing eager and deferred operations more clearly, making it easier to know without context whether the code we read manipulates a data op or a normal object. A big drawback is that it would make some simple things much more verbose, forcing the user to define a lot of tiny single-use functions.

jeromedockes avatar Jul 11 '25 17:07 jeromedockes

I thought a bit more about this issue, and I think that keeping the namespace is the best solution, provided that it is explained well enough in the starting guides and in any talk we give about the DataOps: I realized late that I did not have anything about the namespace in the slide deck, and I think that did not help with understanding how to use the functions.

I also think that none of the proposed solutions is clearly better (as in, easier to understand) than the others, so I am worried that we might end up making changes that do not actually improve the situation.

So my current position is to keep things as they are, and to make a conscious effort to explain the .skb namespace better in slides and guides.

rcap107 avatar Jul 13 '25 22:07 rcap107

I'm ok with whatever, but I'm concerned with how little we care about users' feedback. Instead of trying to understand their perspective, we tell them how they should think. From the start, we've been working on expressions in submarine mode, dedicating an insane amount of time without ever asking anyone if our efforts would solve their problems or if they see value in them. As a result, we don't have a clear roadmap, and we spend a lot of time nitpicking on things that don't bring value. This lack of product management feels extremely unmotivating for me.

Vincent-Maladiere avatar Jul 15 '25 09:07 Vincent-Maladiere

We've been trying to account for users' feedback, keeping a channel as open as possible. I personally really value the fact that this specific RFC has been open. My apologies if I gave the wrong impression.

On this specific topic, I do not know what the right way forward is. I think that the signal that @MarieSacksick is giving us is a very interesting one. If the signal is that for many users switching from namespace (.skb.) to prefix (.skb_) makes adoption much easier (which I find perfectly understandable), then we should do it.

What I would love is to gather user feedback on this. Maybe @rcap107 you can do two "mocks" and present them to a few people. We could do a poll on social media, but the challenge is how to word it, as the wording creates the answer. Let's think about it. Any suggestions on gathering user feedback, by the way?

One thing to keep in mind is that the DataOps are a really big feature, and thus we've been working on them for a very long time, and, well, we're tired :). It's not that we haven't gathered user feedback on them: we've asked our network to try them out and tell us what they think, and we've adapted the design based on this. Still, bootstrapping a new vision from nothing is very hard, and that's what we're doing. I hope that in the not-too-distant future, we can be very proud of what we have achieved.

With regards to roadmap and product management, it's a hat that I've mostly been wearing. I wasn't sure what this meant until recently, but I've read Marty Cagan's book and I feel more confident about things. I've been trying to build a vision and prioritize what we do. I do not believe that I can fully spell out a roadmap, nor do I believe that it is the right thing to do. I must confess that I am dynamically adapting the priorities based on what I hear from the team and users. I understand that it may create confusion, and I try not to change directions too often. I'm happy to have a reflection on how to do this best.

I'll conclude with what I believe and what I read in Marty Cagan's book: it is crucial to keep listening to every stakeholder, in particular our target users.

GaelVaroquaux avatar Jul 15 '25 12:07 GaelVaroquaux

Thank you for this answer and for addressing my concern. That might have sounded harsher than I intended, sorry about that. Let's think about how to gather more user feedback to get a stronger signal, indeed. There are people we can contact once we do the release.

Vincent-Maladiere avatar Jul 15 '25 12:07 Vincent-Maladiere

What I would love is to gather user feedback on this. Maybe @rcap107 you can do two "mocks" and present them to a few people. We could do a poll on social media, but the challenge is how to word it, as the wording creates the answer. Let's think about it. Any suggestions on gathering user feedback, by the way?

Something like this?

#%%
import skrub
from sklearn.linear_model import LogisticRegression
import pandas as pd

df_1 = pd.DataFrame({
    "key": [1, 2, 3, 4, 5],
    "feature2": [5, 4, 3, 2, 1],
    "target": [0, 1, 0, 1, 0]
})

df_2 = pd.DataFrame({
    "key": [1, 2, 3, 4, 5],
    "feature1": [1, 2, 3, 4, 5],
})

data = skrub.var("data", df_1)
additional_data = skrub.var("additional_data", df_2)

# Version 1: .skb namespace
X = data.drop(columns=["target"], axis=1).skb.mark_as_X()
y = data["target"].skb.mark_as_y()
X_joined = X.merge(additional_data, on="key", how="left")
X_vec = X_joined.skb.apply(skrub.TableVectorizer())
predictor = X_vec.skb.apply(LogisticRegression(), y=y)
predictor.skb.cross_validate(cv=2)
learner = predictor.skb.get_learner()
learner.fit({"data": df_1, "additional_data": df_2})

# Version 2: skb_* functions
X = data.drop(columns=["target"], axis=1).skb_mark_as_X()
y = data["target"].skb_mark_as_y()
X_joined = X.merge(additional_data, on="key", how="left")
X_vec = X_joined.skb_apply(skrub.TableVectorizer())
predictor = X_vec.skb_apply(LogisticRegression(), y=y)
predictor.skb_cross_validate(cv=2)
learner = predictor.skb_get_learner()
learner.fit({"data": df_1, "additional_data": df_2})

# Version 3: using skrub objects as first-class citizens
X = data.raw().drop(columns=["target"], axis=1).mark_as_X()
y = data.raw()["target"].mark_as_y()
X_joined = X.raw().merge(additional_data, on="key", how="left")
X_vec = X_joined.apply(skrub.TableVectorizer())
predictor = X_vec.apply(LogisticRegression(), y=y)
predictor.cross_validate(cv=2)
learner = predictor.get_learner()
learner.fit({"data": df_1, "additional_data": df_2})

rcap107 avatar Jul 15 '25 13:07 rcap107

AFAIU, version 3 would look like:

- X = data.drop(columns=["target"], axis=1).mark_as_X()
- y = data["target"].mark_as_y()
+ X = data.raw().drop(columns=["target"], axis=1).mark_as_X()
+ y = data.raw()["target"].mark_as_y()

Vincent-Maladiere avatar Jul 15 '25 13:07 Vincent-Maladiere

I realized late that I did not have anything about the namespace in the slide deck, and I think that did not help with understanding how to use the function.

I should be able to understand it by myself quite easily, without needing you to explain! So not your fault :).

If I can elaborate on the why: it's not so much about forgetting the namespace; rather, I didn't really comprehend what that intermediate object data.skb is. Now I understand that it's more or less "nothing", but it was weird to me at first.

MarieSacksick avatar Jul 15 '25 17:07 MarieSacksick

@Vincent-Maladiere you are right of course, and sorry for pushing back too quickly and too hard on the skb_ solution, @MarieSacksick.

To me the deeper issue is that we are trying to smash two unrelated interfaces into one, by having the dataops pretend that they are a dataframe (or numpy array, or whatever the result is) when they really are something of an entirely different nature, with a different behavior. This looks nice in simple examples but actually creates lots of confusion and subtle bugs, whether we avoid conflicts via namespaces or via a naming convention. We attempt to make "computing something now" and "adding a step to the learner" look the same, when we should be helping users to tell them apart easily.

For example if we have

>>> import polars as pl
>>> import skrub
>>> df = pl.DataFrame({'a': range(4), 'b': range(4, 8)})
>>> X = skrub.X(df)

and quite a few lines later

>>> X.filter(pl.col('a') > pl.col('a').mean())

without context it is hard to tell that X is actually a dataop and that filter doesn't run immediately. The fact that it looks the same as if X were a dataframe makes it harder to reason about the code, not easier. And we cannot just squint our eyes and pretend that it is normal eager code, because debugging becomes hard if we ignore that difference. For example, the line above works, but

>>> X.with_row_index().filter(pl.col('index') < X.shape[0] / 2)
>>> #                                        ^^^ oops

doesn't (the < comparison has a polars object as the lhs and a dataop as the rhs, and the polars object raises instead of returning NotImplemented). Many other things can be puzzling when we mix up "things that run now" and "things that run when we call fit()", such as "why can't I iterate over X.columns", subtle bugs involving global variables, etc.

So as @rcap107 found out, it is easier to put anything non-trivial in a deferred function or a scikit-learn estimator, because then our code manipulates actual normal Python objects and it is much harder to get stung by the fact that an operation that looks like it runs now is actually just being scheduled to run when we call fit(). Then the boundary between normal/dataframe-manipulation code and lazy/dataplan-manipulation code is clear, and the interaction between the two consists only of a few well-identified entry points: .skb.apply() and .skb.apply_func()/skrub.deferred(...)().
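
For instance, a sketch of how the buggy snippet above could be rewritten with a deferred function (keep_first_half is a made-up name):

import polars as pl
import skrub

@skrub.deferred
def keep_first_half(df):
    # here df is a real polars DataFrame, so df.shape behaves as expected
    return df.with_row_index().filter(pl.col("index") < df.shape[0] / 2)

X = skrub.X(pl.DataFrame({"a": range(4), "b": range(4, 8)}))
X_half = keep_first_half(X)  # schedules the step instead of running it now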

If we were to drop the dynamic attributes and operators, and have the dataops objects be just skrub dataops with only the functions currently in .skb, we could save users some headaches, have much less explaining to do in the docs, and drastically simplify the skrub codebase. We also wouldn't need the skb. nor the skb_ anymore. To make sure we don't slip into creating a small dataframe API, and that the "dataop-returning" functions remain very few and well-identified, I would even remove select, drop and concat, and use e.g. apply(skrub.SelectCols()) instead. Anything like a groupby or join would have to be done in a function, or with a skrub method we add if there needs to be some fittable state.
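
As a sketch of what that could look like (assuming skrub.SelectCols accepts a list of column names):

import pandas as pd
import skrub

X = skrub.X(pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]}))

# instead of a dataop-level select/drop, selection becomes an estimator step
X_sel = X.skb.apply(skrub.SelectCols(["a", "b"]))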

So maybe the "pretend to be the result object" thing was a good experiment to make, to go one step beyond the "too magic" boundary and find where it is, and it could be a good idea now to take a step back and strike a reasonable balance between code that is not too verbose and code that is easy to debug. The magic thing is quite convenient though, so I'm really not sure on this one. I think it would be worth trying to rewrite a few examples using only what is in .skb and find out what primitives would be missing to make that manageable.

jeromedockes avatar Jul 15 '25 18:07 jeromedockes

Given the above, one thing to consider when choosing between skb_ and skb. is which one more clearly sets apart dataframe functions from skrub functions. I think a case could be made for both (having them in separate namespaces / dir outputs, or having an obvious marker directly in the name).

jeromedockes avatar Jul 15 '25 18:07 jeromedockes

Regarding user feedback: indeed it is important to gather it, but note that it will remain very scarce as long as it is only possible to try the feature by installing a development branch. Showing snippets and collecting opinions is useful, but reading a small snippet written by someone else and actually using the library to write a script for your own data are very different things. For example, for the very valid point that it is easy to forget the .skb: I think some people would not detect that while glancing at an example snippet, but would experience it when trying to write their own code with it.

jeromedockes avatar Jul 15 '25 18:07 jeromedockes

So maybe the "pretend to be the result object" thing was a good experiment to make, to go one step beyond the "too magic" boundary and find where it is, and it could be a good idea now to take a step back and strike a reasonable balance between code that is not too verbose and code that is easy to debug. The magic thing is quite convenient though, so I'm really not sure on this one.

I get really good feedback on this from users to whom I present this feature.

I would be interested in @ogrisel's thoughts on this: he was external to these design choices, but has played a fair amount with skrub if I understand correctly.

GaelVaroquaux avatar Jul 16 '25 07:07 GaelVaroquaux

Regarding user feedback: indeed it is important to gather it, but note that it will remain very scarce as long as it is only possible to try the feature by installing a development branch.

Right, and we should consider:

  1. Marking this feature experimental (in the docs)

  2. Making changes to the API based on user feedback later

GaelVaroquaux avatar Jul 16 '25 07:07 GaelVaroquaux

Regarding user feedback: indeed it is important to gather it, but note that it will remain very scarce

How scarce it remains depends on the willingness of skrub developers to go out and sit with people trying the DataOps on their laptops. It's up to us.

Vincent-Maladiere avatar Jul 18 '25 07:07 Vincent-Maladiere

So maybe the "pretend to be the result object" thing was a good experiment to make, to go one step beyond the "too magic" boundary and find where it is, and it could be a good idea now to take a step back and strike a reasonable balance between code that is not too verbose and code that is easy to debug. The magic thing is quite convenient though, so I'm really not sure on this one.

I think that's a reasonable solution, and it would indeed make data ops easier to grasp. Plus, it's closer to existing paradigms like Metaflow, in which you define a graph of operations as methods of a Metaflow class, decorated with a step function (vs deferred for us). So, a clear +1 for this solution.

Vincent-Maladiere avatar Jul 18 '25 07:07 Vincent-Maladiere

So maybe the "pretend to be the result object" thing was a good experiment to make, to go one step beyond the "too magic" boundary and find where it is, and it could be a good idea now to take a step back and strike a reasonable balance between code that is not too verbose and code that is easy to debug. The magic thing is quite convenient though, so I'm really not sure on this one.

I think that's a reasonable solution, and it would indeed make data ops easier to grasp. Plus, it's closer to existing paradigms like Metaflow, in which you define a graph of operations as methods of a Metaflow class, decorated with a step function (vs deferred for us). So, a clear +1 for this solution.

Would this mean refactoring a major part of the backend of the code? If we go with this solution (which I am fine with, for the record), I think we should do it in the next release, and in the meantime modify the examples so that the "pretend to be the result object" thing does not come up, using deferred functions throughout instead.

rcap107 avatar Jul 18 '25 08:07 rcap107

So maybe the "pretend to be the result object" thing was a good experiment to make, [...] take a step back

Now that we've done it (I've enjoyed that road hugely), I think that we should release and gather more user feedback in real-life settings.

We should clearly mark this as experimental in the documentation.

I have heard that a lot can be done with "deferred" and "apply", and that this could probably be enough. They do require a shift of thinking away from the imperative succession of small operations that people are used to with pandas (and that are easy to fathom without training in computer science). I think that something great we can do is use the DataOps magic mode to onboard people, but slowly nudge them toward more advanced patterns with deferred and apply. This can be done via dedicated examples and lightweight touches in the docs.


And now, still a question for me: should we change from ".skb.xxx" to ".skb_xxx" before the release?

GaelVaroquaux avatar Jul 18 '25 11:07 GaelVaroquaux

And now, still a question for me: should we change from ".skb.xxx" to ".skb_xxx" before the release?

My personal preference is for .skb.xxx, but if others prefer skb_xxx I am fine with changing.

rcap107 avatar Jul 18 '25 12:07 rcap107

I'm personally +0.5 for skb_xxx based on Marie's feedback. At St Gobain, people also struggled to grasp the nature of the skb namespace. Not +1 because it would delay the release further.

Vincent-Maladiere avatar Jul 18 '25 14:07 Vincent-Maladiere