
Clarifying objects and terminology: "DataOps make Learners"

Open GaelVaroquaux opened this issue 7 months ago • 7 comments

Today, I discuss the skrub pipeline and its various objects and methods with many people, and it seems important to me to clarify the objects around the skrub pipeline and the corresponding terminology.

In particular, we have two very different types of objects (and it was not clear in my head); a short sketch follows the list:

  • objects that have a ".skb" namespace. We call them "Expressions" now. I propose to call them "DataOps", because their purpose is to link data to operations
  • objects that have a "fit()" method. These are not scikit-learn estimators, because their fit and predict have a different signature. We call them "Pipelines" currently. I propose to call them "Learners", because 1) they have "fit", and the intuition is that a learner should have a fit, and 2) the term is not overloaded, as opposed to "pipeline" (a generic term; every community has a different understanding of what a pipeline is) or "estimator" (a general statistical concept).
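To make the distinction concrete, here is a minimal sketch using the current experimental names mentioned in this thread (`skrub.var`, `.skb.apply`, `.skb.get_pipeline`); the toy dataframe is only for illustration:

```python
import pandas as pd
import skrub

df = pd.DataFrame({"city": ["Paris", "London"], "temp": [12.5, 9.0]})

# First kind: an object with a ".skb" namespace
# ("Expression" today, "DataOps" proposed).
X = skrub.var("X")
pred = X.skb.apply(skrub.TableVectorizer())

# Second kind: an object with fit() ("Pipeline" today, "Learner" proposed).
# Its fit() takes an environment dict mapping variable names to values,
# not the (X, y) arrays of a scikit-learn estimator:
pipeline = pred.skb.get_pipeline()
pipeline.fit({"X": df})
```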

Also, I came to realize that it is very important to be explicit about, and to understand well, when we jump from one world (DataOps) to the other (Learners). It is something that we need to do at some moments. Currently, this is done mostly by methods called "get_xxx". I suggest renaming those methods to replace "get_", which is a very generic prefix, with "make_" (open to other suggestions), and then being very explicit: every "make_xxx" method creates a Learner from a DataOps.

Finally, I suggest a slightly more active and shorter name for "get_randomized_search": "make_random_search". This might be the first function that we explain when discussing tuning of DataOps. A sketch of the proposed renaming follows.
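As a hedged sketch, the renaming would look like this (the `make_*` names are suggestions from this issue, not an existing API; `get_pipeline` and `get_randomized_search` are the current names discussed here):

```python
import skrub

pred = skrub.var("X").skb.apply(skrub.TableVectorizer())

# today: generic "get_" names
learner = pred.skb.get_pipeline()
# proposed: explicit "make_" names, each creating a Learner from a DataOps
# learner = pred.skb.make_learner()

# likewise for tuning (arguments omitted):
# pred.skb.get_randomized_search(...)  ->  pred.skb.make_random_search(...)
```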

I'm opening a discussion, and if we converge, the action point would be to revisit all our material (docs, docstrings, examples, API) to be very, very systematic about these names and the corresponding concepts. I think that this would be important to facilitate discovery and adoption.

GaelVaroquaux avatar May 06 '25 14:05 GaelVaroquaux

We could also have "DataPipe" rather than "DataOps". DataOps is plural, so the sentences are weird. "DataPipe" kinda echoes "DataFrame"

GaelVaroquaux avatar May 06 '25 19:05 GaelVaroquaux

How about "FrameOps" rather than "DataOps" to make it explicit that we are working with dataframes? I feel like "Data" is another very overloaded term.

rcap107 avatar May 12 '25 10:05 rcap107

I have no objection to renaming get_pipeline to make_learner. We just need to decide what to do with skrub.tabular_learner(), which according to this new terminology does not return a learner.
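For reference, a minimal illustration of the clash, assuming the existing `skrub.tabular_learner` behavior (it returns a scikit-learn pipeline, not the fit-on-an-environment object proposed as "Learner"):

```python
import skrub

# A scikit-learn Pipeline (roughly a TableVectorizer followed by a
# gradient-boosted model), not an object exported from a skrub expression:
model = skrub.tabular_learner("regressor")
```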

Regarding what is currently called an "expression":

I still think we should have "expression" or "expr" in the name because similar objects (lazily evaluated expressions) in dataframe libraries are called "expressions":

comparison of those libraries on an example

the example from the polars user guide:

>>> import skrub
>>> import polars as pl

>>> df = pl.DataFrame({"weight": [60, 70], "height": [180, 175]})

polars

define the expression

>>> x = df
>>> e = pl.col("weight") / (pl.col("height")**2)

evaluate

>>> x.select(e)
shape: (2, 1)
┌──────────┐
│ weight   │
│ ---      │
│ f64      │
╞══════════╡
│ 0.001852 │
│ 0.002286 │
└──────────┘

ibis

>>> import ibis
>>> from ibis import _

define the (table) expression

>>> x = ibis.memtable(df.to_pandas())
>>> e = _.weight / _.height**2

evaluate

>>> x.select(e).to_pandas()
   Divide(weight, Power(height, 2))
0                          0.001852
1                          0.002286

datatable

>>> import datatable as dt
>>> from datatable import f

define the (f-) expression

>>> x = dt.Frame(df.to_pandas())

>>> e = f.weight / f.height**2

evaluate

>>> x[:, e]
   |         C0
   |    float64
-- + ----------
 0 | 0.00185185
 1 | 0.00228571
[2 rows x 1 column]

python

define the expression

>>> x = df
>>> e = "x['weight'] / x['height']**2"

evaluate

>>> eval(e, {'x': x})
shape: (2,)
Series: 'weight' [f64]
[
	0.001852
	0.002286
]

pandas

define the (lambda) expression

>>> x = df.to_pandas()
>>> e = lambda x: x['weight'] / x['height']**2

evaluate

>>> x.pipe(e)
0    0.001852
1    0.002286
dtype: float64

skrub

define the expression

>>> x = skrub.var('x')
>>> e = x['weight'] / x['height']**2

evaluate

>>> e.skb.eval({'x': df})
shape: (2,)
Series: 'weight' [f64]
[
	0.001852
	0.002286
]

Also, those skrub objects represent expressions (variables, operator applications, function calls, etc.).

Finally, I think "expression" is a more distinctive term and thus helps make the distinction clearer between the expression (the thing you write) and the learner (the thing with a fit method).

If there is a consensus that we do not want this term, I would suggest something like "plan" or "blueprint" to highlight that those objects describe the plan used to build the learner ("plan" is also reminiscent of a query plan, which is a somewhat related concept at a very high level). Maybe something with "flow" could also make sense.

I'm not sure DataPipe, DataOps or FrameOps help users understand what those objects are, or how they relate to and differ from the learners.

jeromedockes avatar May 15 '25 14:05 jeromedockes

I like DataPlan a lot.

GaelVaroquaux avatar May 16 '25 05:05 GaelVaroquaux

I'll try to summarize the various discussions we had about this topic with the skrub devs.

  • We have two distinct "objects": the object with the .skb namespace, which is currently called an "expression", and the objects with fit, which are currently called "pipelines".
  • Overall, there was no disagreement on renaming "pipelines" to "learners"
  • "Expressions" is still overall favored over the alternatives ("DataPlan", "DataOp").
  • One advantage of using "Expressions" is that we can offload some of the explanation of the term to Polars
  • Renaming the Expressions to "DataPlan" or "DataOp" would probably require renaming more objects, and combining "DataPlans" carries a different connotation from combining "Expressions" (e.g., "Expr + Expr" does the sum between the two expressions, but what does "DataPlan + DataPlan" mean?)
  • The overall consensus is "keep expressions, rename pipeline to learner"

From here on, I'm reporting my own take on the subject:

  • I like the name "expressions" a lot, but I am biased because I also like Polars, and at this point I am used to the name from working on it for a long time.
  • Reusing the name "expression" has the advantage that people in the PyData ecosystem have at least an idea of the concept; this is not the case for DataOps or DataPlan (I may be wrong on this)
  • I find starting from a unitary "DataPlan" far less intuitive than starting from a Variable expression that is modified by further expressions
  • Given the premise of constructing and then replaying pipelines, the first part (the construction) is done by combining expressions, and the second consists of taking the result of the construction (the learner) and putting it in production. I prefer the term "learner" to "pipeline" here because it is a more distinctive name that is easier to describe as the result of the first step.
  • I also liked the term "data plan" to describe all the combinations of expressions needed to build a "learner"; however, adding a new term that does not appear in the API may be counterproductive, as it may confuse people learning about the library for the first time. To me, "data plan" conveys well the idea of modifying data in a systematic and ordered way.
  • I don't have strong feelings on the learner/pipeline choice, but I do prefer learner
  • I strongly prefer "expression" to either DataPlan or DataOp
  • I am also worried that if we decide to rename expressions to DataPlan/DataOp, we might have to rename other objects and be stuck in exactly the same situation we are in now.

rcap107 avatar Jun 23 '25 13:06 rcap107

thanks a lot for this summary @rcap107 . I agree with everything you say 🙂 , including the second ("my own take") part

jeromedockes avatar Jun 23 '25 20:06 jeromedockes

I spent more time trying to come up with a different way to explain the expressions and this is what I came up with.

Premise: I was trying to keep in mind the use case of a data scientist who needs to prepare data and build a predictor, then put the resulting pipeline in production, ensuring that all the operations and parameters are tracked correctly and that there is no data leakage.

I really like the term "Data Plan" to describe everything that is done as part of the preprocessing/feature engineering/training, so my idea to describe what are now called "expressions" is this:

A data plan contains Variables and Data Operations (Data Ops), where Data Ops act on the variables and wrap User Operations to keep track of their parameters. User operations can be dataframe operations (selection, merge, group by, etc.), scikit-learn estimators (a random forest with its hyperparameters), or arbitrary code (for loading the data, converting, etc.).

The Data Plan records the sequence of DataOps and their parameters, and can be synthesized/distilled/exported (term TBD) as a standalone object called the learner. The learner replays the operations that have been recorded by the data plan on unseen data, ensuring that the same user operations and parameters are used.

So, the skrub Data Plan represents a combination of variables and data ops that can be transformed into a standalone object (the learner), which can then be saved to disk and loaded from a separate script to execute predictions on unseen data.
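A minimal sketch of that workflow, assuming the current experimental names (`skrub.var`, `.skb.mark_as_X` / `.skb.mark_as_y`, `.skb.apply`, `.skb.get_pipeline`); the final name of the export method is exactly what this thread is deciding:

```python
import pandas as pd
import skrub
from sklearn.linear_model import LogisticRegression

train_df = pd.DataFrame(
    {"amount": [10.0, 250.0, 30.0, 900.0], "fraud": [0, 1, 0, 1]}
)

data = skrub.var("data", train_df)              # Variable: entry point of the plan
X = data.drop(columns="fraud").skb.mark_as_X()  # each step is a recorded data op
y = data["fraud"].skb.mark_as_y()
pred = X.skb.apply(skrub.TableVectorizer()).skb.apply(LogisticRegression(), y=y)

learner = pred.skb.get_pipeline()  # export the recorded plan as a standalone object
learner.fit({"data": train_df})    # "replay" the operations, here on training data
```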

This is the very high level overview of the "data plan", which in a talk would then be followed by the credit fraud example to explain leakage and more stuff.

I found this wording to be easier to explain, and I think it conveys much better the process of "combining single data operations into a full data plan, which can then be exported as a single object".

I also think that "record" and "replay" convey well the process. Another term I think is important is "wrapper", as the expressions "wrap" around user objects.

Changes to the current implementation: I think we should rename expressions to data operations (or data ops), rename what is now the .Expr namespace to .DataOp (or .DataOps), and rename either .skb.full_report() or .skb.graph() to .skb.data_plan() (or something like that). I don't think we need to add a "DataPlan" object to the namespace.

I am also working on a new version of the first example where I'm trying to implement the new names and highlight more how the learner can be applied to unseen data.

@Vincent-Maladiere also liked this change in terminology

rcap107 avatar Jun 30 '25 15:06 rcap107

Thanks for your super thorough and helpful work.

I think that this is a great suggestion overall. I think I would go for DataOps rather than DataOp. One of the reasons is that DataOps mirrors MLOps or DevOps; I think that makes it an especially cool name in this respect.

I would still like a tiny bit of user study. Do you think that you could try this write-up, alongside the corresponding example, on a few users? The best way to do this would probably be to write it up (e.g. in the docs, in a PR, with a link to the example), have a user read it, and then ask the user to explain what the purpose and benefit of this feature is (explaining the exercise and its purpose to the user).

GaelVaroquaux avatar Jun 30 '25 21:06 GaelVaroquaux

And honestly, "Skrub DataOps" is such a good catchphrase. This could be the phrasing that we put forward as much as possible.

GaelVaroquaux avatar Jun 30 '25 21:06 GaelVaroquaux

I've updated the starting example in #1481 to try and implement some of the new terminology

rcap107 avatar Jul 01 '25 15:07 rcap107

> I would still like a tiny bit of user study. Do you think that you could try this write-up, alongside the corresponding example, on a few users?

About this, maybe @Vincent-Maladiere can use the example and this discussion to test the waters in the Saint Gobain sprint tomorrow

rcap107 avatar Jul 02 '25 08:07 rcap107

Adding more comments from discussions with others.

Concerning DataOp vs. DataOps: the name of a singular operation should be "Data Op", but as a collective name for the feature "DataOps" makes sense. In docstrings and examples, they would be either "Data Op" or "Ops" depending on context.

An alternative would be going directly for "skrub ops" rather than "data ops" given that user ops also act on data.

Given an example like `d = X().skb.apply(imputer).skb.apply(scaler)`, `X()` is a variable, each `.skb.apply()` is a data op, and the result is also a data op. Collectively, the entire thing is a "data plan", which can be accessed from each data op through (currently) `draw_graph`. This is the kind of structure I have in mind:

[diagram: a variable feeding into a chain of data ops, which together form the data plan]

DataOps may be applied either to Variables (which become the "entry point" into the data plan) or to other data ops; a runnable sketch of this structure follows.
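Expanding the inline example above into a runnable sketch (assuming the current experimental API; `SimpleImputer` and `StandardScaler` stand in for any user operations):

```python
import skrub
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = skrub.X()                     # a variable: the entry point into the data plan
d = X.skb.apply(SimpleImputer()).skb.apply(StandardScaler())  # two data ops

# The structure built so far -- the data plan -- can currently be inspected
# from any data op:
d.skb.draw_graph()
```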

I understand the problem of conveying the difference between data ops and the data plan here, and I am not yet sure how to do that.

This is similar to the issue with the `learner`, so my explanation is something along the lines of "you compose data ops until you're done, and then you can convert all the operations so far (the data plan) into a single standalone object (the learner)"; the data plan is accessible, like the learner, via a method on each data op.

In my previous message, I used terms like "record and replay" and "wrapper". I think those terms convey the meaning well, and I will be using them when explaining this feature, but they're not meant to be formalized as part of the API.

The separation between user ops and data ops is similar to what an OS does with kernel space and user space, but I am not sure how much emphasis we should put on this when we explain the feature (maybe as a note?).

The .full_report and draw_graph functions should not be renamed directly to data_plan, to avoid giving the impression that we are converting a data op into a data plan.

I mixed up parameters and arguments: the plan records the arguments (the values at the function call) rather than the parameters.

rcap107 avatar Jul 02 '25 09:07 rcap107

Putting a pin in this: it's also possible to reuse the "pipeline"/"learner" simply for preprocessing, which might still be interesting for someone who wants to repeat the same transformations on different data but doesn't need to train a model. This is also what .skb.truncated_after is used for.

A problem that comes up is that under the new naming this would still be a "learner", even though it wouldn't be learning much.

rcap107 avatar Jul 02 '25 12:07 rcap107

thank you very much for all this work @rcap107 !

jeromedockes avatar Jul 06 '25 15:07 jeromedockes

> About this, maybe @Vincent-Maladiere can use the example and this discussion to test the waters in the Saint Gobain sprint tomorrow

Unfortunately, I failed to get people excited about this, and the motivations behind the data plan didn't resonate. Let's see if we're luckier during the next sprints.

Vincent-Maladiere avatar Jul 07 '25 08:07 Vincent-Maladiere