parsnip model vs engine
I am developing a package which implements extensions to parsnip.
I am trying to understand the parsnip design principle of a model versus an engine so that I can implement the extensions without incurring redundancy or confusion.
I have read the Details section of the documentation pages for parsnip models:
"This function only defines what type of model is being fit. Once an engine is specified, the method to fit the model is also defined."
and I am following the excellent parsnip guide for building new models, but the boundary between model and engine is not yet completely clear for me.
Here are the two particular cases on my mind:
- Logistic regression with a specific linear equality constraint on the regression coefficients (and with L1 and/or L2 regularization). Should this be a new engine to logistic_reg, or a new model entirely? It would have the exact same main arguments as logistic_reg, so I am leaning towards "engine", but the constraint makes the fitted model fundamentally different from the fitted model that can be obtained with any other engine for logistic_reg. The constraint also induces a special interpretation of the fitted regression coefficients which is useful in its own right which the user would want to know about, for what it's worth.
- k-nearest neighbors with a user-defined distance function. Should this be a new engine to nearest_neighbor, or a new model entirely? It would require a new main argument for the model (e.g. dist_func) - is that too much for an engine? No other engine could produce this result because they do not accept arbitrary distance functions.
Expanding on these two specific cases, here are some overarching questions about the boundary between model and engine:
- Conceptually, are engines just different underlying implementations for achieving essentially the same range of fitted models? That is, should we be able to take any two engines for the same model, and with enough parameter tweaking, obtain essentially the same fitted model from each?
- If a candidate engine for a model would be able to return a fitted model that is fundamentally different from the fitted models returned from the other engines for the same model - not just a better fit, but fundamentally different in that other fits do not approximate it (e.g. constrained regression coefficients) - then should that candidate engine in fact be a new standalone model?
- If a candidate model is very similar to an existing model in parsnip, does having different "main arguments" mean that the candidate model should be implemented as a new standalone model in parsnip rather than as a new engine for the extant model?
Think of the model type as the structure of the model/prediction equation. So logistic_reg() captures models that have a linear predictor that models some monotonic transformation of a binomial/Bernoulli probability.
The engine can be conceptualized by the estimation method. So logistic_reg(engine = "glm") means basic maximum likelihood, logistic_reg(engine = "stan") means Bayesian estimation, etc.
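As a small illustration of that split (a sketch only, using the two engines named above), the same model specification can be paired with different engines without changing its type:

```r
library(parsnip)

# The model type fixes the structure of the prediction equation.
spec <- logistic_reg()

# The engine selects how that structure is estimated.
glm_spec  <- spec %>% set_engine("glm")   # maximum likelihood
stan_spec <- spec %>% set_engine("stan")  # Bayesian estimation

# Both objects are the same kind of specification; only the engine differs.
class(glm_spec)
glm_spec$engine
stan_spec$engine
```

Nothing is fit here; the engine's package is only needed once fit() is called.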
I would encourage you to re-use as many parsnip model functions as you can and add new engines. I think that the main parameters are frozen or, at least, we'd add a new argument if it is a commonly used parameter that we've missed.
The good thing is that we can automate the usage and tuning of engine arguments pretty easily. For example, see this blog post.
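To make the point about engine arguments concrete, here is a minimal sketch (not taken from the linked post) in which an argument that belongs to the ranger engine, rather than to the model function, is marked for tuning alongside a main argument:

```r
library(tidymodels)

# min_n is a main argument of rand_forest();
# regularization.factor is specific to the ranger engine.
rf_spec <- rand_forest(min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("ranger", regularization.factor = tune())

# Engine arguments are stored separately from the main arguments.
names(rf_spec$args)
names(rf_spec$eng_args)
```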
I suggest working on a parsnip-adjacent package and ask for review before you finalize it. You certainly don't need our approval but there have been a number of parsnip-adjacent packages lately that work but have interfaces that are fundamentally opposite of what we are trying to do with tidymodels and the tidyverse.
Some specific answers to your questions:
should we be able to take any two engines for the same model, and with enough parameter tweaking, obtain essentially the same fitted model from each?
No. They can be different estimators of the same structural equation.
If a candidate engine for a model would be able to return a fitted model that is fundamentally different from the fitted models returned from the other engines for the same model ... then should that candidate engine in fact be a new standalone model?
No. The underlying model objects across engines are usually very different. That's encouraged. We want to make sure that the API/experience that users have is the same. Most of that is outlined here.
If a candidate model is very similar to an existing model in parsnip, does having different "main arguments" mean that the candidate model should be implemented as a new standalone model in parsnip rather than as a new engine for the extant model?
No. We'd encourage you to use the same model function (with the same main arguments) and use dials and a tunable() method to make it easy to use your engine with engine-specific arguments.
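For the constrained logistic regression case, that advice might look roughly like the sketch below. Every name specific to the new engine is hypothetical: the engine "constrained", the package "mypkg", its fitting function constrained_logit() with arguments lambda and alpha, and the engine-specific argument constraint. Only the parsnip registration functions are real. Prediction modules (set_pred()) are left out, so this registers the engine but is not enough to fit or predict.

```r
library(parsnip)

# Hypothetical: engine "constrained", backed by mypkg::constrained_logit().
set_model_engine("logistic_reg", mode = "classification", eng = "constrained")
set_dependency("logistic_reg", eng = "constrained", pkg = "mypkg")

# Map the existing main arguments onto the (hypothetical) function's arguments.
set_model_arg(
  model = "logistic_reg", eng = "constrained",
  parsnip = "penalty", original = "lambda",
  func = list(pkg = "dials", fun = "penalty"),
  has_submodel = FALSE
)
set_model_arg(
  model = "logistic_reg", eng = "constrained",
  parsnip = "mixture", original = "alpha",
  func = list(pkg = "dials", fun = "mixture"),
  has_submodel = FALSE
)

# Describe how the (hypothetical) fitting function is called.
set_fit(
  model = "logistic_reg", eng = "constrained", mode = "classification",
  value = list(
    interface = "formula",
    protect = c("formula", "data"),
    func = c(pkg = "mypkg", fun = "constrained_logit"),
    defaults = list()
  )
)
set_encoding(
  model = "logistic_reg", eng = "constrained", mode = "classification",
  options = list(
    predictor_indicators = "traditional",
    compute_intercept = TRUE,
    remove_intercept = TRUE,
    allow_sparse_x = FALSE
  )
)

# Same model function and main arguments as any other logistic_reg() engine;
# the constraint is passed as an engine-specific argument (hypothetical value).
spec <- logistic_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("constrained", constraint = "sum_to_zero")
```

The user-facing call stays logistic_reg() with penalty and mixture, and anything unique to the estimator goes through set_engine().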
Thank you for the helpful explanations, it is clearer now.
I suggest working on a parsnip-adjacent package and ask for review before you finalize it. You certainly don't need our approval but there have been a number of parsnip-adjacent packages lately that work but have interfaces that are fundamentally opposite of what we are trying to do with tidymodels and the tidyverse.
Sure, I would certainly appreciate feedback and approval. What is the preferred avenue for notifying the team and sharing the prospective package for feedback?
Put it in a GH repo and tag me when you have questions 👍
Put it in a GH repo and tag me when you have questions 👍
OK, I invited you and @juliasilge to the repo and posted some questions as issues and tagged both of you in those issues.
Glad we have this discussion documented publicly! As it's been a couple years since follow-up here, I'll go ahead and close.