broom.helpers
Improve output for workflows/parsnip objects estimated with tidymodels
The tidymodels objects don't have the same 'terms' information as a standard model. But I do think all this information is stored in the object. Like I mentioned before, I am no tidymodels expert, but here are some tidbits I think are helpful for tackling this issue.
- workflows/parsnip models don't use model.frame() and model.matrix(). Rather, they use model_frame() and model_matrix(), which do not remove missing data from the returned data frame/matrix.
- I think there is something similar for terms?
- I think parsnip has its own tidiers? Not sure if they are full-featured, with all models supported? Not sure how workflows fall into this.
- Given a workflow object, there is a function to extract the parsnip fit. There is a followup function that you can use to extract the original model fit from there.
- tidymodels creates dummy variables from categorical variables. This could pose an issue identifying the underlying variable, and subsequently adding header rows.
I am not sure what the best way is to add support for these objects, but I think it'll be a combination of updates to broom.helpers and gtsummary (passing a default tidier for these model types). I think this will be helpful to add, but to be honest, I am not sure I have the time right now. So we can just keep this open until one of us can get to it?
I do not see any model_frame() or model_matrix() functions in tidymodels.
However, it seems that the original output of the modelling functions is kept in model_fit objects and could be used by broom.helpers.
library(broom.helpers)
library(gtsummary)
#>
#> Attaching package: 'gtsummary'
#> The following objects are masked from 'package:broom.helpers':
#>
#> all_continuous, all_contrasts
library(tidymodels)
trial$response <- factor(trial$response)
mod <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(response ~ age + stage + grade, data = trial)
class(mod)
#> [1] "_glm" "model_fit"
original_fit <- mod$fit
tidy_plus_plus(original_fit) %>% knitr::kable()
term | variable | var_label | var_class | var_type | var_nlevels | contrasts | contrasts_type | reference_row | label | n_obs | n_event | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | age | Age | numeric | continuous | NA | NA | NA | NA | Age | 183 | 58 | 0.0193570 | 0.0114933 | 1.6841945 | 0.0921441 | -0.0028454 | 0.0424236 |
stageT1 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | TRUE | T1 | 50 | 18 | 0.0000000 | NA | NA | NA | NA | NA |
stageT2 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | FALSE | T2 | 51 | 13 | -0.5676561 | 0.4432868 | -1.2805618 | 0.2003476 | -1.4535410 | 0.2939708 |
stageT3 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | FALSE | T3 | 39 | 14 | -0.0961995 | 0.4570279 | -0.2104893 | 0.8332858 | -1.0030772 | 0.7976585 |
stageT4 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | FALSE | T4 | 43 | 13 | -0.2679732 | 0.4536436 | -0.5907130 | 0.5547127 | -1.1710339 | 0.6171288 |
gradeI | grade | Grade | factor | categorical | 3 | contr.treatment | treatment | TRUE | I | 65 | 21 | 0.0000000 | NA | NA | NA | NA | NA |
gradeII | grade | Grade | factor | categorical | 3 | contr.treatment | treatment | FALSE | II | 58 | 17 | -0.1731542 | 0.4025511 | -0.4301422 | 0.6670922 | -0.9709344 | 0.6143866 |
gradeIII | grade | Grade | factor | categorical | 3 | contr.treatment | treatment | FALSE | III | 60 | 20 | 0.0443406 | 0.3889227 | 0.1140087 | 0.9092309 | -0.7215734 | 0.8092799 |
Created on 2022-06-20 by the reprex package (v2.0.1)
We could probably support parsnip models whose engines are already supported by broom.helpers.
Regarding workflow objects, we can use extract_fit_parsnip() to get the model_fit object. We have to decide whether we should support workflow objects directly or whether it should be the responsibility of the user to extract the model_fit.
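To make the two options concrete, here is a minimal sketch of what "unwrapping" could look like. `unwrap_engine_fit()` is a hypothetical helper, not an existing broom.helpers function, and the parsnip object is mocked by hand (a list with a `fit` element and the `model_fit` class) so that the sketch runs without tidymodels installed. The workflow branch assumes the workflows package is available.

```r
# Hypothetical helper (not part of broom.helpers): walk down from a fitted
# workflow or a parsnip model_fit to the object returned by the engine.
unwrap_engine_fit <- function(x) {
  if (inherits(x, "workflow")) {
    # extract_fit_parsnip() returns the parsnip model_fit stored in a fitted workflow
    x <- workflows::extract_fit_parsnip(x)
  }
  if (inherits(x, "model_fit")) {
    # parsnip keeps the engine's own result in the `fit` element
    x <- x$fit
  }
  x
}

# Engine-level fit, as lm() would return it
engine_fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Hand-built stand-in for a parsnip object (assumption: mimics only the
# `fit` element and the class vector seen in the reprex above)
mock_model_fit <- structure(list(fit = engine_fit), class = c("_lm", "model_fit"))

class(unwrap_engine_fit(mock_model_fit))
#> [1] "lm"

# Objects that are already engine-level fits pass through unchanged
identical(unwrap_engine_fit(engine_fit), engine_fit)
#> [1] TRUE
```

If broom.helpers handled workflows itself, something like this would live inside the package; otherwise the user would do the same two steps by hand before calling tidy_plus_plus().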
> The tidymodels objects don't have the same 'terms' information as a standard model. But I do think all this information is stored in the object. Like I mentioned before, I am no tidymodels expert, but here are some tidbits I think are helpful for tackling this issue.
parsnip is just a wrapper. Therefore, it depends on the engine used and the type of model.
> - workflows/parsnip models don't use model.frame() and model.matrix(). Rather, they use model_frame() and model_matrix(), which do not remove missing data from the returned data frame/matrix.
I didn't find any generic model_frame() or model_matrix(). Are you sure they are part of tidymodels?
> - I think there is something similar for terms?
> - I think parsnip has its own tidiers? Not sure if they are full-featured, with all models supported? Not sure how workflows fall into this.
The tidiers seem to be available for all models covered by parsnip. With the current proposal in #161, we use the parsnip tidier while extracting the other information from model$fit.
> - Given a workflow object, there is a function to extract the parsnip fit. There is a followup function that you can use to extract the original model fit from there.
True. This is what is proposed in #161.
> - tidymodels creates dummy variables from categorical variables. This could pose an issue identifying the underlying variable, and subsequently adding header rows.
>
> I am not sure what the best way is to add support for these objects, but I think it'll be a combination of updates to broom.helpers and gtsummary (passing a default tidier for these model types). I think this will be helpful to add, but to be honest, I am not sure I have the time right now. So we can just keep this open until one of us can get to it?
This is a good question. First of all, creating dummy variables is not mandatory. You can use tidymodels without converting factors into dummy variables. The conversion is required only for engines/models that do not support factors, like glmnet.
Second, if step_dummy() is applied, what is passed to the model is a numeric binary variable and not a binary factor. As long as the data has been transformed before modelling, should we assume that the user wants several binary variables rather than a contrast? So far, when a numeric binary variable (e.g. trial$death) is passed to a model, we still assume it is a continuous variable and we do not add a reference level. See examples below.
Re-identifying how the data was transformed before modelling would require working with a workflow object and not only with the model_fit. And it may be quite a challenge to re-identify all the transformations. In any case, I'm not sure that it would be a good idea: the data could have been transformed for many reasons.
library(broom.helpers)
library(gtsummary)
#>
#> Attaching package: 'gtsummary'
#> The following objects are masked from 'package:broom.helpers':
#>
#> all_continuous, all_contrasts
library(tidymodels)
trial$response <- factor(trial$response)
rec1 <- recipe(response ~ age + stage + death, data = trial)
rec2 <- rec1 %>% step_dummy(stage)
mod1 <- workflow(rec1, logistic_reg()) %>%
  fit(trial) %>%
  extract_fit_parsnip()
mod2 <- workflow(rec2, logistic_reg()) %>%
  fit(trial) %>%
  extract_fit_parsnip()
tidy_plus_plus(mod1)
#> # A tibble: 6 x 18
#> term variable var_label var_class var_type var_nlevels contrasts
#> <chr> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 age age Age numeric continuous NA <NA>
#> 2 stageT1 stage stage factor categorical 4 contr.treatme~
#> 3 stageT2 stage stage factor categorical 4 contr.treatme~
#> 4 stageT3 stage stage factor categorical 4 contr.treatme~
#> 5 stageT4 stage stage factor categorical 4 contr.treatme~
#> 6 death death Patient Died integer continuous NA <NA>
#> # ... with 11 more variables: contrasts_type <chr>, reference_row <lgl>,
#> # label <chr>, n_obs <dbl>, n_event <dbl>, estimate <dbl>, std.error <dbl>,
#> # statistic <dbl>, p.value <dbl>, conf.low <dbl>, conf.high <dbl>
tidy_plus_plus(mod2)
#> # A tibble: 5 x 18
#> term variable var_label var_class var_type var_nlevels contrasts
#> <chr> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 age age Age numeric continuous NA <NA>
#> 2 death death Patient Died integer continuous NA <NA>
#> 3 stage_T2 stage_T2 stage_T2 numeric continuous NA <NA>
#> 4 stage_T3 stage_T3 stage_T3 numeric continuous NA <NA>
#> 5 stage_T4 stage_T4 stage_T4 numeric continuous NA <NA>
#> # ... with 11 more variables: contrasts_type <chr>, reference_row <lgl>,
#> # label <chr>, n_obs <dbl>, n_event <dbl>, estimate <dbl>, std.error <dbl>,
#> # statistic <dbl>, p.value <dbl>, conf.low <dbl>, conf.high <dbl>
Created on 2022-06-21 by the reprex package (v2.0.1)
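The same effect can be reproduced in base R, which may help show why the original variable is hard to recover once dummies have been created beforehand. This is only a sketch on mtcars (not the trial data used above): it fits the same linear model once with a factor and once with hand-made indicator columns, then compares the metadata stored in each fit. The indicator columns are built with model.matrix() as a stand-in for step_dummy(), which is an assumption about equivalence made for illustration only.

```r
dat <- mtcars
dat$cyl_f <- factor(dat$cyl)

# Indicator columns built up front (stand-in for what step_dummy() would do);
# the intercept column is dropped, leaving cyl_f6 and cyl_f8
dummies <- model.matrix(~ cyl_f, data = dat)[, -1]
dat_dummy <- cbind(dat, dummies)

fit_factor <- lm(mpg ~ wt + cyl_f, data = dat)
fit_dummy  <- lm(mpg ~ wt + cyl_f6 + cyl_f8, data = dat_dummy)

# Same estimates either way
all.equal(unname(coef(fit_factor)), unname(coef(fit_dummy)))
#> [1] TRUE

# With a factor, the fit records which variable the levels belong to
names(fit_factor$xlevels)
#> [1] "cyl_f"

# With pre-built dummies, nothing links cyl_f6 and cyl_f8 to a common variable
is.null(fit_dummy$xlevels)
#> [1] TRUE

# The terms object only sees numeric columns in the second case
attr(terms(fit_dummy), "dataClasses")
```

So from the engine-level fit alone, the dummy columns are indistinguishable from ordinary numeric predictors, which matches the tidy_plus_plus(mod2) output above.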
Oy, sorry, I had so many things wrong about tidymodels 😆
- I must have been thinking of model_frame() in hardhat (also a part of tidymodels) and thought it was used everywhere: https://hardhat.tidymodels.org/reference/model_frame.html
- I'll take a look at the PR you created! Awesome! This is what I was doing in gtsummary (well, still am, but I'll update gtsummary after broom.helpers is released).
- OK, I think I misunderstood the default behaviour in workflows. I thought the default was to create dummy variables for all factors/characters with the default "blueprint". Here's how I was checking for it in gtsummary:
#' @export
#' @rdname tbl_regression_methods
tbl_regression.model_fit <- function(x, ...) {
  message("Extracting {parsnip} model fit with `tbl_regression(x = x$fit, ...)`")
  tbl_regression(x = x$fit, ...)
}

#' @export
#' @rdname tbl_regression_methods
tbl_regression.workflow <- function(x, ...) {
  assert_package("workflows", "tbl_regression.workflow()")
  if (isTRUE(!x$pre$actions$formula$blueprint$indicators %in% "none")) {
    paste("To take full advantage of model formatting, e.g. grouping categorical",
          "variables, please add the following argument to the `workflows::add_model()` call:") %>%
      stringr::str_wrap() %>%
      paste("`blueprint = hardhat::default_formula_blueprint(indicators = 'none')`", sep = "\n") %>%
      paste("\n") %>%
      rlang::inform()
  }
  paste("Extracting {workflows} model fit with",
        "`workflows::extract_fit_parsnip(x) %>% tbl_regression(...)`") %>%
    message()
  tbl_regression(x = workflows::extract_fit_parsnip(x), ...)
}
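For reference, a sketch of how the blueprint named in that message could be supplied by a user. It assumes the workflows, parsnip and hardhat packages are installed, and it passes `blueprint` to `add_formula()`, which is where that argument lives as far as I know. mtcars and linear_reg() are used purely for illustration, and no output is shown because it was not run here.

```r
library(workflows)
library(parsnip)
library(hardhat)

dat <- mtcars
dat$cyl <- factor(dat$cyl)

# Blueprint that leaves factors as factors instead of expanding them
# into indicator columns before the data reaches the engine
bp <- default_formula_blueprint(indicators = "none")

wf <- workflow()
wf <- add_formula(wf, mpg ~ wt + cyl, blueprint = bp)
wf <- add_model(wf, linear_reg())  # default engine is "lm"

wf_fit <- fit(wf, data = dat)

# The field inspected by the tbl_regression.workflow() method above
wf_fit$pre$actions$formula$blueprint$indicators

# Engine-level fit, which is what tidy_plus_plus() would ultimately see
engine_fit <- extract_fit_engine(wf_fit)
class(engine_fit)
```

With indicators left as "none", the engine receives the factor itself, so the check in the method above would not trigger the informational message.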
FYI I am going to submit a gtsummary release this week. Perhaps for the next release we can coordinate for a unified experience for these workflows and parsnip objects.
> FYI I am going to submit a gtsummary release this week. Perhaps for the next release we can coordinate for a unified experience for these workflows and parsnip objects.
With pleasure and we can organize a quick zoom call if it is more convenient.
Regards
Closing it for now