
Improve output for workflows/parsnip objects estimated with tidymodels

Open ddsjoberg opened this issue 2 years ago • 6 comments

The tidymodels objects don't have the same 'terms' information as a standard model. But I do think all of this information is stored in the object. Like I mentioned before, I am no tidymodels expert, but here are some tidbits I think are helpful for tackling this issue.

  1. workflows/parsnip models don't use model.frame() and model.matrix(). Rather, they use model_frame() and model_matrix(), which do not remove missing data from the returned data frame/matrix.
  2. I think there is something similar for terms?
  3. I think parsnip has its own tidiers? Not sure if it is fully featured for all supported models? Not sure how workflows fit into this.
  4. Given a workflow object, there is a function to extract the parsnip fit. There is a follow-up function that you can use to extract the original model fit from there (see the sketch just after this list).
  5. tidymodels creates dummy variables from categorical variables. This could pose an issue for identifying the underlying variable, and subsequently adding header rows.
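
A minimal sketch of point 4, assuming the {workflows} extractors extract_fit_parsnip() and extract_fit_engine(), and the trial data from gtsummary (with the outcome converted to a factor):

library(gtsummary)
library(tidymodels)

trial$response <- factor(trial$response)

wf_fit <- workflow(recipe(response ~ age + stage, data = trial), logistic_reg()) %>%
  fit(trial)

extract_fit_parsnip(wf_fit)  # the parsnip "model_fit" object
extract_fit_engine(wf_fit)   # the original engine fit, here a plain "glm"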

I am not sure what the best way to add support for these objects is, but I think it will be a combination of updates to broom.helpers and gtsummary (passing a default tidier for these model types). I think this will be helpful to add, but to be honest, I am not sure I have the time right now. So we can just keep this open until one of us can get to it?

ddsjoberg · Jun 20 '22

I do not see any model_frame() or model_matrix() functions in tidymodels.

However, it seems that the original output of the modelling functions is kept in model_fit objects and could be used by broom.helpers.

library(broom.helpers)
library(gtsummary)
#> 
#> Attaching package: 'gtsummary'
#> The following objects are masked from 'package:broom.helpers':
#> 
#>     all_continuous, all_contrasts
library(tidymodels)

trial$response <- factor(trial$response)

mod <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(response ~ age + stage + grade, data = trial)

class(mod)
#> [1] "_glm"      "model_fit"

original_fit <- mod$fit

tidy_plus_plus(original_fit) %>% knitr::kable()
| term | variable | var_label | var_class | var_type | var_nlevels | contrasts | contrasts_type | reference_row | label | n_obs | n_event | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | age | Age | numeric | continuous | NA | NA | NA | NA | Age | 183 | 58 | 0.0193570 | 0.0114933 | 1.6841945 | 0.0921441 | -0.0028454 | 0.0424236 |
| stageT1 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | TRUE | T1 | 50 | 18 | 0.0000000 | NA | NA | NA | NA | NA |
| stageT2 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | FALSE | T2 | 51 | 13 | -0.5676561 | 0.4432868 | -1.2805618 | 0.2003476 | -1.4535410 | 0.2939708 |
| stageT3 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | FALSE | T3 | 39 | 14 | -0.0961995 | 0.4570279 | -0.2104893 | 0.8332858 | -1.0030772 | 0.7976585 |
| stageT4 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | FALSE | T4 | 43 | 13 | -0.2679732 | 0.4536436 | -0.5907130 | 0.5547127 | -1.1710339 | 0.6171288 |
| gradeI | grade | Grade | factor | categorical | 3 | contr.treatment | treatment | TRUE | I | 65 | 21 | 0.0000000 | NA | NA | NA | NA | NA |
| gradeII | grade | Grade | factor | categorical | 3 | contr.treatment | treatment | FALSE | II | 58 | 17 | -0.1731542 | 0.4025511 | -0.4301422 | 0.6670922 | -0.9709344 | 0.6143866 |
| gradeIII | grade | Grade | factor | categorical | 3 | contr.treatment | treatment | FALSE | III | 60 | 20 | 0.0443406 | 0.3889227 | 0.1140087 | 0.9092309 | -0.7215734 | 0.8092799 |

Created on 2022-06-20 by the reprex package (v2.0.1)

larmarange · Jun 20 '22

We could probably support parsnip models whose engines are already supported by broom.helpers.
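
Purely as an illustration of that idea (the helper name below is hypothetical, not the actual implementation): a parsnip model_fit keeps the original engine fit in $fit, so it could be unwrapped before calling the existing machinery.

tidy_parsnip_engine <- function(model, ...) {
  # Hypothetical wrapper: unwrap a parsnip "model_fit" to its engine fit,
  # then rely on broom.helpers' existing support for that engine.
  if (inherits(model, "model_fit")) {
    model <- model$fit
  }
  broom.helpers::tidy_plus_plus(model, ...)
}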

larmarange · Jun 20 '22

Regarding workflow objects, we can use extract_fit_parsnip() to get the model_fit object. We have to decide whether we should support workflow objects directly or whether it should be the responsibility of the user to extract the model_fit.

larmarange · Jun 20 '22

The tidymodels objects don't have the same 'terms' information as a standard model. But I do think all of this information is stored in the object. Like I mentioned before, I am no tidymodels expert, but here are some tidbits I think are helpful for tackling this issue.

parsnip is just a wrapper. Therefore, it depends on the engine used and the type of model.

  1. workflows/parsnip models don't use model.frame() and model.matrix(). Rather, they use model_frame() and model_matrix(), which do not remove missing data from the returned data frame/matrix.

I didn't find any generic model_frame() or model_matrix(). Are you sure they are part of tidymodels?

  2. I think there is something similar for terms?
  3. I think parsnip has its own tidiers? Not sure if it is fully featured for all supported models? Not sure how workflows fit into this.

They seem to be available for all models covered by parsnip. With the current proposal in #161, we use the parsnip tidier while extracting the other information from model$fit.

  4. Given a workflow object, there is a function to extract the parsnip fit. There is a follow-up function that you can use to extract the original model fit from there.

True. This is what is proposed in #161.

  5. tidymodels creates dummy variables from categorical variables. This could pose an issue for identifying the underlying variable, and subsequently adding header rows.

I am not sure what the best way to add support for these objects is, but I think it will be a combination of updates to broom.helpers and gtsummary (passing a default tidier for these model types). I think this will be helpful to add, but to be honest, I am not sure I have the time right now. So we can just keep this open until one of us can get to it?

This is a good question. First of all, this is not an obligation: you can use tidymodels without converting factors into dummy variables. It is required only for the engines/models that do not support factors, such as glmnet.

Second, if step_dummy() is applied, what is passed to the model is a set of numeric binary variables, not a factor. Since the data have been transformed before modelling, should we assume that the user wants several binary variables rather than a contrast? So far, when a numeric binary variable (e.g. trial$death) is passed to a model, we still assume it is a continuous variable and we do not add a reference level. See the examples below.

Re-identifying how the data were transformed before modelling would require working with the workflow object, not only with the model_fit (a short sketch of accessing the preprocessor follows the reprex below), and it could be quite a challenge to re-identify all the transformations. I'm also not sure it would be a good idea: the data could have been transformed for many reasons.

library(broom.helpers)
library(gtsummary)
#> 
#> Attaching package: 'gtsummary'
#> The following objects are masked from 'package:broom.helpers':
#> 
#>     all_continuous, all_contrasts
library(tidymodels)

trial$response <- factor(trial$response)

rec1 <- recipe(response ~ age + stage + death, data = trial)
rec2 <- rec1 %>% step_dummy(stage)

mod1 <- workflow(rec1, logistic_reg()) %>% 
  fit(trial) %>%
  extract_fit_parsnip()

mod2 <- workflow(rec2, logistic_reg()) %>% 
  fit(trial) %>%
  extract_fit_parsnip()

tidy_plus_plus(mod1)
#> # A tibble: 6 x 18
#>   term    variable var_label    var_class var_type    var_nlevels contrasts     
#>   <chr>   <chr>    <chr>        <chr>     <chr>             <int> <chr>         
#> 1 age     age      Age          numeric   continuous           NA <NA>          
#> 2 stageT1 stage    stage        factor    categorical           4 contr.treatme~
#> 3 stageT2 stage    stage        factor    categorical           4 contr.treatme~
#> 4 stageT3 stage    stage        factor    categorical           4 contr.treatme~
#> 5 stageT4 stage    stage        factor    categorical           4 contr.treatme~
#> 6 death   death    Patient Died integer   continuous           NA <NA>          
#> # ... with 11 more variables: contrasts_type <chr>, reference_row <lgl>,
#> #   label <chr>, n_obs <dbl>, n_event <dbl>, estimate <dbl>, std.error <dbl>,
#> #   statistic <dbl>, p.value <dbl>, conf.low <dbl>, conf.high <dbl>
tidy_plus_plus(mod2)
#> # A tibble: 5 x 18
#>   term     variable var_label    var_class var_type   var_nlevels contrasts
#>   <chr>    <chr>    <chr>        <chr>     <chr>            <int> <chr>    
#> 1 age      age      Age          numeric   continuous          NA <NA>     
#> 2 death    death    Patient Died integer   continuous          NA <NA>     
#> 3 stage_T2 stage_T2 stage_T2     numeric   continuous          NA <NA>     
#> 4 stage_T3 stage_T3 stage_T3     numeric   continuous          NA <NA>     
#> 5 stage_T4 stage_T4 stage_T4     numeric   continuous          NA <NA>     
#> # ... with 11 more variables: contrasts_type <chr>, reference_row <lgl>,
#> #   label <chr>, n_obs <dbl>, n_event <dbl>, estimate <dbl>, std.error <dbl>,
#> #   statistic <dbl>, p.value <dbl>, conf.low <dbl>, conf.high <dbl>

Created on 2022-06-21 by the reprex package (v2.0.1)
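
As a side note on accessing the transformations: a fitted workflow keeps its preprocessor, so the recipe steps could in principle be inspected. A small sketch, assuming the {workflows} extractors and reusing rec2 from the reprex above:

wf2 <- workflow(rec2, logistic_reg()) %>%
  fit(trial)

extract_preprocessor(wf2)  # the recipe as added, including step_dummy(stage)
extract_recipe(wf2)        # the recipe after it has been estimated during fit()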

larmarange · Jun 21 '22

Oy, sorry, I got so many things wrong about tidymodels 😆

  1. I must have been thinking of model_frame() in hardhat (also part of tidymodels) and thought it was used everywhere: https://hardhat.tidymodels.org/reference/model_frame.html (a short sketch of it follows the code below).

  2. I'll take a look at the PR you created! Awesome! This is what I was doing in gtsummary (well, still am, but I'll update gtsummary after broom.helpers is released).

  3. OK, I think I misunderstood the default behaviour in workflows. I thought the default was to create dummy variables for all factors/characters with the default "blueprint". Here's how I was checking for it in gtsummary:

#' @export
#' @rdname tbl_regression_methods
tbl_regression.model_fit <- function(x, ...) {
  message("Extracting {parsnip} model fit with `tbl_regression(x = x$fit, ...)`")
  tbl_regression(x = x$fit, ...)
}

#' @export
#' @rdname tbl_regression_methods
tbl_regression.workflow <- function(x, ...) {
  assert_package("workflows", "tbl_regression.workflow()")

  if (isTRUE(!x$pre$actions$formula$blueprint$indicators %in% "none")) {
    paste("To take full advantage of model formatting, e.g. grouping categorical",
          "variables, please add the following argument to the `workflows::add_model()` call:") %>%
      stringr::str_wrap() %>%
      paste("`blueprint = hardhat::default_formula_blueprint(indicators = 'none')`", sep = "\n") %>%
      paste("\n") %>%
      rlang::inform()
  }

  paste("Extracting {workflows} model fit with",
        "`workflows::extract_fit_parsnip(x) %>% tbl_regression(...)`") %>%
  message()

  tbl_regression(x = workflows::extract_fit_parsnip(x), ...)
}
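
Regarding point 1 above, a rough sketch of those hardhat helpers (signatures assumed from the page linked above; per that point, they do not drop rows with missing data):

library(hardhat)
library(gtsummary)  # for the trial data

framed <- model_frame(response ~ age + stage, data = trial)
framed$terms   # a classic terms object
framed$data    # the data, kept as a tibble

model_matrix(framed$terms, framed$data)  # design-matrix columns as a tibble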

FYI, I am going to submit a gtsummary release this week. Perhaps for the next release we can coordinate for a unified experience for these workflows and parsnip objects.

ddsjoberg · Jun 21 '22

FYI, I am going to submit a gtsummary release this week. Perhaps for the next release we can coordinate for a unified experience for these workflows and parsnip objects.

With pleasure, and we can organize a quick Zoom call if that is more convenient.

Regards

larmarange · Jun 21 '22

Closing it for now

larmarange · Jul 27 '24