
Improve output for workflows/parsnip objects estimated with tidymodels

Open ddsjoberg opened this issue 2 years ago • 6 comments

The tidymodels objects don't have the same 'terms' information as a standard model. But I do think all of this information is stored in the object. Like I mentioned before, I am no tidymodels expert, but here are some tidbits I think are helpful for tackling this issue.

  1. workflows/parsnip models don't use model.frame() and model.matrix(). Rather, they use model_frame() and model_matrix(), which do not remove missing data from the returned data frame/matrix.
  2. I think there is something similar for terms?
  3. I think parsnip has its own tidiers? Not sure if it is fully featured for all supported models? Not sure how workflows fit into this.
  4. Given a workflow object, there is a function to extract the parsnip fit. There is a follow-up function that you can use to extract the original model fit from there (see the sketch just after this list).
  5. tidymodels creates dummy variables from categorical variables. This could pose an issue for identifying the underlying variable, and subsequently adding header rows.
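
A minimal sketch of point 4, assuming the {workflows} extractors extract_fit_parsnip() and extract_fit_engine(), and the trial data from gtsummary (with the outcome converted to a factor):

library(gtsummary)
library(tidymodels)

trial$response <- factor(trial$response)

wf_fit <- workflow(recipe(response ~ age + stage, data = trial), logistic_reg()) %>%
  fit(trial)

extract_fit_parsnip(wf_fit)  # the parsnip "model_fit" object
extract_fit_engine(wf_fit)   # the original engine fit, here a plain "glm"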

I am not sure what the best way to add support for these objects is, but I think it will be a combination of updates to broom.helpers and gtsummary (passing a default tidier for these model types). I think this will be helpful to add, but to be honest, I am not sure I have the time right now. So we can just keep this open until one of us can get to it?

ddsjoberg · Jun 20 '22

I do not see any model_frame() or model_matrix() functions in tidymodels.

However, it seems that the original output of the modelling functions is kept in model_fit objects and could be used by broom.helpers.

library(broom.helpers)
library(gtsummary)
#> 
#> Attaching package: 'gtsummary'
#> The following objects are masked from 'package:broom.helpers':
#> 
#>     all_continuous, all_contrasts
library(tidymodels)

trial$response <- factor(trial$response)

mod <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(response ~ age + stage + grade, data = trial)

class(mod)
#> [1] "_glm"      "model_fit"

original_fit <- mod$fit

tidy_plus_plus(original_fit) %>% knitr::kable()
| term | variable | var_label | var_class | var_type | var_nlevels | contrasts | contrasts_type | reference_row | label | n_obs | n_event | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | age | Age | numeric | continuous | NA | NA | NA | NA | Age | 183 | 58 | 0.0193570 | 0.0114933 | 1.6841945 | 0.0921441 | -0.0028454 | 0.0424236 |
| stageT1 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | TRUE | T1 | 50 | 18 | 0.0000000 | NA | NA | NA | NA | NA |
| stageT2 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | FALSE | T2 | 51 | 13 | -0.5676561 | 0.4432868 | -1.2805618 | 0.2003476 | -1.4535410 | 0.2939708 |
| stageT3 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | FALSE | T3 | 39 | 14 | -0.0961995 | 0.4570279 | -0.2104893 | 0.8332858 | -1.0030772 | 0.7976585 |
| stageT4 | stage | T Stage | factor | categorical | 4 | contr.treatment | treatment | FALSE | T4 | 43 | 13 | -0.2679732 | 0.4536436 | -0.5907130 | 0.5547127 | -1.1710339 | 0.6171288 |
| gradeI | grade | Grade | factor | categorical | 3 | contr.treatment | treatment | TRUE | I | 65 | 21 | 0.0000000 | NA | NA | NA | NA | NA |
| gradeII | grade | Grade | factor | categorical | 3 | contr.treatment | treatment | FALSE | II | 58 | 17 | -0.1731542 | 0.4025511 | -0.4301422 | 0.6670922 | -0.9709344 | 0.6143866 |
| gradeIII | grade | Grade | factor | categorical | 3 | contr.treatment | treatment | FALSE | III | 60 | 20 | 0.0443406 | 0.3889227 | 0.1140087 | 0.9092309 | -0.7215734 | 0.8092799 |

Created on 2022-06-20 by the reprex package (v2.0.1)

larmarange · Jun 20 '22

We could probably support parsnip models whose engines are already supported by broom.helpers.
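
Purely as an illustration of that idea (the helper name below is hypothetical, not the actual implementation): a parsnip model_fit keeps the original engine fit in $fit, so it could be unwrapped before calling the existing machinery.

tidy_parsnip_engine <- function(model, ...) {
  # Hypothetical wrapper: unwrap a parsnip "model_fit" to its engine fit,
  # then rely on broom.helpers' existing support for that engine.
  if (inherits(model, "model_fit")) {
    model <- model$fit
  }
  broom.helpers::tidy_plus_plus(model, ...)
}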

larmarange · Jun 20 '22

Regarding workflow objects, we can use extract_fit_parsnip() to get the model_fit object. We have to decide whether we should support workflow objects directly or whether it should be the responsibility of the user to extract the model_fit.

larmarange · Jun 20 '22

The tidymodels objects don't have the same 'terms' information as a standard model. But I do think all of this information is stored in the object. Like I mentioned before, I am no tidymodels expert, but here are some tidbits I think are helpful for tackling this issue.

parsnip is just a wrapper. Therefore, it depends on the engine used and the type of model.

  1. workflows/parsnip models don't use model.frame() and model.matrix(). Rather, they use model_frame() and model_matrix(), which do not remove missing data from the returned data frame/matrix.

I didn't find any generic model_frame() or model_matrix(). Are you sure they are part of tidymodels?

  2. I think there is something similar for terms?
  3. I think parsnip has its own tidiers? Not sure if it is fully featured for all supported models? Not sure how workflows fit into this.

They seem to be available for all models covered by parsnip. With the current proposal in #161, we use the parsnip tidier while extracting the other information from model$fit.

  4. Given a workflow object, there is a function to extract the parsnip fit. There is a follow-up function that you can use to extract the original model fit from there.

True. This is what is proposed in #161.

  5. tidymodels creates dummy variables from categorical variables. This could pose an issue for identifying the underlying variable, and subsequently adding header rows.

I am not sure what the best way to add support for these objects is, but I think it will be a combination of updates to broom.helpers and gtsummary (passing a default tidier for these model types). I think this will be helpful to add, but to be honest, I am not sure I have the time right now. So we can just keep this open until one of us can get to it?

This is a good question. First of all, this is not an obligation: you can use tidymodels without converting factors into dummy variables. It is required only for the engines/models that do not support factors, such as glmnet.

Second, if step_dummy() is applied, what is passed to the model is a set of numeric binary variables, not a factor. Since the data have been transformed before modelling, should we assume that the user wants several binary variables rather than a contrast? So far, when a numeric binary variable (e.g. trial$death) is passed to a model, we still assume it is a continuous variable and we do not add a reference level. See the examples below.

Re-identifying how the data were transformed before modelling would require working with the workflow object, not only with the model_fit (a short sketch of accessing the preprocessor follows the reprex below), and it could be quite a challenge to re-identify all the transformations. I'm also not sure it would be a good idea: the data could have been transformed for many reasons.

library(broom.helpers)
library(gtsummary)
#> 
#> Attaching package: 'gtsummary'
#> The following objects are masked from 'package:broom.helpers':
#> 
#>     all_continuous, all_contrasts
library(tidymodels)

trial$response <- factor(trial$response)

rec1 <- recipe(response ~ age + stage + death, data = trial)
rec2 <- rec1 %>% step_dummy(stage)

mod1 <- workflow(rec1, logistic_reg()) %>% 
  fit(trial) %>%
  extract_fit_parsnip()

mod2 <- workflow(rec2, logistic_reg()) %>% 
  fit(trial) %>%
  extract_fit_parsnip()

tidy_plus_plus(mod1)
#> # A tibble: 6 x 18
#>   term    variable var_label    var_class var_type    var_nlevels contrasts     
#>   <chr>   <chr>    <chr>        <chr>     <chr>             <int> <chr>         
#> 1 age     age      Age          numeric   continuous           NA <NA>          
#> 2 stageT1 stage    stage        factor    categorical           4 contr.treatme~
#> 3 stageT2 stage    stage        factor    categorical           4 contr.treatme~
#> 4 stageT3 stage    stage        factor    categorical           4 contr.treatme~
#> 5 stageT4 stage    stage        factor    categorical           4 contr.treatme~
#> 6 death   death    Patient Died integer   continuous           NA <NA>          
#> # ... with 11 more variables: contrasts_type <chr>, reference_row <lgl>,
#> #   label <chr>, n_obs <dbl>, n_event <dbl>, estimate <dbl>, std.error <dbl>,
#> #   statistic <dbl>, p.value <dbl>, conf.low <dbl>, conf.high <dbl>
tidy_plus_plus(mod2)
#> # A tibble: 5 x 18
#>   term     variable var_label    var_class var_type   var_nlevels contrasts
#>   <chr>    <chr>    <chr>        <chr>     <chr>            <int> <chr>    
#> 1 age      age      Age          numeric   continuous          NA <NA>     
#> 2 death    death    Patient Died integer   continuous          NA <NA>     
#> 3 stage_T2 stage_T2 stage_T2     numeric   continuous          NA <NA>     
#> 4 stage_T3 stage_T3 stage_T3     numeric   continuous          NA <NA>     
#> 5 stage_T4 stage_T4 stage_T4     numeric   continuous          NA <NA>     
#> # ... with 11 more variables: contrasts_type <chr>, reference_row <lgl>,
#> #   label <chr>, n_obs <dbl>, n_event <dbl>, estimate <dbl>, std.error <dbl>,
#> #   statistic <dbl>, p.value <dbl>, conf.low <dbl>, conf.high <dbl>

Created on 2022-06-21 by the reprex package (v2.0.1)
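
As a side note on accessing the transformations: a fitted workflow keeps its preprocessor, so the recipe steps could in principle be inspected. A small sketch, assuming the {workflows} extractors and reusing rec2 from the reprex above:

wf2 <- workflow(rec2, logistic_reg()) %>%
  fit(trial)

extract_preprocessor(wf2)  # the recipe as added, including step_dummy(stage)
extract_recipe(wf2)        # the recipe after it has been estimated during fit()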

larmarange · Jun 21 '22

Oy, sorry, I got so many things wrong about tidymodels 😆

  1. I must have been thinking of model_frame() in hardhat (also part of tidymodels) and thought it was used everywhere: https://hardhat.tidymodels.org/reference/model_frame.html (a short sketch of it follows the code below).

  2. I'll take a look at the PR you created! Awesome! This is what I was doing in gtsummary (well, still am, but I'll update gtsummary after broom.helpers is released).

  3. OK, I think I misunderstood the default behaviour in workflows. I thought the default was to create dummy variables for all factors/characters with the default "blueprint". Here's how I was checking for it in gtsummary:

#' @export
#' @rdname tbl_regression_methods
tbl_regression.model_fit <- function(x, ...) {
  message("Extracting {parsnip} model fit with `tbl_regression(x = x$fit, ...)`")
  tbl_regression(x = x$fit, ...)
}

#' @export
#' @rdname tbl_regression_methods
tbl_regression.workflow <- function(x, ...) {
  assert_package("workflows", "tbl_regression.workflow()")

  if (isTRUE(!x$pre$actions$formula$blueprint$indicators %in% "none")) {
    paste("To take full advantage of model formatting, e.g. grouping categorical",
          "variables, please add the following argument to the `workflows::add_model()` call:") %>%
      stringr::str_wrap() %>%
      paste("`blueprint = hardhat::default_formula_blueprint(indicators = 'none')`", sep = "\n") %>%
      paste("\n") %>%
      rlang::inform()
  }

  paste("Extracting {workflows} model fit with",
        "`workflows::extract_fit_parsnip(x) %>% tbl_regression(...)`") %>%
  message()

  tbl_regression(x = workflows::extract_fit_parsnip(x), ...)
}
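
Regarding point 1 above, a rough sketch of those hardhat helpers (signatures assumed from the page linked above; per that point, they do not drop rows with missing data):

library(hardhat)
library(gtsummary)  # for the trial data

framed <- model_frame(response ~ age + stage, data = trial)
framed$terms   # a classic terms object
framed$data    # the data, kept as a tibble

model_matrix(framed$terms, framed$data)  # design-matrix columns as a tibble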

FYI, I am going to submit a gtsummary release this week. Perhaps for the next release we can coordinate for a unified experience for these workflows and parsnip objects.

ddsjoberg · Jun 21 '22

FYI, I am going to submit a gtsummary release this week. Perhaps for the next release we can coordinate for a unified experience for these workflows and parsnip objects.

With pleasure, and we can organize a quick Zoom call if that is more convenient.

Regards

larmarange · Jun 21 '22

Closing it for now

larmarange · Jul 27 '24