gtsummary icon indicating copy to clipboard operation
gtsummary copied to clipboard

Discussions around a restructure

Open larmarange opened this issue 3 years ago • 8 comments

Sometimes, when using a by argument with tbl_summary(), it could be relevant to have an overall row at the beginning or the end of the table (similar to what already exists in tbl_custom_summary().

Shoud we have a method add_overall_row()? It will display the equivalent of displaying a variable where all values would be equal to TRUE (i.e. as a dichotomous var)

larmarange avatar Oct 24 '22 18:10 larmarange

So it would be equivalent to the total row in tbl_cross()

larmarange avatar Oct 24 '22 18:10 larmarange

Sounds interesting!

I've been contemplating other changes that this could work well with.

  • I think gtsummary is great, but it can definitely be faster.
  • I was thinking a re-write of many of the core functions could speed things up. For example, focusing on speed improvements in tbl_custom_summary(), and then letting this be the "powerhouse" function that is used in the background for tbl_summary(), tbl_cross(), tbl_continuous(), etc etc.
  • I've been chatting with the maintainer of gt about speed improvements and he recommended the {profvis} to identify pieces of the code base that take a long time to run.
  • The recent changes in our upstream dependencies have been disheartening (aka annoying) to deal with...so much work on our side due to deprecations of upstream packages.
  • As a part of the rewrite, I also want to remove many dependencies. I am hoping that by doing this, the maintenance of the package in the future will be far more minimal
  • I also want to move all non-core functionality out of gtsummary. For example, we have so many packages in "Suggests" because we export tidiers for GAMs, MICE models, generic Wald tests, and so much more. All these dependencies make maintenance difficult. A clean, streamlined gtsummary package will be easier to maintain.

That is a lot of work, but I do think it could be worth it in the long run and "future-proof" ourselves against other breaking changes. I would like to spend less time maintaining the package, and if more upfront work gets us there, I think it's worth it.

Anyway, with big changes coming (hopefully), incorporating a new class of functions like the one you're proposing would be good.

One last thing: Are you in any open Slack groups that we can both join? Sometimes I think it would be helpful to be able to instant message each other. If not, you can join rinpharma or pharmaverse Slack channels?

ddsjoberg avatar Oct 24 '22 19:10 ddsjoberg

Dear @ddsjoberg

It sounds very exciting.

  • I agree that it could be a good moment for a rewrite of core functions
  • working on tbl_custom_summary() is a good idea as many other functions could be derived from it
    • maybe some work to optimize summary for categorical variables but we already have some pieces for it
  • it would an opportunity for better mutualisation of the code and maybe a reorganisation of internal functions
  • add_overall_row feature is already covered by tbl_custom_summary()
  • we could develop just after tbl_custom_svysummary()
  • some optimisation will maybe have to be done in broom.helpers, for example .assert_package() (we could find a way to cache the computation of dependencies)
  • regarding custom tidiers and some non-core functionality for covering certain types of models, it could maybe be moved to broom.helpers (Suggest dependencies from broom.helpers are not installed by default when installing gtsummary)
  • do you have already any idea of where the code is the more slow?

Regarding Slack, I'm not using it at the moment. I'm using Microsot Teams with colleagues and WhatsApp quite a lot when I need quick feedback. But I can join a slack group if easier for you.

We can also organize a Zoom brainstorming session if and when it could be relevant.

larmarange avatar Oct 25 '22 17:10 larmarange

Quick question: should a rewrite be compatible with historical version of R or could it be developed only for R >= 4.2?

Just using the new pipe could save some time, cf. https://michaelbarrowman.co.uk/post/the-new-base-pipe/

larmarange avatar Oct 25 '22 17:10 larmarange

Cool! I think a brainstorming session would be fantastic. Let's set something up for mid-November? I like all your suggestions above!

We could perhaps release all these major updates as v2.0.0 with a few other changes too

  • gt is nearly ready for production with Word and RTF.
  • In this release, we could change the knit_print() methods to no longer conditionally use {flextable} or {knitr} depending on the output type.
  • We can use gt for all output types.

R Version: I am torn about which versions of R to support. Ideally, I would like to continue to support the last 4 releases like the tidyverse does. BUT I do NOT want to do this big revamp again, so maybe it's best to just use base pipe in the back end now. As more R versions are released, we can AIM to get 4 versions supported again as more versions are released. It would also be nice to use \(x) anonymous function as well. FWIW, the tidyverse has removed all references to the formula shortcut function notation, and I think we can follow suit: still support it for now, but do not advertise it.

Dependencies: I think a good goal is to get the dependencies could be reduced to:

    broom (>= 0.8.0),
    broom.helpers (>= 1.9.0),
    cli (>= 3.1.1),
    dplyr (>= 1.0.7),
    glue (>= 1.6.0),
    gt (>= 0.7.0),
    lifecycle (>= 1.0.1),
    rlang (>= 1.0.3),

I think it would be difficult (well, even more difficult) to eliminate any of the above deps. The freebase pkg implements

Error Messaging: I would like to adopt cli error messaging throughout the package. I need to read more on error classes though. They seem to be used heavily throughout dplyr, and could be useful for us too.

ddsjoberg avatar Oct 26 '22 00:10 ddsjoberg

I completely agree. For users with an older version of R, it would still be possible to install version 1 (remotes::insll_version()), in particular if we communicate properly.

A version 2 would also be the best place for any breaking change (if relevant).

It could be an opportunity to rediscuss the style_*() family of functions. In scales(), the label_*() family return labelled functions (and not directly format the numbers) so it is easier to call them in ggplot2. We have a similar issue here.

From 7 to 19 November, I will be traveling a lot for work. But happy for a call before or later.

larmarange avatar Oct 26 '22 07:10 larmarange

Sorry about getting in the middle of the conversation. Just wanted to enthusiastically support improving the performance of gtsummary, and add a couple comments.

I started moving from sjPlot to gtsummary a while ago, but gtsummary is so much slower that I end up using sjPlot for development, and only switch to gtsummary when everything else is ready. Just as en example, for a single table in one of my current projects, sjPlot takes .350s and gtsummary 3.2s. But I need to plot 24 tables, so ends up being 8 vs 70 seconds.

Regarding ways to sped up the code, in my experience, something that makes a big difference speed-wise for dplyr-heavy pipelines is using dtplyr. It generally just means adding a dtplyr::lazy_dt() |> at the beginning, and as_tibble() at the end. I know it goes against reducing dependencies, but the benefit may be quite substantial, for a minimal effort.

Below a very simple example of this. ONE takes ~ 3 times more than TWO. In more complex pipe chains, the gains can be even bigger:

microbenchmark::microbenchmark(

ONE = lme4::Arabidopsis |> 
  dplyr::mutate(computed_var = gen / rack,
         another_var = 
           dplyr::case_when(
             amd == "clipped" ~ 1,
             amd == "unclipped" ~ 0,
             TRUE ~ NA
           )) |> 
  dplyr::rename(population = popu),

TWO = lme4::Arabidopsis |> 
  dtplyr::lazy_dt() |> # Create a "lazy" data.table
  dplyr::mutate(computed_var = gen / rack,
         another_var = 
           dplyr::case_when(
             amd == "clipped" ~ 1,
             amd == "unclipped" ~ 0,
             TRUE ~ NA
           )) |> 
  dplyr::rename(population = popu) |> 
  dplyr::as_tibble() # Back to tibble

)
expr min lq mean median uq max neval cld
ONE 17.03125 17.284820 18.162854 17.49912 17.944565 30.31573 100 a
TWO 5.40582 5.536725 5.732491 5.63138 5.776665 8.15552 100 b

gorkang avatar Mar 17 '23 12:03 gorkang