
aorsf: accelerated oblique random survival forests

Open bcjaeger opened this issue 2 years ago • 85 comments

Submitting Author Name: Byron C Jaeger
Submitting Author Github Handle: @bcjaeger
Other Package Authors Github handles: @nmpieyeskey, @sawyerWeld
Repository: https://github.com/bcjaeger/aorsf
Version submitted: 0.0.0.9000
Submission type: Stats
Badge grade: gold
Editor: @tdhock
Reviewers: @chjackson, @mnwright, @jemus42

Due date for @chjackson: 2022-07-29
Due date for @mnwright: 2022-09-21
Due date for @jemus42: 2022-09-21
Archive: TBD
Version accepted: TBD
Language: en

  • Paste the full DESCRIPTION file inside a code block below:
```
Package: aorsf
Title: Accelerated Oblique Random Survival Forests
Version: 0.0.0.9000
Authors@R: c(
    person(given = "Byron",
           family = "Jaeger",
           role = c("aut", "cre"),
           email = "[email protected]",
           comment = c(ORCID = "0000-0001-7399-2299")),
    person(given = "Nicholas",  family = "Pajewski", role = "ctb"),
    person(given = "Sawyer", family = "Welden", role = "ctb", email = "[email protected]")
    )
Description: Fit, interpret, and make predictions with oblique random
    survival forests. Oblique decision trees are notoriously slow compared
    to their axis based counterparts, but 'aorsf' runs as fast or faster than 
    axis-based decision tree algorithms for right-censored time-to-event 
    outcomes.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE, roclets = c("namespace", "rd", "srr::srr_stats_roclet"))
RoxygenNote: 7.1.2
LinkingTo: 
    Rcpp,
    RcppArmadillo
Imports: 
    table.glue,
    Rcpp,
    data.table
URL: https://github.com/bcjaeger/aorsf,
    https://bcjaeger.github.io/aorsf
BugReports: https://github.com/bcjaeger/aorsf/issues
Depends: 
    R (>= 3.6)
Suggests: 
    survival,
    survivalROC,
    ggplot2,
    testthat (>= 3.0.0),
    knitr,
    rmarkdown,
    glmnet,
    covr,
    units
Config/testthat/edition: 3
VignetteBuilder: knitr
```

Pre-submission Inquiry

  • [X] A pre-submission inquiry has been approved in issue #525

General Information

  • Who is the target audience and what are scientific applications of this package?

Target audience: people who want to fit and interpret a risk prediction model, i.e., a prediction model for right-censored time-to-event outcomes.

Applications: fit an oblique random survival forest, compute predicted risk at a given time, estimate the importance of individual variables, and compute partial dependence to depict relationships between specific predictors and predicted risk.

  • Paste your responses to our General Standard G1.1 here, describing whether your software is:

    • An improvement on other implementations of similar algorithms in R.

    Please include hyperlinked references to all other relevant software.

    The obliqueRSF package was the original R package for oblique random survival forests. I wrote it, and it is very slow. I wrote aorsf because I had multiple ideas about how to make obliqueRSF faster, chiefly using a partial Newton-Raphson algorithm instead of penalized regression to derive linear combinations of variables in decision nodes. It would have been possible to rewrite obliqueRSF, but it would have been difficult to make the rewrite backwards compatible with the version of obliqueRSF on CRAN.

  • (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Not applicable

Badging

Gold

  1. Compliance with a good number of standards beyond those identified as minimally necessary. aorsf complies with over 100 combined standards in the general and ML categories.
  2. Demonstrating excellence in compliance with multiple standards from at least two broad sub-categories. See 1. above
  3. Internal aspects of package structure and design. aorsf uses an optimized routine to partially complete Newton-Raphson scoring for the Cox proportional hazards model and another optimized routine to compute likelihood ratio tests. Both routines are used heavily when fitting oblique random survival forests, and both produce exactly the same answers as the corresponding functions in the survival package (see tests in aorsf) while running at least twice as fast (thanks to RcppArmadillo).
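For readers less familiar with the scoring routine mentioned above, a single Newton-Raphson update for the Cox partial log-likelihood can be sketched in standard textbook notation (these symbols are illustrative, not aorsf's internal names):

```latex
% One Newton-Raphson update for the Cox partial log-likelihood \ell(\beta),
% where U(\beta) is the score vector (gradient of \ell) and I(\beta) is the
% observed information matrix (negative Hessian of \ell):
\beta^{(k+1)} = \beta^{(k)} + I\!\left(\beta^{(k)}\right)^{-1} U\!\left(\beta^{(k)}\right)
```

A "partial" scoring routine runs only a small, fixed number of these updates per decision node instead of iterating to convergence, which is plausibly where much of the speed-up over penalized regression comes from.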

Technical checks

Confirm each of the following by checking the box.

I think aorsf is passing autotest and srr_stats_pre_submit(). I am having some issues running these on R 4.2. Currently, autotest is returning NULL, which I understand to be a good thing, and srr_stats_pre_submit is not able to run (not sure why; but it was fine before I updated to R 4.2).

This package:

Publication options

  • [X] Do you intend for this package to go on CRAN?
  • [ ] Do you intend for this package to go on Bioconductor?

Code of conduct

  • [X] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

bcjaeger avatar Apr 28 '22 00:04 bcjaeger

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot avatar Apr 28 '22 00:04 ropensci-review-bot

:rocket:

The following problem was found in your submission template:

  • 'statsgrade' variable must be one of [bronze, silver, gold]

Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

:wave:

ropensci-review-bot avatar Apr 28 '22 00:04 ropensci-review-bot

> 🚀
>
> The following problem was found in your submission template:
>
>   • 'statsgrade' variable must be one of [bronze, silver, gold]
>
> Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.
>
> 👋

Just updated to fix this. I'm not sure if reviewers will think that 'gold' is the right statsgrade, but might as well aim high.

bcjaeger avatar Apr 28 '22 00:04 bcjaeger

Checks for aorsf (v0.0.0.9000)

git hash: c73bb98c

  • :heavy_check_mark: Package name is available
  • :heavy_check_mark: has a 'codemeta.json' file.
  • :heavy_check_mark: has a 'contributing' file.
  • :heavy_check_mark: uses 'roxygen2'.
  • :heavy_check_mark: 'DESCRIPTION' has a URL field.
  • :heavy_check_mark: 'DESCRIPTION' has a BugReports field.
  • :heavy_check_mark: Package has at least one HTML vignette
  • :heavy_check_mark: All functions have examples.
  • :heavy_check_mark: Package has continuous integration checks.
  • :heavy_check_mark: Package coverage is 97.1%.
  • :heavy_check_mark: R CMD check found no errors.
  • :heavy_check_mark: R CMD check found no warnings.

Package License: MIT + file LICENSE


1. rOpenSci Statistical Standards (srr package)

This package is in the following category:

  • Machine Learning

:heavy_check_mark: All applicable standards [v0.1.0] have been documented in this package (102 complied with; 56 N/A standards)

Click to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report() function from within a local clone of the repository.


2. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

| type       | package       | ncalls |
|------------|---------------|--------|
| internal   | base          | 341    |
| internal   | aorsf         | 152    |
| internal   | utils         | 13     |
| internal   | stats         | 11     |
| internal   | methods       | 1      |
| imports    | table.glue    | 3      |
| imports    | Rcpp          | NA     |
| imports    | data.table    | NA     |
| suggests   | glmnet        | 1      |
| suggests   | survival      | NA     |
| suggests   | survivalROC   | NA     |
| suggests   | ggplot2       | NA     |
| suggests   | testthat      | NA     |
| suggests   | knitr         | NA     |
| suggests   | rmarkdown     | NA     |
| suggests   | covr          | NA     |
| suggests   | units         | NA     |
| linking_to | Rcpp          | NA     |
| linking_to | RcppArmadillo | NA     |

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

attr (40), names (35), for (23), length (21), c (20), paste (16), list (14), paste0 (12), mode (10), seq_along (10), vector (10), which (9), as.matrix (8), rep (7), as.integer (6), drop (6), seq (6), switch (6), order (5), match (4), min (4), setdiff (4), colnames (3), inherits (3), ncol (3), Sys.time (3), all (2), any (2), as.factor (2), cbind (2), data.frame (2), grepl (2), lapply (2), levels (2), matrix (2), nchar (2), nrow (2), rev (2), rle (2), row.names (2), sum (2), suppressWarnings (2), try (2), all.vars (1), as.data.frame (1), class (1), deparse (1), formals (1), grep (1), if (1), ifelse (1), intersect (1), is.na (1), max (1), mean (1), print (1), return (1), rownames (1), sapply (1), t (1), typeof (1), unique (1)

aorsf

paste_collapse (8), fctr_info (5), get_fctr_info (5), get_names_x (5), f_oobag_eval (3), get_n_obs (3), get_n_tree (3), get_names_y (3), get_numeric_bounds (3), last_value (3), orsf_fit (3), ref_code (3), unit_info (3), check_var_types (2), get_f_oobag_eval (2), get_importance (2), get_leaf_min_events (2), get_leaf_min_obs (2), get_max_time (2), get_mtry (2), get_n_events (2), get_n_leaves_mean (2), get_oobag_eval_every (2), get_type_oobag_eval (2), get_unit_info (2), is_empty (2), list_init (2), orsf_control_net (2), orsf_pd_summary (2), orsf_train_ (2), select_cols (2), check_arg_bound (1), check_arg_gt (1), check_arg_gteq (1), check_arg_is (1), check_arg_is_integer (1), check_arg_is_valid (1), check_arg_length (1), check_arg_lt (1), check_arg_lteq (1), check_arg_type (1), check_arg_uni (1), check_control_cph (1), check_control_net (1), check_new_data_fctrs (1), check_new_data_names (1), check_new_data_types (1), check_oobag_fun (1), check_orsf_inputs (1), check_pd_inputs (1), check_predict (1), check_units (1), contains_oobag (1), contains_vi (1), f_beta (1), fctr_check (1), fctr_check_levels (1), fctr_id_check (1), get_cph_do_scale (1), get_cph_eps (1), get_cph_iter_max (1), get_cph_method (1), get_cph_pval_max (1), get_f_beta (1), get_n_retry (1), get_n_split (1), get_net_alpha (1), get_net_df_target (1), get_oobag_pred (1), get_oobag_time (1), get_orsf_type (1), get_split_min_events (1), get_split_min_obs (1), get_tree_seeds (1), get_types_x (1), insert_vals (1), is_aorsf (1), is_error (1), is_trained (1), leaf_kaplan_testthat (1), lrt_multi_testthat (1), newtraph_cph_testthat (1), oobag_c_harrell_testthat (1), orsf (1), orsf_control_cph (1), orsf_oob_vi (1), orsf_pd_ (1), orsf_pd_ice (1), orsf_pred_multi (1), orsf_pred_uni (1), orsf_scale_cph (1), orsf_summarize_uni (1), orsf_time_to_train (1), orsf_train (1), orsf_unscale_cph (1), orsf_vi_ (1), x_node_scale_exported (1)

utils

data (13)

stats

formula (4), dt (2), terms (2), family (1), time (1), weights (1)

table.glue

round_spec (1), round_using_magnitude (1), table_value (1)

glmnet

glmnet (1)

methods

new (1)


3. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in C++ (48% in 2 files) and R (52% in 21 files)
  • 1 author
  • 3 vignettes
  • 1 internal data file
  • 3 imported packages
  • 15 exported functions (median 15 lines of code)
  • 216 non-exported functions in R (median 3 lines of code)
  • 48 C++ functions (median 38 lines of code)

Statistical properties of package structure, as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function.

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

| measure                  | value | percentile | noteworthy |
|--------------------------|-------|------------|------------|
| files_R                  | 21    | 82.3       |            |
| files_src                | 2     | 79.1       |            |
| files_vignettes          | 3     | 92.4       |            |
| files_tests              | 18    | 95.7       |            |
| loc_R                    | 2139  | 85.4       |            |
| loc_src                  | 1982  | 75.6       |            |
| loc_vignettes            | 342   | 67.9       |            |
| loc_tests                | 1532  | 91.7       |            |
| num_vignettes            | 3     | 94.2       |            |
| data_size_total          | 9034  | 70.4       |            |
| data_size_median         | 9034  | 78.5       |            |
| n_fns_r                  | 231   | 91.6       |            |
| n_fns_r_exported         | 15    | 58.5       |            |
| n_fns_r_not_exported     | 216   | 94.0       |            |
| n_fns_src                | 48    | 66.0       |            |
| n_fns_per_file_r         | 7     | 77.6       |            |
| n_fns_per_file_src       | 24    | 94.8       |            |
| num_params_per_fn        | 3     | 33.6       |            |
| loc_per_fn_r             | 3     | 1.1        | TRUE       |
| loc_per_fn_r_exp         | 15    | 35.6       |            |
| loc_per_fn_r_not_exp     | 3     | 1.5        | TRUE       |
| loc_per_fn_src           | 38    | 88.9       |            |
| rel_whitespace_R         | 50    | 96.1       | TRUE       |
| rel_whitespace_src       | 58    | 89.6       |            |
| rel_whitespace_vignettes | 56    | 83.5       |            |
| rel_whitespace_tests     | 45    | 96.8       | TRUE       |
| doclines_per_fn_exp      | 46    | 58.1       |            |
| doclines_per_fn_not_exp  | 0     | 0.0        | TRUE       |
| fn_call_network_size     | 364   | 93.5       |            |

3a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


4. goodpractice and other checks

Details of goodpractice and other checks (click to open)

3a. Continuous Integration Badges

R-CMD-check pkgcheck

GitHub Workflow Results

| name                       | conclusion | sha    | date       |
|----------------------------|------------|--------|------------|
| Commands                   | skipped    | 069021 | 2022-04-22 |
| pages build and deployment | success    | c2efe9 | 2022-04-28 |
| pkgcheck                   | success    | c73bb9 | 2022-04-28 |
| pkgdown                    | success    | c73bb9 | 2022-04-28 |
| R-CMD-check                | success    | c73bb9 | 2022-04-28 |
| test-coverage              | success    | c73bb9 | 2022-04-28 |

3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following note:

  1. checking installed package size ... NOTE
     installed size is 6.9Mb
     sub-directories of 1Mb or more: libs 6.0Mb

R CMD check generated the following check_fails:

  1. no_import_package_as_a_whole
  2. rcmdcheck_reasonable_installed_size

Test coverage with covr

Package coverage: 97.13

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

| function             | cyclocomplexity |
|----------------------|-----------------|
| orsf                 | 33              |
| check_orsf_inputs    | 28              |
| orsf_pd_             | 22              |
| ref_code             | 17              |
| check_new_data_names | 15              |
| check_predict        | 15              |

Static code analyses with lintr

lintr found the following 223 potential issues:

| message                                                       | number of times |
|---------------------------------------------------------------|-----------------|
| Avoid using sapply, consider vapply instead, that's type safe | 10              |
| Lines should not be more than 80 characters.                  | 182             |
| Use <-, not =, for assignment.                                | 31              |

Package Versions

| package  | version   |
|----------|-----------|
| pkgstats | 0.0.4.30  |
| pkgcheck | 0.0.3.11  |
| srr      | 0.0.1.149 |

Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

ropensci-review-bot avatar Apr 28 '22 10:04 ropensci-review-bot

@ropensci-review-bot assign @tdhock as editor

jooolia avatar May 04 '22 18:05 jooolia

Assigned! @tdhock is now the editor

ropensci-review-bot avatar May 04 '22 18:05 ropensci-review-bot

@ropensci-review-bot seeking reviewers

tdhock avatar May 09 '22 17:05 tdhock

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/532_status.svg)](https://github.com/ropensci/software-review/issues/532)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

ropensci-review-bot avatar May 09 '22 17:05 ropensci-review-bot

> Please add this badge to the README of your package repository:
>
> [![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/532_status.svg)](https://github.com/ropensci/software-review/issues/532)
>
> Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

Done! Looking forward to the review.

bcjaeger avatar May 10 '22 18:05 bcjaeger

Hi @tdhock,

Thank you for acting as the editor of this submission. I have recently added a paper.md file to the repository in hopes that aorsf can be eligible for an expedited review at the Journal of Open Source Software after the ropensci review. Can you let me know if any other files are needed or if other actions are needed from me to qualify for an expedited review at JOSS? Thank you!

bcjaeger avatar May 16 '22 11:05 bcjaeger

I'm not sure about the expedited JOSS review. You may ask @noamross or @mpadge.

tdhock avatar May 16 '22 17:05 tdhock

@bcjaeger The paper.md file should be sufficient. For JOSS submission, after RO review is complete you should submit to JOSS and link to this review thread in the pre-review JOSS thread. JOSS editors will review the paper.md and can opt to use RO's software reviews rather than finding additional reviewers.

noamross avatar May 16 '22 17:05 noamross

Thanks, @noamross! That makes sense.

bcjaeger avatar May 16 '22 20:05 bcjaeger

Hi @tdhock, I see we are still seeking reviewers. Is there anything I can do to help find reviewers for this submission?

bcjaeger avatar Jun 07 '22 12:06 bcjaeger

I have asked a few people to review, but no one has agreed yet. Do you have any ideas for potential reviewers to ask?

tdhock avatar Jun 14 '22 19:06 tdhock

Thanks! Assuming folks in the rOpenSci circle have been asked already, I'll offer some names (username) that may not have been asked yet but would be good reviewers for aorsf.

Terry Therneau (therneau) would be a good reviewer - a lot of the C code in aorsf is based on his survival package coxph routine.

Torsten Hothorn (thothorn) would also be a good reviewer - Torsten is the author of the party package and I'd like aorsf to look like the party package in 10 years.

Hannah Frick (hfrick), Emil Hvitfeldt (EmilHvitfeldt), Max Kuhn (topepo), Davis Vaughan (DavisVaughan), and Julia Silge (juliasilge) would all be good reviewers - they are all developers/contributors to the censored package, and I'd like aorsf to contribute to that package.

Raphael Sonabend (RaphaelS1), Andreas Bender (adibender), Michel Lang (mllg), and Patrick Schratz (pat-s) would all be good reviewers - they are developers/contributors to the mlr3proba package, and I'd like aorsf to contribute to that package.

bcjaeger avatar Jun 16 '22 11:06 bcjaeger

Hi @tdhock, I am thinking about reaching out to the folks I listed in my last comment. Before I try to reach them, I was wondering if you had already contacted them? If you have, I will not bother them again with a review request.

bcjaeger avatar Jul 02 '22 02:07 bcjaeger

Hi @bcjaeger actually I have not asked any of those reviewers yet, sorry! I just got back to the office from traveling. Yes that would be helpful if you could ask them to review. (although it is suggested to not use more than one of them, https://devguide.ropensci.org/editorguide.html?q=reviewers#where-to-look-for-reviewers)

tdhock avatar Jul 06 '22 17:07 tdhock

Hi @tdhock, thanks! I will reach out to the folks in my earlier post and let you know if anyone is willing to review aorsf. Hope you had a great trip!

bcjaeger avatar Jul 06 '22 17:07 bcjaeger

@ropensci-review-bot add @chjackson to reviewers

tdhock avatar Jul 08 '22 17:07 tdhock

@chjackson added to the reviewers list. Review due date is 2022-07-29. Thanks @chjackson for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot avatar Jul 08 '22 17:07 ropensci-review-bot

@chjackson: If you haven't done so, please fill this form for us to update our reviewers records.

ropensci-review-bot avatar Jul 08 '22 17:07 ropensci-review-bot

@tdhock Is this the right link for the reviewer guide: (https://devguide.ropensci.org/reviewerguide.html). The link in the bot post gives a 404.

chjackson avatar Jul 08 '22 17:07 chjackson

@nicholasjhorton yes thank you very much for agreeing to review.

tdhock avatar Jul 08 '22 17:07 tdhock

@nicholasjhorton and @chjackson, thank you for agreeing to review!! I am thrilled and looking forward to talking about aorsf with you.

bcjaeger avatar Jul 09 '22 11:07 bcjaeger

I'm not sure that I was asked to review. Can you please resend the invite @tdhock?

nicholasjhorton avatar Jul 09 '22 16:07 nicholasjhorton

@ropensci-review-bot add @nicholasjhorton to reviewers

tdhock avatar Jul 11 '22 17:07 tdhock

@nicholasjhorton added to the reviewers list. Review due date is 2022-08-01. Thanks @nicholasjhorton for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot avatar Jul 11 '22 17:07 ropensci-review-bot

@nicholasjhorton: If you haven't done so, please fill this form for us to update our reviewers records.

ropensci-review-bot avatar Jul 11 '22 17:07 ropensci-review-bot

Package Review

  • Briefly describe any working relationship you have (had) with the package authors.

    • None
  • ☒ As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need: clearly stating problems the software is designed to solve and its target audience in README

  • Installation instructions: for the development version of package and any non-standard dependencies in README

  • Vignette(s): demonstrating major functionality that runs successfully locally

  • Function Documentation: for all exported functions

    • I built and installed the github package locally, and did help(package="aorsf"). I couldn’t see a help("aorsf-package") from R - is this because it is labelled with keyword “internal”? I don’t know if a package overview help page is strictly necessary given the DESCRIPTION and README files, but I sometimes find them helpful (is this the “multiple points of entry” principle?).
  • Examples: (that run successfully locally) for all exported functions

    • There are currently no examples for orsf or orsf_train on the website, but these appeared in the local R help. All the others ran with no problem.
  • Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

    • I couldn’t find anything about community / contribution guidelines.

Functionality

  • Installation: Installation succeeds as documented.

  • Functionality: Any functional claims of the software have been confirmed.

  • Performance: Any performance claims of the software have been confirmed.

    • Given my lack of machine learning experience I don’t think I’m qualified to judge relative performance of different ML packages - I wouldn’t spot if they were using different assumptions that would cloud the comparison.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.

    • Looks fairly comprehensive on a quick look through, and they passed.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

    • I didn’t have time to go through every single point made in these guidelines, but what I saw of the package made me confident that it does what a well-organised and accessible package should do.

Estimated hours spent reviewing: 5

  • ☒ Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer (“rev” role) in the package DESCRIPTION file.

Review Comments

While I have a lot of experience with parametric statistical modelling of survival data, I am not knowledgeable about machine learning, so I can’t go into depth on the ML methods side of things. That said, the vignettes explained clearly what the method is supposed to do - I felt like I learnt some useful things from them.

As far as I can tell, this is a polished, easily accessible package that implements a widely useful method. The differences and advances from other packages and methods are explained concisely.

I tried the package out by using it on a survival dataset that I had been using to test some of my own work, the colon data from the survival package. This is a simple randomised trial that compares time to cancer recurrence or death for three treatment groups, and I considered treatment as the only predictor. This may not be an ideal application for a random forest model, but hopefully it is helpful to see how the method does on this case.

```r
library(survival)
library(dplyr)
colonc <- colon |> 
  filter(etype == 1) |>
  mutate(lev = factor(rx=="Lev"),
         lev5fu = factor(rx=="Lev+5FU"))
```

When using orsf with a single factor predictor, I got "Error: formula must have at least 2 predictors". Perhaps it would be helpful to explain the restriction to 2+ predictors somewhere, though I guess that it is rare that people will want to use a machine learning method with only one predictor.

```r
fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ rx)
```

When I recoded the data so the formula included two binary predictors, it worked. Though I realise that in this case the data contain only three out of the four possible combinations of the two binary variables, so it is impossible to identify whether there is an interaction.

I then calculated predicted survival at 2000 and 3000 days, under the two methods, for each of the four combinations of predictors.

```r
library(aorsf)
fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ lev + lev5fu)
fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ lev + lev5fu, n_tree = 100, n_split = 3)
pd_spec <- list(lev = c("FALSE","TRUE"), 
                lev5fu = c("FALSE","TRUE"))
pd_data <- orsf_pd_summary(object = fit, pd_spec = pd_spec,
                           pred_horizon = c(2000, 3000),
                           oobag = TRUE, expand_grid = TRUE, 
                           risk = FALSE)
pd_data

##    pred_horizon   lev lev5fu      mean       lwr      medn       upr
## 1:         2000 FALSE  FALSE 0.4511464 0.4457626 0.4510625 0.4572469
## 2:         2000  TRUE  FALSE 0.4496973 0.4451570 0.4496190 0.4549601
## 3:         2000 FALSE   TRUE 0.6138173 0.6072743 0.6138223 0.6205446
## 4:         2000  TRUE   TRUE 0.5574858 0.5362698 0.5575599 0.5803773
## 5:         3000 FALSE  FALSE 0.4239031 0.4180159 0.4237805 0.4305380
## 6:         3000  TRUE  FALSE 0.4224044 0.4174938 0.4224400 0.4274585
## 7:         3000 FALSE   TRUE 0.6016161 0.5949703 0.6015791 0.6084788
## 8:         3000  TRUE   TRUE 0.5405302 0.5183706 0.5406791 0.5655176
```

I compared the orsf model to a flexible parametric spline-based, non-proportional hazards survival model from the flexsurv package.

```r
library(flexsurv)
spl <- flexsurvspline(Surv(time, status) ~ lev + lev5fu + gamma1(lev) + gamma1(lev5fu),
                      data = colonc, k = 5)
summary(spl, newdata = expand.grid(pd_spec), t = c(2000, 3000), tidy = TRUE) |> 
  arrange(time, lev5fu, lev)

##   time       est       lcl       ucl   lev lev5fu
## 1 2000 0.4418226 0.3930134 0.4876635 FALSE  FALSE
## 2 2000 0.4495771 0.3991124 0.4943800  TRUE  FALSE
## 3 2000 0.6143594 0.5584856 0.6625904 FALSE   TRUE
## 4 2000 0.6207677 0.5397314 0.6971653  TRUE   TRUE
## 5 3000 0.4084440 0.3582102 0.4573942 FALSE  FALSE
## 6 3000 0.4186286 0.3673111 0.4650971  TRUE  FALSE
## 7 3000 0.5871624 0.5340537 0.6394870 FALSE   TRUE
## 8 3000 0.5958252 0.5095980 0.6783956  TRUE   TRUE
```

The predictions agreed for each combination except the one with both factor levels TRUE. This is understandable because these combinations do not appear in the data, and the two methods will be relying on different assumptions to extrapolate to this combination.

At this point I wondered how the intervals in the output were determined, and what do they mean? Do they reflect uncertainty about an expected value, or variability in observed values, or something else? The help page (argument prob_values) explains them as quantiles, but of what? It can’t be quantiles of some subset of the data with that combination of predictors, as there isn’t any such data in the training set. Perhaps this is obvious to someone with more experience with random forests or related ML methods.

Observations from the Introduction vignette

  • The y axis for “partial dependence of bilirubin and edema” should start at 0. I’d also end it at 1, but this is perhaps only necessary if the graph is intended to be compared with other graphs of predicted risk.

  • It wasn’t clear what the package was doing when the pred_horizon argument was not supplied to the functions that did prediction. What horizon is being used? I found that orsf_pd_summary didn’t show in the output what prediction time was being used, in cases where the prediction time wasn’t supplied by the user. Though from orsf_summarize_uni, I deduced this was the median follow-up time in the data. I’d recommend showing good practice in the worked examples, e.g. using a meaningful prediction time that someone doing the analysis might use (e.g. 5 years). As a developer, I’ve found that it’s common for users to do silly things by copying from worked examples in manuals without thinking.

chjackson avatar Jul 18 '22 16:07 chjackson