
aorsf: accelerated oblique random survival forests

Open bcjaeger opened this issue 2 years ago • 85 comments

Submitting Author Name: Byron C Jaeger
Submitting Author Github Handle: @bcjaeger
Other Package Authors Github handles: @nmpieyeskey, @sawyerWeld
Repository: https://github.com/bcjaeger/aorsf
Version submitted: 0.0.0.9000
Submission type: Stats
Badge grade: gold
Editor: @tdhock
Reviewers: @chjackson, @mnwright, @jemus42

Due date for @chjackson: 2022-07-29
Due date for @mnwright: 2022-09-21
Due date for @jemus42: 2022-09-21
Archive: TBD
Version accepted: TBD
Language: en

  • Paste the full DESCRIPTION file inside a code block below:
```
Package: aorsf
Title: Accelerated Oblique Random Survival Forests
Version: 0.0.0.9000
Authors@R: c(
    person(given = "Byron",
           family = "Jaeger",
           role = c("aut", "cre"),
           email = "[email protected]",
           comment = c(ORCID = "0000-0001-7399-2299")),
    person(given = "Nicholas",  family = "Pajewski", role = "ctb"),
    person(given = "Sawyer", family = "Welden", role = "ctb", email = "[email protected]")
    )
Description: Fit, interpret, and make predictions with oblique random
    survival forests. Oblique decision trees are notoriously slow compared
    to their axis based counterparts, but 'aorsf' runs as fast or faster than 
    axis-based decision tree algorithms for right-censored time-to-event 
    outcomes.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE, roclets = c("namespace", "rd", "srr::srr_stats_roclet"))
RoxygenNote: 7.1.2
LinkingTo: 
    Rcpp,
    RcppArmadillo
Imports: 
    table.glue,
    Rcpp,
    data.table
URL: https://github.com/bcjaeger/aorsf,
    https://bcjaeger.github.io/aorsf
BugReports: https://github.com/bcjaeger/aorsf/issues
Depends: 
    R (>= 3.6)
Suggests: 
    survival,
    survivalROC,
    ggplot2,
    testthat (>= 3.0.0),
    knitr,
    rmarkdown,
    glmnet,
    covr,
    units
Config/testthat/edition: 3
VignetteBuilder: knitr
```

Pre-submission Inquiry

  • [X] A pre-submission inquiry has been approved in issue #525

General Information

  • Who is the target audience and what are scientific applications of this package?

Target audience: people who want to fit and interpret a risk prediction model, i.e., a prediction model for right-censored time-to-event outcomes.

Applications: fit an oblique random survival forest, compute predicted risk at a given time, estimate the importance of individual variables, and compute partial dependence to depict relationships between specific predictors and predicted risk.

  • Paste your responses to our General Standard G1.1 here, describing whether your software is:

    • An improvement on other implementations of similar algorithms in R.

    Please include hyperlinked references to all other relevant software.

    The obliqueRSF package was the original R package for oblique random survival forests. I wrote it, and it is very slow. I wrote aorsf because I had multiple ideas about how to make obliqueRSF faster, chiefly using a partial Newton-Raphson algorithm instead of penalized regression to derive linear combinations of variables in decision nodes. It would have been possible to rewrite obliqueRSF, but it would have been difficult to make the rewrite backwards compatible with the version of obliqueRSF on CRAN.

  • (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Not applicable

Badging

Gold

  1. Compliance with a good number of standards beyond those identified as minimally necessary. aorsf complies with over 100 combined standards in the general and ML categories.
  2. Demonstrating excellence in compliance with multiple standards from at least two broad sub-categories. See 1. above
  3. Internal aspects of package structure and design. aorsf uses an optimized routine to partially complete Newton-Raphson scoring for the Cox proportional hazards model and another optimized routine to compute likelihood ratio tests. Both routines are used heavily when fitting oblique random survival forests, and both produce exactly the same answers as the corresponding functions in the survival package (see tests in aorsf) while running at least twice as fast (thanks to RcppArmadillo).
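For readers less familiar with the scoring routine mentioned above, a single Newton-Raphson update for the Cox partial log-likelihood can be sketched in standard textbook notation (these symbols are illustrative, not aorsf's internal names):

```latex
% One Newton-Raphson update for the Cox partial log-likelihood \ell(\beta),
% where U(\beta) is the score vector (gradient of \ell) and I(\beta) is the
% observed information matrix (negative Hessian of \ell):
\beta^{(k+1)} = \beta^{(k)} + I\!\left(\beta^{(k)}\right)^{-1} U\!\left(\beta^{(k)}\right)
```

A "partial" scoring routine runs only a small, fixed number of these updates per decision node instead of iterating to convergence, which is plausibly where much of the speed-up over penalized regression comes from.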

Technical checks

Confirm each of the following by checking the box.

I think aorsf is passing autotest and srr_stats_pre_submit(). I am having some issues running these on R 4.2. Currently, autotest is returning NULL, which I understand to be a good thing, and srr_stats_pre_submit is not able to run (not sure why; but it was fine before I updated to R 4.2).

This package:

Publication options

  • [X] Do you intend for this package to go on CRAN?
  • [ ] Do you intend for this package to go on Bioconductor?

Code of conduct

  • [X] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

bcjaeger avatar Apr 28 '22 00:04 bcjaeger

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot avatar Apr 28 '22 00:04 ropensci-review-bot

:rocket:

The following problem was found in your submission template:

  • 'statsgrade' variable must be one of [bronze, silver, gold]

Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

:wave:

ropensci-review-bot avatar Apr 28 '22 00:04 ropensci-review-bot

> 🚀
>
> The following problem was found in your submission template:
>
>   • 'statsgrade' variable must be one of [bronze, silver, gold]
>
> Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.
>
> 👋

Just updated to fix this. I'm not sure if reviewers will think that 'gold' is the right statsgrade, but might as well aim high.

bcjaeger avatar Apr 28 '22 00:04 bcjaeger

Checks for aorsf (v0.0.0.9000)

git hash: c73bb98c

  • :heavy_check_mark: Package name is available
  • :heavy_check_mark: has a 'codemeta.json' file.
  • :heavy_check_mark: has a 'contributing' file.
  • :heavy_check_mark: uses 'roxygen2'.
  • :heavy_check_mark: 'DESCRIPTION' has a URL field.
  • :heavy_check_mark: 'DESCRIPTION' has a BugReports field.
  • :heavy_check_mark: Package has at least one HTML vignette
  • :heavy_check_mark: All functions have examples.
  • :heavy_check_mark: Package has continuous integration checks.
  • :heavy_check_mark: Package coverage is 97.1%.
  • :heavy_check_mark: R CMD check found no errors.
  • :heavy_check_mark: R CMD check found no warnings.

Package License: MIT + file LICENSE


1. rOpenSci Statistical Standards (srr package)

This package is in the following category:

  • Machine Learning

:heavy_check_mark: All applicable standards [v0.1.0] have been documented in this package (102 complied with; 56 N/A standards)

Click to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report() function from within a local clone of the repository.


2. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

| type       | package       | ncalls |
|------------|---------------|--------|
| internal   | base          | 341    |
| internal   | aorsf         | 152    |
| internal   | utils         | 13     |
| internal   | stats         | 11     |
| internal   | methods       | 1      |
| imports    | table.glue    | 3      |
| imports    | Rcpp          | NA     |
| imports    | data.table    | NA     |
| suggests   | glmnet        | 1      |
| suggests   | survival      | NA     |
| suggests   | survivalROC   | NA     |
| suggests   | ggplot2       | NA     |
| suggests   | testthat      | NA     |
| suggests   | knitr         | NA     |
| suggests   | rmarkdown     | NA     |
| suggests   | covr          | NA     |
| suggests   | units         | NA     |
| linking_to | Rcpp          | NA     |
| linking_to | RcppArmadillo | NA     |

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

attr (40), names (35), for (23), length (21), c (20), paste (16), list (14), paste0 (12), mode (10), seq_along (10), vector (10), which (9), as.matrix (8), rep (7), as.integer (6), drop (6), seq (6), switch (6), order (5), match (4), min (4), setdiff (4), colnames (3), inherits (3), ncol (3), Sys.time (3), all (2), any (2), as.factor (2), cbind (2), data.frame (2), grepl (2), lapply (2), levels (2), matrix (2), nchar (2), nrow (2), rev (2), rle (2), row.names (2), sum (2), suppressWarnings (2), try (2), all.vars (1), as.data.frame (1), class (1), deparse (1), formals (1), grep (1), if (1), ifelse (1), intersect (1), is.na (1), max (1), mean (1), print (1), return (1), rownames (1), sapply (1), t (1), typeof (1), unique (1)

aorsf

paste_collapse (8), fctr_info (5), get_fctr_info (5), get_names_x (5), f_oobag_eval (3), get_n_obs (3), get_n_tree (3), get_names_y (3), get_numeric_bounds (3), last_value (3), orsf_fit (3), ref_code (3), unit_info (3), check_var_types (2), get_f_oobag_eval (2), get_importance (2), get_leaf_min_events (2), get_leaf_min_obs (2), get_max_time (2), get_mtry (2), get_n_events (2), get_n_leaves_mean (2), get_oobag_eval_every (2), get_type_oobag_eval (2), get_unit_info (2), is_empty (2), list_init (2), orsf_control_net (2), orsf_pd_summary (2), orsf_train_ (2), select_cols (2), check_arg_bound (1), check_arg_gt (1), check_arg_gteq (1), check_arg_is (1), check_arg_is_integer (1), check_arg_is_valid (1), check_arg_length (1), check_arg_lt (1), check_arg_lteq (1), check_arg_type (1), check_arg_uni (1), check_control_cph (1), check_control_net (1), check_new_data_fctrs (1), check_new_data_names (1), check_new_data_types (1), check_oobag_fun (1), check_orsf_inputs (1), check_pd_inputs (1), check_predict (1), check_units (1), contains_oobag (1), contains_vi (1), f_beta (1), fctr_check (1), fctr_check_levels (1), fctr_id_check (1), get_cph_do_scale (1), get_cph_eps (1), get_cph_iter_max (1), get_cph_method (1), get_cph_pval_max (1), get_f_beta (1), get_n_retry (1), get_n_split (1), get_net_alpha (1), get_net_df_target (1), get_oobag_pred (1), get_oobag_time (1), get_orsf_type (1), get_split_min_events (1), get_split_min_obs (1), get_tree_seeds (1), get_types_x (1), insert_vals (1), is_aorsf (1), is_error (1), is_trained (1), leaf_kaplan_testthat (1), lrt_multi_testthat (1), newtraph_cph_testthat (1), oobag_c_harrell_testthat (1), orsf (1), orsf_control_cph (1), orsf_oob_vi (1), orsf_pd_ (1), orsf_pd_ice (1), orsf_pred_multi (1), orsf_pred_uni (1), orsf_scale_cph (1), orsf_summarize_uni (1), orsf_time_to_train (1), orsf_train (1), orsf_unscale_cph (1), orsf_vi_ (1), x_node_scale_exported (1)

utils

data (13)

stats

formula (4), dt (2), terms (2), family (1), time (1), weights (1)

table.glue

round_spec (1), round_using_magnitude (1), table_value (1)

glmnet

glmnet (1)

methods

new (1)


3. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in C++ (48% in 2 files) and R (52% in 21 files)
  • 1 author
  • 3 vignettes
  • 1 internal data file
  • 3 imported packages
  • 15 exported functions (median 15 lines of code)
  • 216 non-exported functions in R (median 3 lines of code)
  • 48 C++ functions (median 38 lines of code)

Statistical properties of package structure, as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function.

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

| measure                  | value | percentile | noteworthy |
|--------------------------|-------|------------|------------|
| files_R                  | 21    | 82.3       |            |
| files_src                | 2     | 79.1       |            |
| files_vignettes          | 3     | 92.4       |            |
| files_tests              | 18    | 95.7       |            |
| loc_R                    | 2139  | 85.4       |            |
| loc_src                  | 1982  | 75.6       |            |
| loc_vignettes            | 342   | 67.9       |            |
| loc_tests                | 1532  | 91.7       |            |
| num_vignettes            | 3     | 94.2       |            |
| data_size_total          | 9034  | 70.4       |            |
| data_size_median         | 9034  | 78.5       |            |
| n_fns_r                  | 231   | 91.6       |            |
| n_fns_r_exported         | 15    | 58.5       |            |
| n_fns_r_not_exported     | 216   | 94.0       |            |
| n_fns_src                | 48    | 66.0       |            |
| n_fns_per_file_r         | 7     | 77.6       |            |
| n_fns_per_file_src       | 24    | 94.8       |            |
| num_params_per_fn        | 3     | 33.6       |            |
| loc_per_fn_r             | 3     | 1.1        | TRUE       |
| loc_per_fn_r_exp         | 15    | 35.6       |            |
| loc_per_fn_r_not_exp     | 3     | 1.5        | TRUE       |
| loc_per_fn_src           | 38    | 88.9       |            |
| rel_whitespace_R         | 50    | 96.1       | TRUE       |
| rel_whitespace_src       | 58    | 89.6       |            |
| rel_whitespace_vignettes | 56    | 83.5       |            |
| rel_whitespace_tests     | 45    | 96.8       | TRUE       |
| doclines_per_fn_exp      | 46    | 58.1       |            |
| doclines_per_fn_not_exp  | 0     | 0.0        | TRUE       |
| fn_call_network_size     | 364   | 93.5       |            |

3a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


4. goodpractice and other checks

Details of goodpractice and other checks (click to open)

3a. Continuous Integration Badges

R-CMD-check pkgcheck

GitHub Workflow Results

| name                       | conclusion | sha    | date       |
|----------------------------|------------|--------|------------|
| Commands                   | skipped    | 069021 | 2022-04-22 |
| pages build and deployment | success    | c2efe9 | 2022-04-28 |
| pkgcheck                   | success    | c73bb9 | 2022-04-28 |
| pkgdown                    | success    | c73bb9 | 2022-04-28 |
| R-CMD-check                | success    | c73bb9 | 2022-04-28 |
| test-coverage              | success    | c73bb9 | 2022-04-28 |

3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following note:

  1. checking installed package size ... NOTE
     installed size is 6.9Mb
     sub-directories of 1Mb or more: libs 6.0Mb

R CMD check generated the following check_fails:

  1. no_import_package_as_a_whole
  2. rcmdcheck_reasonable_installed_size

Test coverage with covr

Package coverage: 97.13

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

| function             | cyclocomplexity |
|----------------------|-----------------|
| orsf                 | 33              |
| check_orsf_inputs    | 28              |
| orsf_pd_             | 22              |
| ref_code             | 17              |
| check_new_data_names | 15              |
| check_predict        | 15              |

Static code analyses with lintr

lintr found the following 223 potential issues:

| message                                                       | number of times |
|---------------------------------------------------------------|-----------------|
| Avoid using sapply, consider vapply instead, that's type safe | 10              |
| Lines should not be more than 80 characters.                  | 182             |
| Use <-, not =, for assignment.                                | 31              |

Package Versions

| package  | version   |
|----------|-----------|
| pkgstats | 0.0.4.30  |
| pkgcheck | 0.0.3.11  |
| srr      | 0.0.1.149 |

Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

ropensci-review-bot avatar Apr 28 '22 10:04 ropensci-review-bot

@ropensci-review-bot assign @tdhock as editor

jooolia avatar May 04 '22 18:05 jooolia

Assigned! @tdhock is now the editor

ropensci-review-bot avatar May 04 '22 18:05 ropensci-review-bot

@ropensci-review-bot seeking reviewers

tdhock avatar May 09 '22 17:05 tdhock

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/532_status.svg)](https://github.com/ropensci/software-review/issues/532)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

ropensci-review-bot avatar May 09 '22 17:05 ropensci-review-bot

> Please add this badge to the README of your package repository:
>
> [![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/532_status.svg)](https://github.com/ropensci/software-review/issues/532)
>
> Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

Done! Looking forward to the review.

bcjaeger avatar May 10 '22 18:05 bcjaeger

Hi @tdhock,

Thank you for acting as the editor of this submission. I have recently added a paper.md file to the repository in hopes that aorsf can be eligible for an expedited review at the Journal of Open Source Software after the ropensci review. Can you let me know if any other files are needed or if other actions are needed from me to qualify for an expedited review at JOSS? Thank you!

bcjaeger avatar May 16 '22 11:05 bcjaeger

I'm not sure about the expedited JOSS review. You may ask @noamross or @mpadge.

tdhock avatar May 16 '22 17:05 tdhock

@bcjaeger The paper.md file should be sufficient. For JOSS submission, after RO review is complete you should submit to JOSS and link to this review thread in the pre-review JOSS thread. JOSS editors will review the paper.md and can opt to use RO's software reviews rather than finding additional reviewers.

noamross avatar May 16 '22 17:05 noamross

Thanks, @noamross! That makes sense.

bcjaeger avatar May 16 '22 20:05 bcjaeger

Hi @tdhock, I see we are still seeking reviewers. Is there anything I can do to help find reviewers for this submission?

bcjaeger avatar Jun 07 '22 12:06 bcjaeger

I have asked a few people to review, but no one has agreed yet. Do you have any ideas for potential reviewers to ask?

tdhock avatar Jun 14 '22 19:06 tdhock

Thanks! Assuming folks in the rOpenSci circle have been asked already, I'll offer some names (username) that may not have been asked yet but would be good reviewers for aorsf.

Terry Therneau (therneau) would be a good reviewer - a lot of the C code in aorsf is based on his survival package coxph routine.

Torsten Hothorn (thothorn) would also be a good reviewer - Torsten is the author of the party package and I'd like aorsf to look like the party package in 10 years.

Hannah Frick (hfrick), Emil Hvitfeldt (EmilHvitfeldt), Max Kuhn (topepo), Davis Vaughan (DavisVaughan), and Julia Silge (juliasilge) would all be good reviewers - they are all developers/contributors to the censored package, and I'd like aorsf to contribute to that package.

Raphael Sonabend (RaphaelS1), Andreas Bender (adibender), Michel Lang (mllg), and Patrick Schratz (pat-s) would all be good reviewers - they are developers/contributors to the mlr3proba package, and I'd like aorsf to contribute to that package.

bcjaeger avatar Jun 16 '22 11:06 bcjaeger

Hi @tdhock, I am thinking about reaching out to the folks I listed in my last comment. Before I try to reach them, I was wondering if you had already contacted them? If you have, I will not bother them again with a review request.

bcjaeger avatar Jul 02 '22 02:07 bcjaeger

Hi @bcjaeger actually I have not asked any of those reviewers yet, sorry! I just got back to the office from traveling. Yes that would be helpful if you could ask them to review. (although it is suggested to not use more than one of them, https://devguide.ropensci.org/editorguide.html?q=reviewers#where-to-look-for-reviewers)

tdhock avatar Jul 06 '22 17:07 tdhock

Hi @tdhock, thanks! I will reach out to the folks in my earlier post and let you know if anyone is willing to review aorsf. Hope you had a great trip!

bcjaeger avatar Jul 06 '22 17:07 bcjaeger

@ropensci-review-bot add @chjackson to reviewers

tdhock avatar Jul 08 '22 17:07 tdhock

@chjackson added to the reviewers list. Review due date is 2022-07-29. Thanks @chjackson for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot avatar Jul 08 '22 17:07 ropensci-review-bot

@chjackson: If you haven't done so, please fill this form for us to update our reviewers records.

ropensci-review-bot avatar Jul 08 '22 17:07 ropensci-review-bot

@tdhock Is this the right link for the reviewer guide: (https://devguide.ropensci.org/reviewerguide.html). The link in the bot post gives a 404.

chjackson avatar Jul 08 '22 17:07 chjackson

@nicholasjhorton yes thank you very much for agreeing to review.

tdhock avatar Jul 08 '22 17:07 tdhock

@nicholasjhorton and @chjackson, thank you for agreeing to review!! I am thrilled and looking forward to talking about aorsf with you.

bcjaeger avatar Jul 09 '22 11:07 bcjaeger

I'm not sure that I was asked to review. Can you please resend the invite @tdhock?

nicholasjhorton avatar Jul 09 '22 16:07 nicholasjhorton

@ropensci-review-bot add @nicholasjhorton to reviewers

tdhock avatar Jul 11 '22 17:07 tdhock

@nicholasjhorton added to the reviewers list. Review due date is 2022-08-01. Thanks @nicholasjhorton for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot avatar Jul 11 '22 17:07 ropensci-review-bot

@nicholasjhorton: If you haven't done so, please fill this form for us to update our reviewers records.

ropensci-review-bot avatar Jul 11 '22 17:07 ropensci-review-bot

Package Review

  • Briefly describe any working relationship you have (had) with the package authors.

    • None
  • ☒ As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need: clearly stating problems the software is designed to solve and its target audience in README

  • Installation instructions: for the development version of package and any non-standard dependencies in README

  • Vignette(s): demonstrating major functionality that runs successfully locally

  • Function Documentation: for all exported functions

    • I built and installed the github package locally, and did help(package="aorsf"). I couldn’t see a help("aorsf-package") from R - is this because it is labelled with keyword “internal”? I don’t know if a package overview help page is strictly necessary given the DESCRIPTION and README files, but I sometimes find them helpful (is this the “multiple points of entry” principle?).
  • Examples: (that run successfully locally) for all exported functions

    • There are currently no examples for orsf or orsf_train on the website, but these appeared in the local R help. All the others ran with no problem.
  • Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

    • I couldn’t find anything about community / contribution guidelines.

Functionality

  • Installation: Installation succeeds as documented.

  • Functionality: Any functional claims of the software have been confirmed.

  • Performance: Any performance claims of the software have been confirmed.

    • Given my lack of machine learning experience I don’t think I’m qualified to judge relative performance of different ML packages - I wouldn’t spot if they were using different assumptions that would cloud the comparison.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.

    • Looks fairly comprehensive on a quick look through, and they passed.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

    • I didn’t have time to go through every single point made in these guidelines, but what I saw of the package made me confident that it does what a well-organised and accessible package should do.

Estimated hours spent reviewing: 5

  • ☒ Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer (“rev” role) in the package DESCRIPTION file.

Review Comments

While I have a lot of experience with parametric statistical modelling of survival data, I am not knowledgeable about machine learning, so I can’t go into depth on the ML methods side of things. That said, the vignettes explained clearly what the method is supposed to do - I felt like I learnt some useful things from them.

As far as I can tell, this is a polished, easily accessible package that implements a widely useful method. The differences and advances from other packages and methods are explained concisely.

I tried the package out by using it on a survival dataset that I had been using to test some of my own work, the colon data from the survival package. This is a simple randomised trial that compares time to cancer recurrence or death for three treatment groups, and I considered treatment as the only predictor. This may not be an ideal application for a random forest model, but hopefully it is helpful to see how the method does on this case.

```r
library(survival)
library(dplyr)
colonc <- colon |> 
  filter(etype == 1) |>
  mutate(lev = factor(rx=="Lev"),
         lev5fu = factor(rx=="Lev+5FU"))
```

When using orsf with a single factor predictor, I got "Error: formula must have at least 2 predictors". Perhaps it would be helpful to explain the restriction to 2+ predictors somewhere, though I guess that it is rare that people will want to use a machine learning method with only one predictor.

```r
fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ rx)
```

When I recoded the data so the formula included two binary predictors, it worked. Though I realise that in this case the data contain only three out of the four possible combinations of the two binary variables, so it is impossible to identify whether there is an interaction.

I then calculated predicted survival at 2000 and 3000 days, under the two methods, for each of the four combinations of predictors.

```r
library(aorsf)
fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ lev + lev5fu)
fit <- orsf(data_train = colonc, formula = Surv(time, status) ~ lev + lev5fu, n_tree = 100, n_split = 3)
pd_spec <- list(lev = c("FALSE","TRUE"), 
                lev5fu = c("FALSE","TRUE"))
pd_data <- orsf_pd_summary(object = fit, pd_spec = pd_spec,
                           pred_horizon = c(2000, 3000),
                           oobag = TRUE, expand_grid = TRUE, 
                           risk = FALSE)
pd_data

##    pred_horizon   lev lev5fu      mean       lwr      medn       upr
## 1:         2000 FALSE  FALSE 0.4511464 0.4457626 0.4510625 0.4572469
## 2:         2000  TRUE  FALSE 0.4496973 0.4451570 0.4496190 0.4549601
## 3:         2000 FALSE   TRUE 0.6138173 0.6072743 0.6138223 0.6205446
## 4:         2000  TRUE   TRUE 0.5574858 0.5362698 0.5575599 0.5803773
## 5:         3000 FALSE  FALSE 0.4239031 0.4180159 0.4237805 0.4305380
## 6:         3000  TRUE  FALSE 0.4224044 0.4174938 0.4224400 0.4274585
## 7:         3000 FALSE   TRUE 0.6016161 0.5949703 0.6015791 0.6084788
## 8:         3000  TRUE   TRUE 0.5405302 0.5183706 0.5406791 0.5655176
```

I compared the orsf model to a flexible parametric spline-based, non-proportional hazards survival model from the flexsurv package.

```r
library(flexsurv)
spl <- flexsurvspline(Surv(time, status) ~ lev + lev5fu + gamma1(lev) + gamma1(lev5fu),
                      data = colonc, k = 5)
summary(spl, newdata = expand.grid(pd_spec), t = c(2000, 3000), tidy = TRUE) |> 
  arrange(time, lev5fu, lev)

##   time       est       lcl       ucl   lev lev5fu
## 1 2000 0.4418226 0.3930134 0.4876635 FALSE  FALSE
## 2 2000 0.4495771 0.3991124 0.4943800  TRUE  FALSE
## 3 2000 0.6143594 0.5584856 0.6625904 FALSE   TRUE
## 4 2000 0.6207677 0.5397314 0.6971653  TRUE   TRUE
## 5 3000 0.4084440 0.3582102 0.4573942 FALSE  FALSE
## 6 3000 0.4186286 0.3673111 0.4650971  TRUE  FALSE
## 7 3000 0.5871624 0.5340537 0.6394870 FALSE   TRUE
## 8 3000 0.5958252 0.5095980 0.6783956  TRUE   TRUE
```

The predictions agreed for each combination except the one with both factor levels TRUE. This is understandable because these combinations do not appear in the data, and the two methods will be relying on different assumptions to extrapolate to this combination.

At this point I wondered how the intervals in the output were determined, and what do they mean? Do they reflect uncertainty about an expected value, or variability in observed values, or something else? The help page (argument prob_values) explains them as quantiles, but of what? It can’t be quantiles of some subset of the data with that combination of predictors, as there isn’t any such data in the training set. Perhaps this is obvious to someone with more experience with random forests or related ML methods.

Observations from the Introduction vignette

  • The y axis for “partial dependence of bilirubin and edema” should start at 0. I’d also end it at 1, but this is perhaps only necessary if the graph is intended to be compared with other graphs of predicted risk.

  • It wasn’t clear what the package was doing when the pred_horizon argument was not supplied to the functions that did prediction. What horizon is being used? I found that orsf_pd_summary didn’t show in the output what prediction time was being used, in cases where the prediction time wasn’t supplied by the user. Though from orsf_summarize_uni, I deduced this was the median follow-up time in the data. I’d recommend showing good practice in the worked examples, e.g. using a meaningful prediction time that someone doing the analysis might use (e.g. 5 years). As a developer, I’ve found that it’s common for users to do silly things by copying from worked examples in manuals without thinking.

chjackson avatar Jul 18 '22 16:07 chjackson