software-review icon indicating copy to clipboard operation
software-review copied to clipboard

canaper: Categorical Analysis of Neo- And Paleo-Endemism in R

Open joelnitta opened this issue 2 years ago • 40 comments

Submitting Author: @joelnitta Other Package Authors: (delete if none) Shawn Laffan (@shawnlaffan) Repository: https://github.com/joelnitta/canaper Version submitted: 0.0.1 Submission type: Stats Badge grade: silver Editor: @tdhock Reviewers: @KlausVigo, @luismurao

Due date for @KlausVigo: 2021-11-24

Due date for @luismurao: 2021-12-21 Archive: TBD Version accepted: TBD


  • Paste the full DESCRIPTION file inside a code block below:
Package: canaper
Title: Categorical Analysis of Neo- And Paleo-Endemism
Version: 0.0.1
Authors@R: 
    c(
    person(given = "Joel H.",
           family = "Nitta",
           role = c("aut", "cre"),
           email = "[email protected]",
           comment = c(ORCID = "0000-0003-4719-7472")),
    person(given = "Shawn W.",
           family = "Laffan",
           role = c("ctb", "dtc")),
    person(given = "Brent D.",
           family = "Mishler",
           role = c("ctb", "dtc")),
    person(given = "Wataru",
           family = "Iwasaki",
           role = c("ctb"),
           comment = c(ORCID = "0000-0002-9169-9245"))
           )
Description: Provides functions to conduct categorical analysis of neo- and paleo-endemism (CANAPE).
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE, roclets = c ("collate", "namespace", "rd", "roxyglobals::global_roclet", "srr::srr_stats_roclet"))
RoxygenNote: 7.1.2
URL: https://github.com/joelnitta/canaper
BugReports: https://github.com/joelnitta/canaper/issues
Imports: 
    ape,
    assertr,
    assertthat,
    dplyr,
    future.apply,
    phyloregion,
    progressr,
    purrr,
    stats,
    tibble,
    vegan
Suggests: 
    rmarkdown,
    knitr,
    future,
    tictoc,
    patchwork,
    tidyverse,
    testthat (>= 3.0.0),
    roxyglobals (>= 0.2.1),
    stringr,
    magrittr,
    covr,
    picante
Config/testthat/edition: 3
Depends: 
    R (>= 3.5.0)
VignetteBuilder: knitr
Remotes: 
    anthonynorth/roxyglobals

Pre-submission Inquiry

  • [x] A pre-submission inquiry has been approved in issue #469

General Information

  • Who is the target audience and what are scientific applications of this package?

    • Target audience is evolutionary biologists. Scientific application is the analysis of phylogenetic endemism.
  • Paste your responses to our General Standard G1.1 here, describing whether your software is:

    • The first implementation within R of an algorithm which has previously been implemented in other languages or contexts

    • Please include hyperlinked references to all other relevant software: https://github.com/shawnlaffan/biodiverse

  • (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

    • (Not applicable)

Note also there is a package website available: https://joelnitta.github.io/canaper/

Badging

  • What grade of badge are you aiming for? (bronze, silver, gold)

    • Silver, but open to suggestions from the reviewers and editors.
  • If aiming for silver or gold, describe which of the four aspects listed in the Guide for Authors chapter the package fulfils (at least one aspect for silver; three for gold)

    • Demonstrates excellence in compliance with multiple standards from at least two broad sub-categories
    • Compliance with a good number of standards beyond those identified as minimally necessary. I am in compliance with most of the standards for "General" and "Dimensionality Reduction, Clustering, and Unsupervised Learning".

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • [x] Do you intend for this package to go on CRAN?

  • [ ] Do you intend for this package to go on Bioconductor?

    • Also, I intend to submit a paper about this package to Methods in Ecology and Evolution after ROpenSci review.

Code of conduct

  • [x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

joelnitta avatar Oct 26 '21 09:10 joelnitta

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot avatar Oct 26 '21 09:10 ropensci-review-bot

:rocket:

The following problem was found in your submission template:

  • 'author1' variable must be GitHub hanle only ('@myhandle') Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

:wave:

ropensci-review-bot avatar Oct 26 '21 09:10 ropensci-review-bot

Checks for canaper (v0.0.1)

git hash: 43a9f0a2

  • :heavy_check_mark: Package name is available
  • :heavy_check_mark: has a 'CITATION' file.
  • :heavy_check_mark: has a 'codemeta.json' file.
  • :heavy_check_mark: has a 'contributing' file.
  • :heavy_check_mark: uses 'roxygen2'.
  • :heavy_check_mark: 'DESCRIPTION' has a URL field.
  • :heavy_check_mark: 'DESCRIPTION' has a BugReports field.
  • :heavy_check_mark: Package has at least one HTML vignette
  • :heavy_check_mark: All functions have examples.
  • :heavy_check_mark: Package has continuous integration checks.
  • :heavy_check_mark: Package coverage is 100%.
  • :heavy_check_mark: R CMD check found no errors.
  • :heavy_check_mark: R CMD check found no warnings.

Package License: MIT + file LICENSE


1. rOpenSci Statistical Standards (srr package)

This package is in the following category:

  • Dimensionality Reduction, Clustering and Unsupervised Learning

:heavy_check_mark: All applicable standards [v0.1.0] have been documented in this package

Click here to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report() function from within a local clone of the repository.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 11 files) and
  • 1 authors
  • 3 vignettes
  • 8 internal data files
  • 11 imported packages
  • 4 exported functions (median 41 lines of code)
  • 30 non-exported functions in R (median 7 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 11 59.3
files_vignettes 2 83.0
files_tests 8 85.7
loc_R 650 52.9
loc_vignettes 162 64.9
loc_tests 939 83.5
num_vignettes 3 93.1
data_size_total 66301 81.2
data_size_median 509 59.9
n_fns_r 34 35.6
n_fns_r_exported 4 15.6
n_fns_r_not_exported 30 42.6
n_fns_per_file_r 3 42.7
num_params_per_fn 4 67.6
loc_per_fn_r 10 40.0
loc_per_fn_r_exp 41 75.3
loc_per_fn_r_not_exp 7 29.9
rel_whitespace_R 13 43.9
rel_whitespace_vignettes 0 0.0 TRUE
rel_whitespace_tests 7 84.7
doclines_per_fn_exp 55 68.2
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 9 22.0

2a. Network visualisation

Interactive network visualisation of calls between objects in package can be viewed by clicking here


3. goodpractice and other checks

Details of goodpractice and other checks (click to open)

3a. Continuous Integration Badges

github

GitHub Workflow Results

name conclusion sha date
pkgdown success 43a9f0 2021-10-26
R-CMD-check success 43a9f0 2021-10-26
Render codemeta failure 43a9f0 2021-10-26
test-coverage success 43a9f0 2021-10-26

3b. goodpractice results

R CMD check with rcmdcheck

rcmdcheck found no errors, warnings, or notes

Test coverage with covr

Package coverage: 100

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function cyclocomplexity
cpr_rand_test 29
calc_biodiv_random 16

Static code analyses with lintr

lintr found the following 353 potential issues:

message number of times
Lines should not be more than 80 characters. 353

Package Versions

package version
pkgstats 0.0.2.16
pkgcheck 0.0.2.86
srr 0.0.1.120

Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

ropensci-review-bot avatar Oct 26 '21 10:10 ropensci-review-bot

@ropensci-review-bot seeking reviewers

tdhock avatar Oct 27 '21 16:10 tdhock

I'm sorry @tdhock, I'm afraid I can't do that. That's something only editors are allowed to do.

ropensci-review-bot avatar Oct 27 '21 16:10 ropensci-review-bot

@ropensci-review-bot assign @tdhock as editor

noamross avatar Oct 27 '21 16:10 noamross

Assigned! @tdhock is now the editor

ropensci-review-bot avatar Oct 27 '21 16:10 ropensci-review-bot

I forgot to mention, I plan to submit this as a paper to Methods in Ecology and Evolution after ROpenSci review (have also added this to the original description)

joelnitta avatar Oct 28 '21 06:10 joelnitta

@ropensci-review-bot assign @KlausVigo as reviewer

tdhock avatar Oct 29 '21 16:10 tdhock

@tdhock Only just realised the command there has the wrong syntax - should be add <@name> to reviewers, like this example. Could you please try again? Nevertheless good that you got this one wrong, because it made us realise we need the bot to tell you it didn't understand. Thanks!

mpadge avatar Nov 02 '21 10:11 mpadge

@ropensci-review-bot add @KlausVigo to reviewers by the way it would be more user friendly to have the add editor and add reviewer commands be consistent / analogous, I guess that is what you are doing in https://github.com/ropensci-org/buffy/pull/42 ?

tdhock avatar Nov 02 '21 17:11 tdhock

sorry @tdhock, another bot glitch there - bot commands have to be kept as just that, with no additional comments. And yes, the point of the above-linked issue is to aid consistency between analogous commands, so that will hopefully improve soon. In the meantime, could you please try again, again, with just the command? With due apologies and gratitude! :bowing_man:

mpadge avatar Nov 02 '21 18:11 mpadge

@ropensci-review-bot add @KlausVigo to reviewers

tdhock avatar Nov 02 '21 18:11 tdhock

Can't assign reviewer because there is no editor assigned for this submission yet

ropensci-review-bot avatar Nov 02 '21 18:11 ropensci-review-bot

@tdhock sorry for all the hiccups! The initial issue body had been edited after your username was added so your username was no longer in the string <!--editor--> @tdhock<!--end-editor--> I edited it again so you can try again. Thanks for your patience!

maelle avatar Nov 03 '21 12:11 maelle

@ropensci-review-bot add @KlausVigo to reviewers

tdhock avatar Nov 03 '21 16:11 tdhock

@KlausVigo added to the reviewers list. Review due date is 2021-11-24. Thanks @KlausVigo for accepting to review! Please refer to our reviewer guide.

ropensci-review-bot avatar Nov 03 '21 16:11 ropensci-review-bot

@KlausVigo: If you haven't done so, please fill this form for us to update our reviewers records.

ropensci-review-bot avatar Nov 03 '21 16:11 ropensci-review-bot

@ropensci-review-bot add @luismurao to reviewers

tdhock avatar Nov 30 '21 17:11 tdhock

@luismurao added to the reviewers list. Review due date is 2021-12-21. Thanks @luismurao for accepting to review! Please refer to our reviewer guide.

ropensci-review-bot avatar Nov 30 '21 17:11 ropensci-review-bot

@luismurao: If you haven't done so, please fill this form for us to update our reviewers records.

ropensci-review-bot avatar Nov 30 '21 17:11 ropensci-review-bot

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • Briefly describe any working relationship you have (had) with the package authors.
  • [x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • [x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
  • [x] Installation instructions: for the development version of package and any non-standard dependencies in README
  • [x] Vignette(s): demonstrating major functionality that runs successfully locally
  • [x] Function Documentation: for all exported functions
  • [x] Examples: (that run successfully locally) for all exported functions
  • [ ] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • [ ] Installation: Installation succeeds as documented.
  • [x] Functionality: Any functional claims of the software been confirmed.
  • [ ] Performance: Any performance claims of the software been confirmed.
  • [x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • [x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing:

  • ~ 9

  • [ ] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.


Review Comments

  • The canaper package provides functions to infer the evolutionary processes underlying neoendemism and paleoendemism. The software is an R implementation of the Categorical Analysis of Neo- and Paleo-endemism, which is only available in the Biodiverse software.

  • I enjoyed reviewing the package, but I found some things that might improve the package.

    • One concern is about parallelization: in the current stage of the package, parallelization calls are done outside the functions (via future::plan(future::multisession)). I think it would be better to have parameters for parallelization in functions that allow running analysis in parallel (so calls are done inside them).

Parallelization calls as in the current version of the package

library(canaper) # This package :)
library(future) # For parallel computing
library(tictoc) # For timing things
library(patchwork) # For composing mult-part plots
library(tidyverse) # For data-wrangling and plottin

data(acacia)

plan(multisession,workers=ncores)
set.seed(071421)
acacia_rand_res <- cpr_rand_test(
  acacia$comm, acacia$phy, 
  null_model = "curveball",
  n_reps = 100, n_iterations = 100000,
  tbl_out = TRUE)
plan(sequential)

I think this could be more appropriate and easy to use

acacia_rand_res <- cpr_rand_test(
  acacia$comm, acacia$phy, 
  null_model = "curveball",
  n_reps = 100, n_iterations = 100000,
  tbl_out = TRUE
  parallel = TRUE, # Logical var to run processes in parallel
  nworkes = 3, # Number of threads
  random_seed=071421) # Random seed
  • Instead of suggesting the future package I would import it

README

  • No code of conduct and contribution guidelines, please add it

Installation

There is a potential conflict in package dependencies as canaper depends on R (>= 3.5.0), but it imports phangorn version 2.8.1, which depends on R (>= 4.1.0). Please change parameter Depends: R (>= 3.5.0) to Depends: R (>= 4.1.0) in the description file or provide a way to install an earlier version of phangorn (version 2.5.5 should work).

  • I installed the package locally on macOS (Big Sur), Linux (Ubuntu 20.04.3 & 21.04), and Windows 10

Automated test

  • devtools::check() takes more than six minutes. Building of vignette outputs takes a lot of time (5m 5.2s). Consider using lees number of replicates in examples (i.e. n_reps of canaper::cpr_rand_test).
    • I got one error after upgrading from GEOS 3.9 to 3.10 in Linux. The issue is related to phyloregion dependency.
══ Warnings ════════════════════════════════════════════════════════════════════
── Warning (test-utils.R:143:3): Functions copied from phyloregion work ────────
rgeos: versions of GEOS runtime 3.10.1-CAPI-1.16.0
and GEOS at installation 3.9.1-CAPI-1.14.2differ
Backtrace:
	1. testthat::expect_equal(dense2sparse(biod_example$comm), phyloregion::dense2sparse(biod_example$comm)) test-utils.R:143:2
4. base::loadNamespace(x)
7. base::loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]])
8. base:::runHook(".onLoad", env, package.lib, package)
13. rgeos:::fun(libname, pkgname)
  • devtools::test() finished with no fails and one warning
────────────────────────────────────────────────────────────────────
Warning (test-utils.R:143:3): Functions copied from phyloregion work
rgeos: versions of GEOS runtime 3.10.1-CAPI-1.16.0
and GEOS at installation 3.9.1-CAPI-1.14.2differ
Backtrace:
  1. testthat::expect_equal(dense2sparse(biod_example$comm), phyloregion::dense2sparse(biod_example$comm)) test-utils.R:143:2
  4. base::loadNamespace(x)
  7. base::loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]])
  8. base:::runHook(".onLoad", env, package.lib, package)
 13. rgeos:::fun(libname, pkgname)
────────────────────────────────────────────────────────────────────

══ Results ═════════════════════════════════════════════════════════
Duration: 71.8 s

[ FAIL 0 | WARN 1 | SKIP 0 | PASS 255 ]
  • devtools::spell_check() Change "signficance" to "significance" in line 132
signficance       canape.Rmd:132

Performance

When benchmarking, I found a strange behavior. After using more than four threads, the program takes more time to finish a randomization test. Here is the code I used (could you please check it? I might be missing something)

library(purrr)
library(tidyverse)
library(tictoc)
library(future)
library(ggplot2)
rm(list = ls())
data(acacia)
acom <- acacia$comm
acom[acom>0] <- 1
n_reps_vec <- c(100,500,1000,1500,2000)
nwokers <- 1:7
acom <- acacia$comm
acom[acom>0] <- 1

rand_times <- seq_along(n_reps_vec) %>% purrr::map_dfr(function(nrep){
  rand_t <- nwokers %>% purrr::map_dfr(function(nwork){
      if(nwork>1){
        plan(multisession,workers=nwork)
      }
      set.seed(071421)
      tictoc::tic() 
      # Run randomization test
      acacia_rand_res <- canaper::cpr_rand_test(
        acom, acacia$phy,
        null_model = "curveball",
        n_reps = n_reps_vec[nrep],
        n_iterations = 100000,
        tbl_out = TRUE)
      aa <- tictoc::toc()
      r1 <-data.frame(n_reps = n_reps_vec[nrep] ,
                      nworkers=nwork,
                      time_secs = aa$toc - aa$tic)
      plan(sequential)
      return(r1)
    })
  return(rand_t)
})

ggplot(rand_times, aes(x = ncores,y=time_secs,color=as.factor(n_reps))) +
  geom_line() + theme_classic()

Questions and comments

  • I was wondering if canaper has already all methods of the CANAPE version of Biodiverse. Could write something about it in the README?
  • Although the vignettes are very useful and well documented, it would be nice to have methods for plotting spatial data results.

@joelnitta please do not hesitate to contact me if you have any questions. Thanks for developing canaper.

luismurao avatar Dec 29 '21 19:12 luismurao

@luismurao thanks so much for your review! I will try to address your comments soon.

joelnitta avatar Dec 29 '21 22:12 joelnitta

@KlausVigo it would be great to have your comments as well so I can make changes incorporating both sets of reviews.

joelnitta avatar Dec 30 '21 09:12 joelnitta

@KlausVigo, do you still intend to review this package?

If not, @tdhock can you please assign a new reviewer, as it is now multiple months past the due date.

Thanks!

joelnitta avatar Feb 14 '22 12:02 joelnitta

Hi @joelnitta! I will finish it today.

KlausVigo avatar Feb 14 '22 12:02 KlausVigo

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • Briefly describe any working relationship you have (had) with the package authors.
  • [x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • [x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
  • [x] Installation instructions: for the development version of package and any non-standard dependencies in README
  • [x] Vignette(s): demonstrating major functionality that runs successfully locally
  • [x] Function Documentation: for all exported functions
  • [x] Examples: (that run successfully locally) for all exported functions
  • [ ] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • [x] Installation: Installation succeeds as documented.
  • [x] Functionality: Any functional claims of the software been confirmed.
  • [ ] Performance: Any performance claims of the software been confirmed.
  • [x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • [ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 12

  • [x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

  • The canaper package provides functions to conduct categorical analysis of neo- and paleo-endemism (CANAPE). It is an reimplementation of the Biodiverse software, but there are some functions in the picante (ses.pd) and in the PhyloMeasures package for Faith's phylogenetic diversity with a similar scope.

  • The canaper package builds on functions from the vegan package to resample (dense) community matrices and borrows biodiversity measures (Faith's phylogenetic diversity and phylogenetic endemism) from the phyloregion package. The picante package also has some functions for swapping (community) matrices similar to vegan. A comparison between the packages would be useful.

Specific comments

  • The functions from the phyloregion package used in the package should be acknowledged properly. There is a reference in the README and in the source that the biodiversity measure are from the phyloregion package. However tools like depsy which trace back contributions will likely check only the DESCRIPTION file for imported packages, contributions etc.. If functions are copied the CRAN Repository Policy suggests to acknowledge contributors (ctb) in the Author field. I opened an issue to improve the development guide (https://github.com/ropensci/dev_guide/issues/388) with some more additional information. Copying functions can additionally cause copyright problems. An alternative is importing the functions from the package directly.

  • In contrast to the other reviewer I really like the use of the future package. Parallelization is dependent on the hardware, the operation system and on other things like if the functions run in a terminal or GUI. To avoid crashes this needs lots of internal checks and guessing and often will not lead to the best result. I prefer to give users the power to adopt the parallelization to their specific needs.

Profiling

In my opinion the main bottleneck for this kind of analysis is memory consumption and only minor part computing time. Dense community matrices are now frequently getting too large to be hold in memory (RAM). Luckily community matrices are usually extremely sparse and this allows efficient coding. For example the acacia data set, which comes with canaper, has less than 2% non-zero cells and a sparse representation takes only 1/20 of memory of the dense representation.

> library(canaper)
> library(Matrix)
> data("acacia")
> dim(acacia$comm)
[1] 3037  506
> comm <- as.matrix(acacia$comm)
> sum(comm > 0) / prod(dim(comm))
[1] 0.01898522
> object.size(comm)
12597112 bytes
> object.size(Matrix(comm, sparse=TRUE))
656576 bytes

The functions to compute the biodiversity measures make use of the sparsity, but the functions to sample the matrices not. We developed the phyloregion using sparse matrices which resulted in a far lower memory consumption and as a benefit it turned out faster in many cases (https://phyloregion.com/articles/Benchmark.html). With larger data sets we get into a trade-off. We want to use more cores, but this means several copies of the community matrices in memory. Providing a swapping algorithm which can handle sparse matrices would be a huge improvement for the community.

I used the following code for profiling:

library(canaper)
data(acacia)
Rprof(tmp <- tempfile(), memory.profiling = TRUE)

acacia_rand_res <- cpr_rand_test(
	acacia$comm, acacia$phy,
	null_model = "curveball",
	n_reps = 100,
	n_iterations = 100000)

Rprof()
summaryRprof(tmp, memory="both")
unlink(tmp)

I just highlight a few lines of the output I find interesting:

...
"calc_biodiv_random"              111.54     92.35   29477.9      0.02     0.02
"cpr_rand_comm"                    99.56     82.43   22420.0      1.30     1.08
"simulate.nullmodel"               83.18     68.87   11655.6      1.24     1.03
...
"phylo_community"                   5.72      4.74    2994.7      1.60     1.32
"phylo_endemism"                    5.14      4.26    2876.1      0.02     0.02
"assertthat::assert_that"           4.66      3.86    1867.0      0.10     0.08
...
"PD"                                2.56      2.12    1656.1      0.00     0.00	
...
  • Not unexpected the largest amount of time is spent in re-sampling the community matrices (simulate.nullmodel).

  • The canaper package uses defensive programming extensively (calling assertthat::assert_that), which is generally a very good thing. However, here are some exceptions: The functions calc_biodiv_random and cpr_rand_comm are only called internally and are called thousands of times inside of the loop. It should be save to remove all (most) of the calls to assertthat::assert_that from these functions as parameters have been checked already and are save to use! In fact the algorithm spent a similar amount of time in assertthat::assert_that than computing the biodiversity measures (phylo_endemism, PD).

  • When computing several biodiversity metrics at once a lot of intermediate results are shared and could be reused. This is not as important and the current approach allows to easily extend the cpr_rand_test function with additional biodiversity measures.

For submitting to CRAN:

There are some instances where more than 2 cores are used, e.g. the vignette canape.Rmd, line 83:

plan(multisession, workers = 3)

This could fail on CRAN, when submitting. There the use of maximal 2 cores is allowed and I speak from experience, I got Ripleyed for this once! Again the CRAN Repository Policy is a useful resource and gives some background information: "Checking the package should take as little CPU time as possible, as the CRAN check farm is a very limited resource and there are thousands of packages. Long-running tests and vignette code can be made optional for checking, but do ensure that the checks that are left do exercise all the features of the package. If running a package uses multiple threads/cores it must never use more than two simultaneously: the check farm is a shared resource and will typically be running many checks simultaneously. Examples should run for no more than a few seconds each: they are intended to exemplify to the would-be user how to use the functions in the package."

KlausVigo avatar Feb 21 '22 16:02 KlausVigo

@KlausVigo thank you very much for the extensive and helpful review!

I just wanted to address one thing first: the reason I copied functions from phyloregion instead of importing them is because of a dependency of phyloregion (betapart), which has a package in Suggests (snow) that was resetting the random number generator and thus cpr_rand_test() was failing to return identical results when using parallel computing, even when set.seed() was used. There is an issue describing the problem in canaper, and a related discussion in the future package.

In the end, I opted to copy the phyloregion functions under the AGPL-3 license, since that fixed the issue and resulted in fewer dependencies. As @HenrikBengtsson mentioned though, it seems another way to fix this (and import phyloregion normally) would be to use future.packages = "snow" with future.apply::future_lapply(). Would you prefer that?

(By the way, it is possible that this behavior of snow may cause reproducibility issues with phyloregion as well. I suggest you look into it if you haven't already).

joelnitta avatar Feb 21 '22 20:02 joelnitta

hi @KlausVigo thanks for your detailed review. can you please tell us approximately how many hours you spent on reviewing? (you left that field blank in your review comment)

tdhock avatar Mar 02 '22 20:03 tdhock

@ropensci-review-bot submit review https://github.com/ropensci/software-review/issues/475#issuecomment-1002754119 time 9

tdhock avatar Mar 02 '22 20:03 tdhock