
Submitting LBDiscover package

Open chaoliu-cl opened this issue 3 months ago • 13 comments

Submitting Author Name: Chao Liu
Submitting Author Github Handle: @chaoliu-cl
Other Package Authors Github handles: (comma separated, delete if none)
Repository: https://github.com/chaoliu-cl/LBDiscover
Version submitted:
Submission type: Standard
Editor: TBD
Reviewers: TBD

Archive: TBD
Version accepted: TBD
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: LBDiscover
Title: Literature-Based Discovery Tools for Biomedical Research
Version: 0.1.0
Date: 2025-05-14
Authors@R: 
    person("Chao Liu", email = "[email protected]", role = c("aut", "cre"),
           comment = c(ORCID = "0000-0002-9979-8272"))
Description: A suite of tools for literature-based discovery in biomedical research. 
    Provides functions for retrieving scientific articles from PubMed and 
    other NCBI databases, extracting biomedical entities (diseases, drugs, genes, etc.), 
    building co-occurrence networks, and applying various discovery models 
    including ABC, AnC, LSI, and BITOLA. The package also includes 
    visualization tools for exploring discovered connections.
License: GPL-3
URL: https://github.com/chaoliu-cl/LBDiscover, http://liu-chao.site/LBDiscover/, https://liu-chao.site/LBDiscover/
BugReports: https://github.com/chaoliu-cl/LBDiscover/issues
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
Depends: 
    R (>= 4.0.0)
Imports: 
    httr (>= 1.4.0),
    xml2 (>= 1.3.0),
    igraph (>= 1.2.0),
    Matrix (>= 1.3.0),
    utils,
    stats,
    grDevices,
    graphics,
    tools,
    rentrez (>= 1.2.0),
    jsonlite (>= 1.7.0)
Suggests:
    openxlsx (>= 4.2.0),
    SnowballC (>= 0.7.0),
    visNetwork (>= 2.1.0),
    spacyr (>= 1.2.0),
    parallel,
    digest (>= 0.6.0),
    irlba (>= 2.3.0),
    knitr,
    rmarkdown,
    base64enc,
    reticulate,
    testthat (>= 3.0.0),
    mockery,
    covr,
    htmltools
VignetteBuilder: knitr
Config/testthat/edition: 3

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • [X] data retrieval
    • [X] data extraction
    • [ ] data munging
    • [ ] data deposition
    • [ ] data validation and testing
    • [ ] workflow automation
    • [ ] version control
    • [X] citation management and bibliometrics
    • [ ] scientific software wrappers
    • [ ] field and lab reproducibility tools
    • [ ] database software bindings
    • [ ] geospatial data
    • [ ] translation
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):
    • Data retrieval: The package provides functions for retrieving scientific articles from PubMed and other NCBI databases, giving systematic access to biomedical literature from major research repositories.
    • Data extraction: It extracts biomedical entities (diseases, drugs, genes, etc.) from the retrieved literature, performing information extraction from scientific texts.
    • Citation management and bibliometrics: It builds co-occurrence networks from the literature and applies discovery models (ABC, AnC, LSI, BITOLA) to find hidden connections between concepts, a form of bibliometric analysis for literature-based discovery research.

  • Who is the target audience and what are scientific applications of this package? Target Audience: LBDiscover is designed for biomedical researchers, bioinformaticians, and data scientists working in literature-based discovery (LBD). The primary users include:

  • Biomedical researchers seeking hidden connections between diseases, drugs, and genes

  • Pharmaceutical researchers exploring drug repurposing opportunities

  • Bioinformaticians building knowledge networks from literature

  • Graduate students and academics studying computational approaches to hypothesis generation

Scientific Applications: The package supports several key research applications:

  1. Drug Discovery and Repurposing: LBD has been used extensively in drug development and repurposing, as well as in predicting adverse drug reactions
  2. Disease-Gene Association Discovery: Using literature-based discovery to identify disease candidate genes
  3. Biomarker Identification: LBD has been explored as a tool to identify diagnostic and prognostic biomarkers for diseases
  4. Hypothesis Generation: Creating testable scientific hypotheses by connecting disparate pieces of literature
  5. Knowledge Network Construction: Building co-occurrence networks to visualize research landscapes
  • Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? There are several R packages that overlap with LBDiscover's functionality, but none provide the same comprehensive approach to literature-based discovery. Similar packages and key differences:
  1. pubmed.mineR. Overlap: PubMed text mining with functions for data visualization and biomedical entity extraction. Difference: Focuses on general text mining and clustering rather than implementing specific LBD models such as ABC, AnC, LSI, and BITOLA.

  2. bibliometrix. Overlap: Comprehensive science mapping analysis with network analysis capabilities and bibliometric workflows. Difference: Designed for general scientometric analysis across all disciplines, not specifically for biomedical literature-based discovery or LBD-specific algorithms.

  3. Data retrieval packages (rentrez, easyPubMed, RISmed). Overlap: All provide interfaces to NCBI/PubMed for retrieving biomedical literature. Difference: These focus solely on data retrieval and do not perform LBD analysis, entity extraction, or hypothesis generation.

How LBDiscover Meets Best-in-Category Criteria:

  1. Unique Functionality: LBDiscover is the first R package to specifically implement established LBD models:
  • ABC Model: The most basic and widespread LBD approach, centered on finding connections between concepts A, B, and C
  • BITOLA: An interactive literature-based biomedical discovery support system using semantic prediction
  • LSI (Latent Semantic Indexing): A statistical technique for improving information retrieval effectiveness, used to assist literature-based discovery
  • AnC Model: Advanced connection models for more sophisticated discovery patterns
  2. Integrated Workflow: Unlike other packages that handle only one aspect (retrieval OR analysis OR visualization), LBDiscover provides a complete workflow from data retrieval through entity extraction to discovery model application and network visualization (a sketch is shown after this list).
  3. Biomedical Specialization: While bibliometrix serves general scientometrics and pubmed.mineR performs general text mining, LBDiscover is specifically designed for biomedical literature-based discovery with domain-specific entity recognition (diseases, drugs, genes).
  4. Modern Implementation: Recent work has focused on integrating Large Language Models into Literature-Based Discovery processes, and LBDiscover is positioned to incorporate such advances while maintaining established methodological foundations.
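A minimal sketch of this integrated workflow, assuming the function names listed in the package's exported API (pubmed_search(), extract_entities(), create_comat(), abc_model(), vis_heatmap()); the argument names and values shown here are illustrative assumptions, not the package's documented signatures:

```r
# Hedged sketch of the retrieval -> extraction -> discovery -> visualization
# workflow. Function names come from LBDiscover's exported API; the argument
# names ("max_results", "a_term") are assumptions for illustration only.
library(LBDiscover)

# 1. Retrieve articles from PubMed (the query string is illustrative)
articles <- pubmed_search("migraine AND magnesium", max_results = 100)

# 2. Extract biomedical entities (diseases, drugs, genes) from the abstracts
entities <- extract_entities(articles)

# 3. Build a term co-occurrence matrix from the extracted entities
comat <- create_comat(entities)

# 4. Apply the ABC discovery model to rank candidate A-B-C connections
abc_results <- abc_model(comat, a_term = "migraine")

# 5. Visualize the top-ranked connections
vis_heatmap(abc_results)
```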

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • [X] Do you intend for this package to go on CRAN?

  • [ ] Do you intend for this package to go on Bioconductor?

  • [ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options
  • [ ] The package is novel and will be of interest to the broad readership of the journal.
  • [ ] The manuscript describing the package is no longer than 3000 words.
  • [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

  • [X] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

chaoliu-cl avatar Sep 24 '25 03:09 chaoliu-cl

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot avatar Sep 24 '25 03:09 ropensci-review-bot

:rocket:

Editor check started

:wave:

ropensci-review-bot avatar Sep 24 '25 03:09 ropensci-review-bot

Checks for LBDiscover (v0.1.0)

git hash: 02f4c075

  • :heavy_check_mark: Package is already on CRAN.
  • :heavy_multiplication_x: does not have a 'codemeta.json' file.
  • :heavy_multiplication_x: does not have a 'contributing' file.
  • :heavy_check_mark: uses 'roxygen2'.
  • :heavy_check_mark: 'DESCRIPTION' has a URL field.
  • :heavy_check_mark: 'DESCRIPTION' has a BugReports field.
  • :heavy_check_mark: Package has at least one HTML vignette
  • :heavy_multiplication_x: These functions do not have examples: [abc_model, anc_model, clear_pubmed_cache, create_report, eval_evidence, extract_entities_workflow, extract_entities, find_term, gen_report, get_dict_cache, get_term_vars, is_valid_biomedical_entity, load_dictionary, lsi_model, merge_entities, min_results, plot_heatmap, plot_network, prep_articles, query_external_api, query_mesh, query_umls, safe_diversify, sanitize_dictionary, valid_entities, validate_biomedical_entity, validate_entity_comprehensive, validate_entity_with_nlp].
  • :heavy_check_mark: Package has continuous integration checks.
  • :heavy_multiplication_x: Package coverage is 22.4% (should be at least 75%).
  • :heavy_multiplication_x: All examples use \dontrun{}.
  • :heavy_check_mark: R CMD check found no errors.
  • :heavy_check_mark: R CMD check found no warnings.
  • :eyes: Some goodpractice linters failed.
  • :eyes: Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with :eyes: may be optionally addressed.)

Package License: GPL-3


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 2214
internal LBDiscover 147
internal methods 9
imports stats 61
imports graphics 58
imports xml2 54
imports utils 51
imports httr 33
imports igraph 19
imports rentrez 9
imports Matrix 8
imports tools 3
imports grDevices 2
imports jsonlite 2
suggests visNetwork 8
suggests parallel 7
suggests irlba 2
suggests reticulate 2
suggests SnowballC 1
suggests spacyr 1
suggests digest 1
suggests openxlsx NA
suggests knitr NA
suggests rmarkdown NA
suggests base64enc NA
suggests testthat NA
suggests mockery NA
suggests covr NA
suggests htmltools NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

c (195), character (171), for (126), length (121), data.frame (115), nrow (112), sapply (111), min (83), list (73), max (72), grepl (55), any (54), unique (53), numeric (50), names (39), which (34), paste0 (31), sum (30), if (29), integer (25), rep (25), return (23), seq_along (22), unlist (22), attr (21), ncol (20), paste (19), tryCatch (19), rbind (18), tolower (18), is.na (17), is.null (16), matrix (16), ceiling (15), lapply (15), rownames (15), strsplit (15), table (15), colnames (14), nchar (13), as.numeric (12), seq_len (11), vector (11), match (9), order (9), round (9), regexpr (8), regmatches (8), setdiff (8), as.character (7), drop (7), gregexpr (7), ifelse (7), rowSums (7), sqrt (7), url (7), body (6), plot (6), range (6), sort (6), dim (5), gsub (5), substr (5), t (5), diff (4), grep (4), seq (4), sprintf (4), switch (4), col (3), diag (3), emptyenv (3), environment (3), file (3), log (3), logical (3), new.env (3), row (3), tapply (3), tempfile (3), all (2), apply (2), by (2), colSums (2), dimnames (2), do.call (2), mean (2), outer (2), row.names (2), sub (2), Sys.time (2), try (2), abs (1), as.data.frame (1), cat (1), colMeans (1), difftime (1), duplicated (1), expression (1), file.path (1), floor (1), format (1), interactive (1), match.arg (1), merge (1), mode (1), packageEvent (1), setHook (1), suppressMessages (1), system.file (1), units (1), unname (1), version (1), which.max (1)

LBDiscover

retry_api_call (16), create_comat (4), load_dictionary (4), pubmed_search (4), string_similarity (4), throttle_api (4), abc_model (3), authenticate_umls (3), cluster_docs (3), count_corpus_terms (3), extract_entities (3), get_pubmed_cache (3), tokenize_text (3), vec_preprocess (3), calc_doc_sim (2), calculate_score (2), create_cache_key (2), create_dummy_dictionary (2), create_term_document_matrix (2), diversify_abc (2), extract_text_ngrams (2), get_color_palette (2), get_dict_cache (2), get_service_ticket (2), is_valid_biomedical_entity (2), load_dict_single (2), load_from_mesh (2), load_from_umls (2), load_mesh_terms_from_pubmed (2), process_mesh_xml (2), abc_model_opt (1), abc_model_sig (1), abc_timeslice (1), add_statistical_significance (1), alternative_validation (1), anc_model (1), apply_bitola_flexible (1), apply_correction (1), b_term_type_filter (1), bitola_model (1), calc_bibliometrics (1), clear_pubmed_cache (1), compare_terms (1), create_citation_net (1), create_report (1), create_single_heatmap (1), create_sparse_comat (1), create_tdm (1), create_vis_heatmap (1), detect_lang (1), diversify_b_terms (1), diversify_c_paths (1), enhance_abc_kb (1), eval_evidence (1), export_chord (1), export_chord_diagram (1), export_network (1), extract_entities_workflow (1), extract_mesh_from_text (1), extract_ner (1), extract_ngrams (1), extract_terms (1), extract_topics (1), fetch_and_parse_gene (1), fetch_and_parse_pmc (1), fetch_and_parse_protein (1), fetch_and_parse_pubmed (1), filter_by_type (1), filter_terms_for_abc_model (1), find_abc_all (1), find_similar_docs (1), find_term (1), gen_report (1), get_pmc_fulltext (1), get_term_vars (1), get_type_dist (1), get_umls_semantic_types (1), is_valid_type (1), list_to_df (1), load_results (1), parse_pubmed_xml (1), preprocess_text (1), process_batch (1), split_into_sentences (1), split_text (1)

stats

df (19), terms (16), p.adjust (5), phyper (4), kmeans (3), profile (3), aggregate (2), runif (2), setNames (2), smooth (2), complete.cases (1), dist (1), pt (1)

graphics

text (29), par (13), title (8), layout (6), arrows (2)

xml2

xml_find_first (19), xml_text (19), xml_find_all (10), read_xml (4), xml_attr (1), xml_name (1)

utils

txtProgressBar (40), read.csv (4), adist (2), write.csv (2), de (1), head (1), URLencode (1)

httr

content (18), GET (8), POST (5), headers (2)

igraph

graph_from_data_frame (12), layout_with_fr (6), degree (1)

methods

new (9)

rentrez

entrez_link (3), entrez_search (3), entrez_fetch (2), entrez_summary (1)

Matrix

t (4), diag (2), sparseMatrix (2)

visNetwork

visEdges (2), visGroups (2), visNetwork (2), visLayout (1), visSave (1)

parallel

clusterExport (3), parLapply (2), detectCores (1), makeCluster (1)

tools

file_ext (3)

grDevices

colorRampPalette (1), rainbow (1)

irlba

irlba (2)

jsonlite

fromJSON (2)

reticulate

import (2)

digest

digest (1)

SnowballC

wordStem (1)

spacyr

spacy_parse (1)


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 13 files) and
  • 1 authors
  • 3 vignettes
  • no internal data file
  • 11 imported packages
  • 105 exported functions (median 47 lines of code)
  • 146 non-exported functions in R (median 48 lines of code)

Statistical properties of package structure, as distributional percentiles relative to all current CRAN packages. The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 13 65.8
files_inst 4 97.1
files_vignettes 3 89.3
files_tests 9 84.9
loc_R 8759 97.6 TRUE
loc_inst 991 77.5
loc_vignettes 925 88.8
loc_tests 1294 86.6
num_vignettes 3 91.0
n_fns_r 251 91.3
n_fns_r_exported 105 95.3 TRUE
n_fns_r_not_exported 146 88.0
n_fns_per_file_r 9 87.6
num_params_per_fn 4 51.1
loc_per_fn_r 48 88.7
loc_per_fn_r_exp 47 77.7
loc_per_fn_r_not_exp 48 89.6
rel_whitespace_R 24 98.5 TRUE
rel_whitespace_inst 23 81.5
rel_whitespace_vignettes 25 85.1
rel_whitespace_tests 31 91.6
doclines_per_fn_exp 20 15.3
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 157 84.3

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

R-CMD-check.yaml

GitHub Workflow Results

id name conclusion sha run_number date
17965843714 pages build and deployment success 04c683 7 2025-09-24
17965695551 pkgdown.yaml success a008bc 4 2025-09-24
17965695558 R-CMD-check.yaml success a008bc 4 2025-09-24

3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following check_fails:

  1. cyclocomp
  2. no_description_date

Test coverage with covr

Package coverage: 22.42

The following files are not completely covered by tests:

file coverage
R/abc_model.R 22.9%
R/comprehensive_summary.R 0%
R/heatmap_visualization.R 6.86%
R/performance_optimalization.R 19.75%
R/pubmed_search.R 0%
R/queries.R 15.5%
R/text_preprocessing.R 0%
R/utils.R 1.93%
R/visualization.R 49.63%
R/zzz.R 10%

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function cyclocomplexity
is_valid_biomedical_entity 161
extract_entities_workflow 145
abc_model 129
sanitize_dictionary 100
vis_heatmap 99
vis_network 77
load_from_umls 71
validate_entity_with_nlp 57
extract_entities 54
pubmed_search 46
load_from_mesh 45
parse_pubmed_xml 45
create_comat 43
create_report 43
run_lbd 41
anc_model 38
load_dictionary 37
extract_ner 35
vis_abc_heatmap 35
export_chord_diagram 33
process_mesh_xml 32
validate_abc 29
abc_model_sig 27
lsi_model 27
abc_timeslice 26
map_ontology 26
shadowtext 26
abc_model_opt 24
eval_evidence 24
process_mesh_chunks 24
export_network 23
query_umls 23
apply_bitola_flexible 22
merge_entities 21
vis_abc_network 21
get_pmc_fulltext 20
validate_entity_comprehensive 20
vec_preprocess 20
bitola_model 19
create_sparse_comat 19
fetch_and_parse_pmc 19
find_abc_all 19
ncbi_search 19
cluster_docs 18
create_citation_net 17
load_mesh_terms_from_pubmed 17
create_tdm 16
create_term_document_matrix 16
extract_topics 16
preprocess_text 16
compare_terms 15
min_results 15

Static code analyses with lintr

lintr found no issues with this package!


4. Other Checks

Details of other checks (click to open)

:heavy_multiplication_x: The following 10 function names are duplicated in other packages:

    • create_report from DataExplorer, prodigenr, reporter
    • extract_entities from medExtractR
    • load_dictionary from ricu
    • merge_results from climwin
    • ncbi_search from taxize
    • parallel_analysis from kim
    • plot_heatmap from dendroTools, dynplot, greatR, MitoHEAR, omu, Plasmidprofiler, RolWinMulCor, romic
    • plot_network from cape, dbnR, HeteroGGM, immcp, imsig, LSVAR, SeqNet, SubgrPlots
    • save_results from data.validator
    • vis_heatmap from immunarch

Package Versions

package version
pkgstats 0.2.0.66
pkgcheck 0.1.2.230

Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

ropensci-review-bot avatar Sep 24 '25 04:09 ropensci-review-bot

Thanks for the submission @chaoliu-cl! The package sounds really neat. Let me know when you've been able to address the ✖️ items found in the check.

I'd be a little concerned about those high complexity values found in the goodpractice checks. Those files and functions are huge, and it looks like there are logical places where you could split up the code. One example might be to put all of the static lists in a sysdata.rda file.
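A minimal sketch of the sysdata.rda approach; the object names below are illustrative placeholders rather than LBDiscover's actual internal objects, and usethis::use_data() with internal = TRUE is the standard mechanism for writing R/sysdata.rda:

```r
# Run from the package root. The object names are illustrative placeholders,
# not LBDiscover's actual internal data.
acronym_corrections <- c("5-ht" = "serotonin", "ca" = "calcium")
common_words <- c("the", "and", "with", "for")

# internal = TRUE writes the objects to R/sysdata.rda, where package
# functions can use them without the objects being exported or documented.
usethis::use_data(acronym_corrections, common_words,
                  internal = TRUE, overwrite = TRUE)
```

Functions can then reference these objects directly, which keeps long static definitions out of the function bodies and should lower the cyclocomplexity counts.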

ldecicco-USGS avatar Sep 25 '25 20:09 ldecicco-USGS

Hi @ldecicco-USGS ,

Thank you for the feedback. I have addressed the highlighted issues including the following:

  • Code Complexity & sysdata.rda: Following your suggestion, I've extracted all static lists into a sysdata.rda file (acronym corrections, term mappings, common words, entity patterns, etc.). This significantly reduced function complexity by removing hundreds of lines of static definitions.
  • codemeta.json & CONTRIBUTING.md: Both files have been added.
  • Function Examples: Added runnable examples (without \dontrun{}) for all previously undocumented functions.
  • Test Coverage: Improved from 22.4% to 75%.
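A quick way to verify the coverage number locally with covr (a sketch, assuming the working directory is the package root):

```r
# Compute test coverage for the package in the current directory
cov <- covr::package_coverage()

covr::percent_coverage(cov)  # overall percentage, e.g. ~75%
covr::report(cov)            # interactive per-file coverage report
```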

chaoliu-cl avatar Oct 05 '25 01:10 chaoliu-cl

@ropensci-review-bot check package

ldecicco-USGS avatar Oct 10 '25 01:10 ldecicco-USGS

Thanks, about to send the query.

ropensci-review-bot avatar Oct 10 '25 01:10 ropensci-review-bot

:rocket:

Editor check started

:wave:

ropensci-review-bot avatar Oct 10 '25 01:10 ropensci-review-bot

Checks for LBDiscover (v0.1.0)

git hash: 60e965ad

  • :heavy_check_mark: Package is already on CRAN.
  • :heavy_check_mark: has a 'codemeta.json' file.
  • :heavy_check_mark: has a 'contributing' file.
  • :heavy_check_mark: uses 'roxygen2'.
  • :heavy_check_mark: 'DESCRIPTION' has a URL field.
  • :heavy_check_mark: 'DESCRIPTION' has a BugReports field.
  • :heavy_check_mark: Package has at least one HTML vignette
  • :heavy_multiplication_x: These functions do not have examples: [anc_model, create_report, lsi_model, query_external_api, query_mesh, query_umls, validate_biomedical_entity, validate_entity_comprehensive, validate_entity_with_nlp].
  • :heavy_check_mark: Package has continuous integration checks.
  • :heavy_check_mark: Package coverage is 75%.
  • :heavy_check_mark: R CMD check found no errors.
  • :heavy_check_mark: R CMD check found no warnings.
  • :eyes: Some goodpractice linters failed.
  • :eyes: Function names are duplicated in other packages
  • :eyes: Examples should not use \dontrun{} unless really necessary.

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with :eyes: may be optionally addressed.)

Package License: GPL-3


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 2234
internal LBDiscover 149
internal methods 9
internal usethis 2
imports stats 61
imports graphics 58
imports xml2 54
imports utils 51
imports httr 33
imports igraph 19
imports rentrez 9
imports Matrix 8
imports tools 3
imports grDevices 2
imports jsonlite 2
suggests visNetwork 8
suggests parallel 7
suggests irlba 2
suggests reticulate 2
suggests SnowballC 1
suggests spacyr 1
suggests digest 1
suggests openxlsx NA
suggests knitr NA
suggests rmarkdown NA
suggests base64enc NA
suggests testthat NA
suggests mockery NA
suggests covr NA
suggests withr NA
suggests htmltools NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

character (195), c (192), data.frame (146), for (125), length (124), nrow (112), list (82), sapply (77), min (76), max (68), rep (64), numeric (54), unique (52), paste0 (48), names (39), which (34), return (33), if (30), sum (30), integer (25), grepl (24), seq_along (22), unlist (22), attr (21), any (20), paste (19), tolower (19), tryCatch (19), ncol (18), rbind (18), is.na (17), is.null (16), matrix (16), ceiling (15), lapply (15), rownames (15), strsplit (15), table (15), colnames (14), nchar (13), as.numeric (12), vector (11), match (9), order (9), round (9), seq_len (9), regexpr (8), regmatches (8), setdiff (8), as.character (7), drop (7), gregexpr (7), ifelse (7), rowSums (7), sqrt (7), url (7), body (6), plot (6), range (6), sort (6), t (6), substr (5), diff (4), dim (4), grep (4), gsub (4), seq (4), sprintf (4), switch (4), col (3), diag (3), emptyenv (3), environment (3), file (3), log (3), logical (3), new.env (3), row (3), tapply (3), tempfile (3), all (2), apply (2), by (2), cat (2), colSums (2), dimnames (2), do.call (2), expression (2), mean (2), outer (2), row.names (2), sub (2), Sys.time (2), try (2), abs (1), as.data.frame (1), colMeans (1), difftime (1), duplicated (1), file.path (1), floor (1), format (1), interactive (1), match.arg (1), merge (1), mode (1), rank (1), suppressMessages (1), system.file (1), units (1), unname (1), version (1), which.max (1)

LBDiscover

retry_api_call (16), create_comat (4), load_dictionary (4), pubmed_search (4), string_similarity (4), throttle_api (4), abc_model (3), authenticate_umls (3), cluster_docs (3), count_corpus_terms (3), extract_entities (3), get_pubmed_cache (3), tokenize_text (3), vec_preprocess (3), calc_doc_sim (2), calculate_score (2), create_cache_key (2), create_dummy_dictionary (2), create_term_document_matrix (2), diversify_abc (2), extract_text_ngrams (2), get_color_palette (2), get_dict_cache (2), get_service_ticket (2), is_valid_biomedical_entity (2), load_dict_single (2), load_from_mesh (2), load_from_umls (2), load_mesh_terms_from_pubmed (2), process_mesh_xml (2), abc_model_opt (1), abc_model_sig (1), abc_timeslice (1), add_statistical_significance (1), alternative_validation (1), anc_model (1), apply_bitola_flexible (1), apply_correction (1), b_term_type_filter (1), bitola_model (1), calc_bibliometrics (1), clear_pubmed_cache (1), compare_terms (1), create_citation_net (1), create_report (1), create_single_heatmap (1), create_sparse_comat (1), create_tdm (1), create_vis_heatmap (1), detect_lang (1), diversify_b_terms (1), diversify_c_paths (1), enhance_abc_kb (1), eval_evidence (1), export_chord (1), export_chord_diagram (1), export_network (1), extract_entities_workflow (1), extract_mesh_from_text (1), extract_ner (1), extract_ngrams (1), extract_terms (1), extract_topics (1), fetch_and_parse_gene (1), fetch_and_parse_pmc (1), fetch_and_parse_protein (1), fetch_and_parse_pubmed (1), filter_by_type (1), filter_terms_for_abc_model (1), find_abc_all (1), find_similar_docs (1), find_term (1), gen_report (1), get_pmc_fulltext (1), get_term_vars (1), get_type_dist (1), get_umls_semantic_types (1), has_general_biomedical_characteristics (1), is_valid_type (1), list_to_df (1), load_results (1), lsi_model (1), parse_pubmed_xml (1), preprocess_text (1), process_batch (1), split_into_sentences (1), split_text (1)

stats

df (19), terms (16), p.adjust (5), phyper (4), kmeans (3), profile (3), aggregate (2), runif (2), setNames (2), smooth (2), complete.cases (1), dist (1), pt (1)

graphics

text (29), par (13), title (8), layout (6), arrows (2)

xml2

xml_find_first (19), xml_text (19), xml_find_all (10), read_xml (4), xml_attr (1), xml_name (1)

utils

txtProgressBar (40), read.csv (4), adist (2), write.csv (2), de (1), head (1), URLencode (1)

httr

content (18), GET (8), POST (5), headers (2)

igraph

graph_from_data_frame (12), layout_with_fr (6), degree (1)

methods

new (9)

rentrez

entrez_link (3), entrez_search (3), entrez_fetch (2), entrez_summary (1)

Matrix

t (4), diag (2), sparseMatrix (2)

visNetwork

visEdges (2), visGroups (2), visNetwork (2), visLayout (1), visSave (1)

parallel

clusterExport (3), parLapply (2), detectCores (1), makeCluster (1)

tools

file_ext (3)

grDevices

colorRampPalette (1), rainbow (1)

irlba

irlba (2)

jsonlite

fromJSON (2)

reticulate

import (2)

usethis

use_data (2)

digest

digest (1)

SnowballC

wordStem (1)

spacyr

spacy_parse (1)


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 14 files) and
  • 1 authors
  • 3 vignettes
  • no internal data file
  • 11 imported packages
  • 107 exported functions (median 46 lines of code)
  • 150 non-exported functions in R (median 49 lines of code)

Statistical properties of package structure, as distributional percentiles relative to all current CRAN packages. The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 14 68.5
files_inst 4 97.0
files_vignettes 3 89.3
files_tests 28 96.9
loc_R 8267 97.3 TRUE
loc_inst 991 77.7
loc_vignettes 946 89.1
loc_tests 9585 99.1 TRUE
num_vignettes 3 91.0
n_fns_r 257 91.5
n_fns_r_exported 107 95.4 TRUE
n_fns_r_not_exported 150 88.4
n_fns_per_file_r 9 86.5
num_params_per_fn 4 51.2
loc_per_fn_r 47 88.3
loc_per_fn_r_exp 46 77.1
loc_per_fn_r_not_exp 50 90.0
rel_whitespace_R 24 98.4 TRUE
rel_whitespace_inst 23 81.8
rel_whitespace_vignettes 24 85.4
rel_whitespace_tests 27 99.6 TRUE
doclines_per_fn_exp 21 17.0
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 160 84.6

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

R-CMD-check.yaml

GitHub Workflow Results

id name conclusion sha run_number date
18251834739 pages build and deployment success d37d15 22 2025-10-05
18251738033 pkgdown.yaml success 60e965 19 2025-10-05
18251738031 R-CMD-check.yaml success 60e965 19 2025-10-05
18251738027 test-coverage success 60e965 14 2025-10-05

3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following check_fails:

  1. cyclocomp
  2. no_description_date

Test coverage with covr

Package coverage: 74.96

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function cyclocomplexity
extract_entities_workflow 145
abc_model 132
sanitize_dictionary 100
vis_heatmap 99
vis_network 80
load_from_umls 71
validate_entity_with_nlp 57
extract_entities 54
load_from_mesh 47
pubmed_search 46
parse_pubmed_xml 45
create_comat 43
create_report 43
run_lbd 41
is_valid_biomedical_entity 40
anc_model 38
load_dictionary 37
extract_ner 35
vis_abc_heatmap 35
export_chord_diagram 33
process_mesh_xml 32
validate_abc 29
abc_model_sig 27
abc_timeslice 26
map_ontology 26
shadowtext 26
abc_model_opt 24
eval_evidence 24
process_mesh_chunks 24
export_network 23
lsi_model 23
query_umls 23
apply_bitola_flexible 22
merge_entities 21
vis_abc_network 21
get_pmc_fulltext 20
validate_entity_comprehensive 20
vec_preprocess 20
bitola_model 19
create_sparse_comat 19
extract_ngrams 19
fetch_and_parse_pmc 19
find_abc_all 19
ncbi_search 19
preprocess_text 19
cluster_docs 18
create_citation_net 17
create_term_document_matrix 17
load_mesh_terms_from_pubmed 17
create_tdm 16
extract_topics 16
validate_term_by_type 16
compare_terms 15
min_results 15

Static code analyses with lintr

lintr found no issues with this package!


4. Other Checks

Details of other checks (click to open)

:heavy_multiplication_x: The following 10 function names are duplicated in other packages:

    • create_report from DataExplorer, prodigenr, reporter
    • extract_entities from medExtractR
    • load_dictionary from ricu
    • merge_results from climwin
    • ncbi_search from taxize
    • parallel_analysis from kim
    • plot_heatmap from dendroTools, dynplot, greatR, MitoHEAR, omu, Plasmidprofiler, RolWinMulCor, romic
    • plot_network from cape, dbnR, HeteroGGM, immcp, imsig, LSVAR, SeqNet, SubgrPlots
    • save_results from data.validator
    • vis_heatmap from immunarch

Package Versions

package version
pkgstats 0.2.0.68
pkgcheck 0.1.2.233

Editor-in-Chief Instructions:

Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.

ropensci-review-bot avatar Oct 10 '25 02:10 ropensci-review-bot

Thanks for the update. I think if you change \dontrun{} to \donttest{}, the rOpenSci checks will pass. In the meantime, I'll clone the package and take it for a test drive 🏎️
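For reference, a hedged roxygen2 sketch of that change; the example body is illustrative and not taken from the package's actual documentation:

```r
#' @examples
#' \donttest{
#'   # \donttest{} code is skipped during routine example checks but can be
#'   # executed with R CMD check --run-donttest; \dontrun{} code is never run.
#'   articles <- pubmed_search("migraine AND magnesium", max_results = 10)
#' }
```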

ldecicco-USGS avatar Oct 17 '25 18:10 ldecicco-USGS

Hi @ldecicco-USGS, is there any update on the review?

chaoliu-cl avatar Nov 10 '25 22:11 chaoliu-cl

@chaoliu-cl Sorry for the delay on this. My turn as EIC started on Nov 1 but I forgot about it and the reminders got lost in the shutdown! Let me dig back into this and will let you know shortly.

jhollist avatar Nov 14 '25 19:11 jhollist

@chaoliu-cl Have had some time to take a look at this and have had a chance to chat with some of the other rOpenSci editors.

LBDiscover is definitely a good fit for rOpenSci; however, we do have some concerns about the size of the package (8000+ lines of code and 100+ exported functions). Given the scope of your goals for LBDiscover, it makes sense that it is big, but its size may make it challenging to find reviewers willing to commit to a review of that scale. Prior to passing this to a handling editor, I wanted to ask whether you would consider breaking the package into two separate packages.

In your README you list the 7 key features of the package (https://github.com/chaoliu-cl/LBDiscover#key-features). Based on these, would it be possible to split it, with the first three (Data Retrieval, Text Preprocessing, and Entity Extraction) going into a data access/processing focused package and the final four (Co-occurrence Analysis, Discovery Models, Validation, and Visualization) into a data analysis/visualization package?

This is not necessarily a requirement for review as I know this would add additional work on the front end for you, but in the long run we believe it would make for easier review and easier long-term maintenance of the package.

Thoughts?

jhollist avatar Dec 02 '25 21:12 jhollist