software-review icon indicating copy to clipboard operation
software-review copied to clipboard

pkgmatch: Find R Packages Matching Either Descriptions or Other R Packages

Open mpadge opened this issue 1 year ago • 48 comments
trafficstars

Submitting Author Name: Mark Padgham Submitting Author Github Handle: @mpadge Repository: https://github.com/ropensci-review-tools/pkgmatch Version submitted: 0.4.2 Submission type: Standard Editor: @MargaretSiple-NOAA Reviewers: @agricolamz, @Selbosh

Due date for @Selbosh: 2025-02-21

Due date for @agricolamz: 2025-02-23 Archive: TBD Version accepted: TBD Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: pkgmatch
Title:  Find R Packages Matching Either Descriptions or Other R Packages
Version: 0.4.2
Authors@R: c(
    person("Mark", "Padgham", , "[email protected]", role = c("aut", "cre"),
           comment = c(ORCID = "0000-0003-2172-5265")),
    person("Davis", "Vaughan", , "[email protected]", role = c("ctb"))
    )
Description: Find R packages matching either descriptions or other R packages.
License: MIT + file LICENSE
URL: https://docs.ropensci.org/pkgmatch/,
    https://github.com/ropensci-review-tools/pkgmatch
BugReports: https://github.com/ropensci-review-tools/pkgmatch/issues
Imports:
    brio,
    checkmate,
    cli,
    curl,
    dplyr,
    fs,
    httr2,
    memoise,
    pbapply,
    Rcpp,
    rvest,
    tibble,
    tidyr,
    tokenizers,
    treesitter,
    treesitter.r,
    vctrs
Suggests:
    gert,
    hms,
    httptest2,
    jsonlite,
    piggyback,
    pkgbuild,
    rappdirs,
    roxygen2,
    testthat (>= 3.0.0),
    withr,
    knitr,
    rmarkdown
LinkingTo:
    Rcpp
Depends: R (>= 3.5.0)
NeedsCompilation: yes
Encoding: UTF-8
Language: en-GB
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
Config/testthat/edition: 3
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • [x] data retrieval
    • [ ] data extraction
    • [ ] data munging
    • [ ] data deposition
    • [ ] data validation and testing
    • [ ] workflow automation
    • [ ] version control
    • [ ] citation management and bibliometrics
    • [x] scientific software wrappers
    • [ ] field and lab reproducibility tools
    • [ ] database software bindings
    • [ ] geospatial data
    • [ ] text analysis
    • [x] rOpenSci tools
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

Data retrieval, because the package includes code to generate language model (LM) embeddings from all R packages retrieved from both CRAN and rOpenSci package repositories. Wrapper because LM embeddings are generated by wrapping interface to ollama software. Plus I've inserted a new, one-off category of "rOpenSci tools" for internal, staff-curated packages.

  • Who is the target audience and what are scientific applications of this package?

Beyond internal rOpenSci use, target audiences are (1) entirely general audience of those interested in searching R packages using either text or code input, and (2) package developers, who can use this package to identify similar packages or functions to code they might be working on.

No, not at all. There are to my knowledge two other R packages for interfacing with LMs: tidyllm and elmer. Both of these are general interfaces to LM API endpoints, while this package specifically uses LM outputs to identify best-matching packages.

Not applicable.

  • If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

  • Explain reasons for any pkgcheck items which your package is unable to pass.

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • [x] Do you intend for this package to go on CRAN?

  • [ ] Do you intend for this package to go on Bioconductor?

  • [ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options
  • [ ] The package is novel and will be of interest to the broad readership of the journal.
  • [ ] The manuscript describing the package is no longer than 3000 words.
  • [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

  • [x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

mpadge avatar Nov 07 '24 16:11 mpadge

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot avatar Nov 07 '24 16:11 ropensci-review-bot

:rocket:

Editor check started

:wave:

ropensci-review-bot avatar Nov 07 '24 16:11 ropensci-review-bot

Checks for pkgmatch (v0.4.2)

git hash: f12ad732

  • :heavy_check_mark: Package name is available
  • :heavy_check_mark: has a 'codemeta.json' file.
  • :heavy_check_mark: has a 'contributing' file.
  • :heavy_check_mark: uses 'roxygen2'.
  • :heavy_check_mark: 'DESCRIPTION' has a URL field.
  • :heavy_check_mark: 'DESCRIPTION' has a BugReports field.
  • :heavy_check_mark: Package has at least one HTML vignette
  • :heavy_check_mark: All functions have examples.
  • :heavy_check_mark: Package has continuous integration checks.
  • :heavy_check_mark: Package coverage is 79.9%.
  • :heavy_check_mark: R CMD check found no errors.
  • :heavy_check_mark: R CMD check found no warnings.

Package License: MIT + file LICENSE


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 545
internal pkgmatch 204
internal utils 25
internal stats 12
internal tools 5
imports fs 47
imports checkmate 20
imports dplyr 17
imports memoise 13
imports treesitter 8
imports httr2 7
imports pbapply 5
imports brio 2
imports rvest 2
imports tibble 2
imports tokenizers 2
imports treesitter.r 1
imports cli NA
imports curl NA
imports Rcpp NA
imports tidyr NA
imports vctrs NA
suggests gert 2
suggests jsonlite 2
suggests hms 1
suggests piggyback 1
suggests httptest2 NA
suggests pkgbuild NA
suggests rappdirs NA
suggests roxygen2 NA
suggests testthat NA
suggests withr NA
suggests knitr NA
suggests rmarkdown NA
linking_to Rcpp NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

lapply (46), data.frame (31), which (27), names (24), vapply (23), length (19), nrow (17), grep (16), c (15), list (15), paste0 (14), seq_len (12), character (11), gsub (11), as.integer (10), by (10), ncol (10), tryCatch (10), unname (10), url (10), order (9), grepl (7), colnames (6), for (6), integer (6), readRDS (6), unlist (6), version (6), basename (5), colSums (5), format (5), ifelse (5), raw (5), seq (5), seq_along (5), tempdir (5), all (4), as.Date (4), asNamespace (4), do.call (4), is.null (4), strsplit (4), system (4), attr (3), difftime (3), getOption (3), log (3), matrix (3), proc.time (3), read.dcf (3), sqrt (3), table (3), unique (3), as.character (2), cbind (2), floor (2), is.na (2), ls (2), match (2), min (2), nzchar (2), options (2), regmatches (2), rowSums (2), sum (2), units (2), any (1), apply (1), as.matrix (1), class (1), cut (1), drop (1), file (1), gregexpr (1), list.files (1), logical (1), mean (1), new.env (1), parseNamespaceFile (1), paste (1), rank (1), rbind (1), readline (1), regexpr (1), rep (1), sort (1), switch (1), Sys.Date (1), Sys.getenv (1), Sys.time (1), system.file (1), tolower (1), unclass (1), vector (1)

pkgmatch

bm25_tokens_list (8), get_embeddings (7), not_null_index (7), bm25_idf (6), get_pkg_text (6), get_pkg_code (5), pkgmatch_bm25 (5), cosine_similarity (4), dl_prev_data (4), pkgmatch_embeddings_from_pkgs (4), rm_fns_from_pkg_txt (4), bm25_tokens (3), get_all_fn_descs (3), get_cache_file_name (3), get_embeddings_from_ollama (3), jina_model (3), pkgmatch_bm25_from_idf (3), pkgmatch_load_data (3), pkgmatch_treesitter_fn_tags (3), append_cols (2), attach_ns (2), bm25_idf_internal (2), bm25_tokens_internal (2), bm25_tokens_list_internal (2), days_in_this_month (2), dl_one_tarball (2), extract_tarball (2), get_calls (2), get_calls_in_functions (2), get_embeddings_intern (2), get_fn_defs_namespace (2), get_fn_descs_from_ns (2), get_local_pkg_dep_fns (2), get_local_pkg_deps (2), get_pkg_readme (2), get_pkg_text_internal (2), get_pkg_text_namespace (2), input_is_pkg (2), is_docker_sudo (2), is_windows (2), list_new_cran_updates (2), load_data_internal (2), m_list_remote_files (2), ollama_dl_jina_model (2), opt_is_quiet (2), pkg_fns_from_r_search (2), pkg_fns_from_r_search_internal (2), pkg_is_installed (2), pkg_name_from_path (2), pkgmatch_bm25_fn_calls (2), pkgmatch_bm25_fn_calls_internal (2), pkgmatch_bm25_from_idf_internal (2), pkgmatch_bm25_internal (2), pkgmatch_cache_path (2), pkgmatch_update_cran (2), append_data_to_bm25 (1), append_data_to_embeddings (1), append_data_to_fn_calls (1), apply_col_names (1), attach_base_rcmd_ns (1), attach_local_dep_namespaces (1), attach_this_pkg_namespace (1), convert_paths_to_pkgs (1), desc_template (1), extract_data_from_local_dir (1), fn_names_base (1), fn_names_rcmd (1), get_fn_defs_local (1), get_pkg_exported_fns (1), get_pkg_text_local (1), has_ollama (1), has_ollama_docker (1), has_ollama_local (1), head.pkgmatch (1), input_is_path (1), input_mentions_functions (1), make_cran_version_column (1), modify_by_lm_prop (1), ollama_check (1), ollama_has_jina_model (1), ollama_is_running (1), ollama_models (1), order_output (1), pkg_install_path (1), pkgmatch_browse (1), pkgmatch_cache_update_interval (1), pkgmatch_dl_data (1), pkgmatch_embeddings_from_text (1), pkgmatch_rerank (1), pkgmatch_similar_fns (1), pkgmatch_similar_pkgs (1), pkgmatch_update_data (1), pkgmatch_update_ropensci (1), rcmd_pkgs (1), rcpp_bm25 (1), registry_daily_chunk (1), rename_files_in_r (1), ros_registry (1), similar_pkgs_from_pkg (1), similar_pkgs_from_pkg_internal (1), similarity_embeddings (1), tok_lists_to_idfs (1), tressitter_calls_in_package (1)

fs

path (19), dir_ls (9), path_temp (7), dir_create (5), path_ext (3), file_exists (1), file_info (1), path_ext_set (1), path_real (1)

utils

installed.packages (4), lsf.str (4), data (3), packageDescription (3), prompt (3), tar (2), untar (2), browseURL (1), getFromNamespace (1), tail (1), timestamp (1)

checkmate

assert_character (7), assert_integerish (3), assert_matrix (2), assert_names (2), assert_numeric (2), check_file_exists (2), assert_list (1), assert_logical (1)

dplyr

left_join (8), rename (3), mutate (2), last_col (1), n (1), relocate (1), summarise (1)

memoise

memoise (13)

stats

dt (5), start (3), end (2), line (2)

treesitter

query_captures (3), node_text (2), parser (1), parser_parse (1), tree_root_node (1)

httr2

req_headers (2), request (2), resp_body_json (2), req_perform (1)

pbapply

pblapply (5)

tools

parse_Rd (2), Rd_db (2), CRAN_package_db (1)

brio

read_lines (2)

gert

git_clone (2)

jsonlite

read_json (2)

rvest

html_table (1), read_html (1)

tibble

new_tibble (2)

tokenizers

count_words (1), tokenize_words (1)

hms

hms (1)

piggyback

pb_download (1)

treesitter.r

language (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in C++ (5% in 2 files) and R (95% in 20 files)
  • 1 authors
  • 6 vignettes
  • no internal data file
  • 17 imported packages
  • 14 exported functions (median 14 lines of code)
  • 218 non-exported functions in R (median 12 lines of code)
  • 4 R functions (median 12 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 20 79.8
files_src 2 79.5
files_vignettes 6 96.8
files_tests 10 87.4
loc_R 1905 81.1
loc_src 91 13.6
loc_vignettes 463 74.2
loc_tests 694 77.2
num_vignettes 6 97.6 TRUE
n_fns_r 232 90.5
n_fns_r_exported 14 56.0
n_fns_r_not_exported 218 93.0
n_fns_src 4 21.1
n_fns_per_file_r 7 79.5
n_fns_per_file_src 2 27.8
num_params_per_fn 2 8.2
loc_per_fn_r 12 36.8
loc_per_fn_r_exp 14 33.6
loc_per_fn_r_not_exp 12 39.8
loc_per_fn_src 12 38.9
rel_whitespace_R 24 85.6
rel_whitespace_src 26 21.8
rel_whitespace_vignettes 20 57.6
rel_whitespace_tests 21 77.1
doclines_per_fn_exp 28 29.2
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 187 87.0

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

R-CMD-check

GitHub Workflow Results

id name conclusion sha run_number date
11727438106 docker skipped f12ad7 23 2024-11-07
11727438100 pkgcheck NA f12ad7 96 2024-11-07
11727438103 R-CMD-check success f12ad7 292 2024-11-07
11727438110 test-coverage success f12ad7 292 2024-11-07
11727438101 Update pkgmatch data NA f12ad7 66 2024-11-07

3b. goodpractice results

R CMD check with rcmdcheck

rcmdcheck found no errors, warnings, or notes

Test coverage with covr

Package coverage: 79.93

Cyclocomplexity with cyclocomp

The following function have cyclocomplexity >= 15:

function cyclocomplexity
get_pkg_readme 17

Static code analyses with lintr

lintr found no issues with this package!


Package Versions

package version
pkgstats 0.2.0.47
pkgcheck 0.1.2.63

Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

ropensci-review-bot avatar Nov 07 '24 17:11 ropensci-review-bot

@ropensci-review-bot assign @MargaretSiple-NOAA as editor

emilyriederer avatar Nov 23 '24 14:11 emilyriederer

Assigned! @MargaretSiple-NOAA is now the editor

ropensci-review-bot avatar Nov 23 '24 14:11 ropensci-review-bot

@ropensci-review-bot seeking reviewers

MargaretSiple-NOAA avatar Dec 04 '24 00:12 MargaretSiple-NOAA

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/671_status.svg)](https://github.com/ropensci/software-review/issues/671)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

ropensci-review-bot avatar Dec 04 '24 00:12 ropensci-review-bot

(just a note: I put 'seeking reviewers' before putting in my editor checks but I have not forgotten them. I started them today and will continue tomorrow.)

MargaretSiple-NOAA avatar Dec 04 '24 01:12 MargaretSiple-NOAA

Editor checks:

  • [X] Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
    • [X] Is the case for the package well made?
    • [X] Is the reference index page clear (grouped by topic if necessary)?
    • [X] Are vignettes readable, sufficiently detailed and not just perfunctory?
  • [X] Fit: The package meets criteria for fit and overlap.
  • [x] Installation instructions: Are installation instructions clear enough for human users? - Yes but see notes below.
  • [x] Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling? Yes but see notes below. I think I just can't fully evaluate this until I can install docker and ollama on my personal computer and test the functions
  • [X] Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
  • [X] License: The package has a CRAN or OSI accepted license.
  • [X] Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments

Nice package, @mpadge ! This will be useful for anyone working on package development or review. I wrote a few notes on what I saw during the editor check process-- they're longer than I usually write, but I figure these will probably make the reviewers' lives easier so it doesn't hurt to mention them early.

A few notes of mine from the editor check process:

  • The pkgdown page seems to indicate that the user can set the corpus to rOpenSci or to CRAN, but some functions only provide search results for rOpenSci, e.g., pkgmatch_bm25() requires the corpus to be rOpenSci-related. Some clarity about which functions apply to which package corpi (?) might be helpful.

  • I got 1 note from running devtools::check(): "Package suggested but not available for checking: 'piggyback'" -- I don't worry about these too much just wanted to flag it just in case.

  • The ollama connection gave me some trouble at the start, so I imagine future users might struggle on this. I recommend revising the documentation a little bit to make it more obvious what needs to happen with ollama before users can get started: a) Consider adding a sentence to the top of the "Get started" article that indicates that docker and ollama installs need to happen first and foremost. Here, I would link to the "ollama" article. b) Change the title of the "ollama" article to say something like "Before you begin: ollama" or "Before you begin: ollama installation", so that when people are browsing, it's very clear to them that they'll need to read that article first.

  • The beginning of the Docker section in the ollama article should indicate that users need to have Docker installed as well. I learned from this process that my institution actually doesn't have a Docker license! This may impede editors/reviewers who work at NOAA. It sounds like they're working on getting a license but for now I don't have access to it.

  • I had some trouble getting test coverage my usual way (covr::package_coverage() fails with an error about a temp file)... but devtools::test() works great). Some tests are failing but I'm pretty sure that's because I don't have ollama properly installed on my machine. It might be nice to have a one-line check in one of the early vignettes that can show people whether they are missing any of the components need to run the fns.


MargaretSiple-NOAA avatar Dec 04 '24 23:12 MargaretSiple-NOAA

Thanks @MargaretSiple-NOAA for really useful feedback! The issue linked above has details of changes made in response. I'd say the most important of those for reviewers of this package is that I've added tests/README explaining that tests can be run without having ollama installed or running anywhere.

mpadge avatar Dec 05 '24 10:12 mpadge

@ropensci-review-bot Add @agricolamz as reviewer

MargaretSiple-NOAA avatar Dec 12 '24 19:12 MargaretSiple-NOAA

@agricolamz added to the reviewers list. Review due date is 2025-01-02. Thanks @agricolamz for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot avatar Dec 12 '24 19:12 ropensci-review-bot

@agricolamz: If you haven't done so, please fill this form for us to update our reviewers records.

ropensci-review-bot avatar Dec 12 '24 19:12 ropensci-review-bot

@ropensci-review-bot set due date for @agricolamz to 2025-01-31

MargaretSiple-NOAA avatar Dec 12 '24 19:12 MargaretSiple-NOAA

Review due date for @agricolamz is now 31-January-2025

ropensci-review-bot avatar Dec 12 '24 19:12 ropensci-review-bot

Hi @mpadge -- I was just revising my editor checks above after your revisions, and everything is looking good except some issues I had with devtools::test() - these are documented in the issue I logged on the pkgmatch github page (above). I don't know what they mean but I think we should have a resolution before reviewers are running similar checks.

MargaretSiple-NOAA avatar Jan 08 '25 01:01 MargaretSiple-NOAA

Thanks @MargaretSiple-NOAA, I've fixed the issue linked above, so devtools::test() should now work for everybody.

mpadge avatar Jan 09 '25 13:01 mpadge

Thank for for addressing that , @mpadge ! The tests now run without issues.

Two more quick things:

  • pkgmatch_similar_pkgs(..., corpus = "cran") always gives a warning Error in matrix(emb[[what]], nrow = nrow, ncol = npkgs) : non-numeric matrix extent. This renders fine on the pkgdown page here so I suspect it is a computer-specific issue. Everything works as expected when corpus = "ropensci".
  • Several of the tests fail and trace back to parseNamespaceFile(pkg_name, package.lib = lp). I think it's fine to fix this when reviews are in but wanted to give you a heads-up while I'm still searching for a Reviewer 2.

MargaretSiple-NOAA avatar Jan 14 '25 23:01 MargaretSiple-NOAA

@ropensci-review-bot add @Selbosh as reviewer

MargaretSiple-NOAA avatar Jan 27 '25 18:01 MargaretSiple-NOAA

@Selbosh added to the reviewers list. Review due date is 2025-02-17. Thanks @Selbosh for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot avatar Jan 27 '25 18:01 ropensci-review-bot

@Selbosh: If you haven't done so, please fill this form for us to update our reviewers records.

ropensci-review-bot avatar Jan 27 '25 18:01 ropensci-review-bot

@ropensci-review-bot Set due date for @Selbosh to 2025-02-21

MargaretSiple-NOAA avatar Jan 28 '25 17:01 MargaretSiple-NOAA

Review due date for @Selbosh is now 21-February-2025

ropensci-review-bot avatar Jan 28 '25 17:01 ropensci-review-bot

Dear @mpadge this is to mark the start of my EiC rotation. I'm reviewing all open issues and noting what I see.

I see one review is due about now and the second one on Feb 21st. It all looks good to me. I'll step back. Thanks!

maurolepore avatar Feb 02 '25 17:02 maurolepore

@ropensci-review-bot Set due date for @agricolamz to 2025-02-21

MargaretSiple-NOAA avatar Feb 03 '25 06:02 MargaretSiple-NOAA

Review due date for @agricolamz is now 21-February-2025

ropensci-review-bot avatar Feb 03 '25 06:02 ropensci-review-bot

@mpadge Having some issues getting ollama to cooperate on Ubuntu; unfortunately this also causes the package to give a cryptic error message.

pkgmatch_similar_pkgs(input)
Error in names(out) <- nms : 
  'names' attribute [1] must be the same length as the vector [0]

Found some documentation here so will try to figure it out and get the review done soon.

Selbosh avatar Feb 12 '25 17:02 Selbosh

I think I found one half of the the problem: set_ollama_url does not modify the system environment variables, so the system call to ollama list within ollama_check() will fail depending on how you have set up your shell environment variables (e.g. if you export OLLAMA_HOST=localhost:8888 but then open R in another session, then the exported variable is lost).

In addition, ollama_check() fails and gives the error above if ollama is running, but without any models downloaded (it's trying to iterate over a list of length zero, I guess?)

So the current sequence, after setting ollama serve running on localhost:8888 in another shell and manually pulling the two embeddings in the vignette, is:

Sys.getenv('OLLAMA_URL')
## [1] ""
library(pkgmatch)
ollama_check()
## Error: could not connect to ollama app, is it running?
## Error in out[[1]] : subscript out of bounds
## In addition: Warning message:
## In system("ollama list", intern = TRUE) :
##  running command 'ollama list' had status 1
set_ollama_url('127.0.0.1:8888')
ollama_check()
## Error: could not connect to ollama app, is it running?
## Error in out[[1]] : subscript out of bounds
## In addition: Warning message:
## In system("ollama list", intern = TRUE) :
##  running command 'ollama list' had status 1
Sys.setenv(OLLAMA_HOST = '127.0.0.1:8888')
ollama_check()
## [1] TRUE

But then I still do not get much farther, because

pkgmatch_similar_pkgs(input)
## Error in `httr2::req_perform()`:
## ! Failed to perform HTTP request.
## Caused by error in `curl::curl_fetch_memory()`:
## ! Couldn't connect to server [localhost]: Failed to connect to localhost port 11434 after 1 ms: Connection refused
## Run `rlang::last_trace()` to see where the error occurred.

which implies that there is least one other place that 11434 is hard-coded. A quick search found

https://github.com/ropensci-review-tools/pkgmatch/blob/21c1e138535f86de6501ec8549f7c8a6f6ac4bf9/R/embeddings.R#L240

which unfortunately means I can't proceed on Ubuntu without fixing this bug first.

Selbosh avatar Feb 12 '25 17:02 Selbosh

Package Review

Hello @mpadge and @MargaretSiple-NOAA! Thank you giving me the opportunity to review the package. It is my first package review, and I am still learning the process, so I hope that my feedback is useful. As mentioned above, I had a couple of issues getting started, but have found some fixes that should with any luck make it easier for others who encounter similar problems.

Please let me know if I can clarify anything. Here are my comments:

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • Briefly describe any working relationship you have (had) with the package authors.
  • [x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • [ ] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
  • [x] Installation instructions: for the development version of package and any non-standard dependencies in README
  • [x] Vignette(s): demonstrating major functionality that runs successfully locally
  • [x] Function Documentation: for all exported functions
  • [x] Examples: (that run successfully locally) for all exported functions
  • [x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • [x] Installation: Installation succeeds as documented.
  • [x] Functionality: Any functional claims of the software have been confirmed.
  • [x] Performance: Any performance claims of the software have been confirmed.
  • [x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • [x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 5

  • [x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Documentation

In the README, the statement of need should be more prominent; it's not obvious who the target audience of the package is or why the package is needed. It would be great if you could explain early on, clearly, what problem pkgmatch solves, maybe giving examples of use cases from the start.

I think that you could improve the documentation by making it more self-contained and making it easier for the user to find information quickly. Some information that should be in the README is only in a vignette, and some information that should be at the start of one vignette is instead left as a link to another vignette. The dependence on LLMs is also somewhat buried: this could be a make-or-break feature for many users, but it's only mentioned at the bottom of the Installation section. I can imagine it could be a selling point that pkgmatch exploits LLMs, but it could also be a deal-breaker for some users either due to technical limitations or ethical concerns.

The package relies on ollama but it's not explained in the README what this is or why the package should depend on it, or whether you can use some functions without it. There is a function called ollama_check() that ostensibly checks for and installs ollama: add this to the README, as it seems like the first thing you would want to run after loading the package! "Procedures for setting that up are described in a separate vignette" is not super helpful without a direct link to or giving the name of said vignette! Ideally I'd put much of this information in the README, because it could be a decider on whether somebody can/will install the package in the first place. Given we are talking about local LLMs here, not everybody will know what their computational demands are, so it would be helpful to give some general indication of system requirements in the 'Before you begin' vignette and maybe in the README.

Generally speaking, I think the documentation assumes a bit too much knowledge on the part of the user, by not motivating the task clearly enough, not explaining what some functions do or what some terms mean. There are one or two functions documented in terms of what they are not rather than what they are, which can be a bit confusing (e.g. pkgmatch_bm25_fn_calls()). Some function descriptions (e.g. pkgmatch_bm25()) simply comprise an external Web link, which is not ideal for offline use, nor for users who want to know what a function does without having to click around. For the same functions, the current Titles in the documentation are too long or specific and would be better off as Descriptions, and the current Descriptions as Details or Notes.

The purpose of some of the vignettes is unclear. For example, "Good-enough practices for language model packages": who is this for, exactly? Why does it need to be a vignette? Why do "Data caching and updating" or "Why local LMs?" need their own vignettes, instead of being included as sections in a larger general guide? This information is surely useful, but it could be merged into a single vignette and/or the README, maybe with a dedicated vignette for developers, to better signpost the user to the information they need (or don't need).

Functionality

I was able to install the package (v0.4.3.018) first try on WSL2 (Ubuntu on Windows 11) and didn't face any issues with building vignettes or R CMD CHECK. All tests passed successfully.

Not really sure where to start, I decided to run the example in the README:

pkgmatch_similar_pkgs(input)

This threw an error:

Error in names(out) <- nms : 
  'names' attribute [1] must be the same length as the vector [0]

I then remembered that I hadn't started the ollama server, so I go to the command line and run ollama serve, to get

Error: listen tcp 127.0.0.1:11434: bind: address already in use

This might be quite esoteric but could be worth mentioning in the vignette. A bit of detective work suggests that the error message above is in fact coming from ollama_check(), which when run throws exactly the same error. This left me a bit stuck, due to a couple of issues in the error handling, and due to a bug in the embeddings.R file. I made a local patch to fix it on my end and was able to proceed from here.

Once I got the package working, the benefits of it seem clear! I can see this being immensely useful to package developers and advanced users who are looking for existing functionality, either for specialist tasks or trying not to reinvent the wheel while developing their own packages. However, as described above, the documentation comes across as being aimed at people who are already familiar with the package, rather than to new users who are trying to get it to work for the first time.

Another interface issue I note is that with most function calls, there is a slight delay and we see that data is being downloaded in the background:

pkgmatch_similar_pkgs (".", corpus = "cran")
##  [100%] Downloaded 231213455 bytes...
##  [100%] Downloaded 36834164 bytes...
##  [100%] Downloaded 3948804 bytes...
##  [100%] Downloaded 13016426 bytes...

but what data is being downloaded, why and where to? Is it the corpus? Is it the embeddings? Maybe we should ask the user if they'd like to use up their bandwidth and disk space on this, or reassure them that it's going to a temporary directory and/or can easily be cleaned out at the end of the session? Which functionality will work offline, if any? This could be made a bit clearer on the first use.

The browse = TRUE function threw a further error for me, possibly related to using VS Code.

pkgmatch_similar_fns (input, browse = TRUE)
# VSCode WebView only supports showing local http content.
# Opening in external browser...
# Browsing https://docs.ropensci.org/charlatan/reference/SequenceProvider
# Error in vapply(urls, utils::browseURL, integer(1L)) : 
#   values must be length 1,
#  but FUN(X[[1]]) result is length 0

However, the help page successfully opened in the browser, so I'm not sure what the issue is here. I was able to run other examples successfully locally, though nearly all of them are 'Don't run' because of the external dependencies; I wonder if there is the possibility to embed some toy embeddings and system.files in the package?

Other comments

In the examples and README, you have spaces between function names and the brackets, e.g. fun () instead of fun(), which is not recommended in most style guides.

Overall I think this is a good idea for a package with real potential, but the main thing holding it back is selling to the user what it's for and giving an easy way to get started. To do this, I'd suggest filling in the gaps and maybe focus on having a vignette with an extended use case that shows off the package's capabilities. There are also one or two bugs or edge cases that thwarted me personally, but might not come up for other users, and so this might have coloured my impressions of the package.

In the long run, I'd suggest clearly marking out which functionality is dependent on which external tools, as then a new user might be able to explore some features even if they have issues with others or do not want to set them up yet.

Selbosh avatar Feb 12 '25 18:02 Selbosh

Thank you very much @Selbosh for your review. You've really immediately picked up the big weakness of package design thus far: It was pulled together for a very specific use case, with documentation mostly hobbled together around that narrow vision, rather than me having thought nearly as much as I should have about general usage. In short: I agree with all of your concerns, and will address them hopefully within the coming fortnight. You should see several new issues in the repo linked immediately below here.

mpadge avatar Feb 14 '25 12:02 mpadge