software-review
software-review copied to clipboard
pkgmatch: Find R Packages Matching Either Descriptions or Other R Packages
Submitting Author Name: Mark Padgham Submitting Author Github Handle: @mpadge Repository: https://github.com/ropensci-review-tools/pkgmatch Version submitted: 0.4.2 Submission type: Standard Editor: @MargaretSiple-NOAA Reviewers: @agricolamz, @Selbosh
Due date for @Selbosh: 2025-02-21Due date for @agricolamz: 2025-02-23 Archive: TBD Version accepted: TBD Language: en
- Paste the full DESCRIPTION file inside a code block below:
Package: pkgmatch
Title: Find R Packages Matching Either Descriptions or Other R Packages
Version: 0.4.2
Authors@R: c(
person("Mark", "Padgham", , "[email protected]", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-2172-5265")),
person("Davis", "Vaughan", , "[email protected]", role = c("ctb"))
)
Description: Find R packages matching either descriptions or other R packages.
License: MIT + file LICENSE
URL: https://docs.ropensci.org/pkgmatch/,
https://github.com/ropensci-review-tools/pkgmatch
BugReports: https://github.com/ropensci-review-tools/pkgmatch/issues
Imports:
brio,
checkmate,
cli,
curl,
dplyr,
fs,
httr2,
memoise,
pbapply,
Rcpp,
rvest,
tibble,
tidyr,
tokenizers,
treesitter,
treesitter.r,
vctrs
Suggests:
gert,
hms,
httptest2,
jsonlite,
piggyback,
pkgbuild,
rappdirs,
roxygen2,
testthat (>= 3.0.0),
withr,
knitr,
rmarkdown
LinkingTo:
Rcpp
Depends: R (>= 3.5.0)
NeedsCompilation: yes
Encoding: UTF-8
Language: en-GB
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
Config/testthat/edition: 3
VignetteBuilder: knitr
Scope
-
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [x] data retrieval
- [ ] data extraction
- [ ] data munging
- [ ] data deposition
- [ ] data validation and testing
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [x] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
- [x] rOpenSci tools
-
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
Data retrieval, because the package includes code to generate language model (LM) embeddings from all R packages retrieved from both CRAN and rOpenSci package repositories. Wrapper because LM embeddings are generated by wrapping interface to ollama software. Plus I've inserted a new, one-off category of "rOpenSci tools" for internal, staff-curated packages.
- Who is the target audience and what are scientific applications of this package?
Beyond internal rOpenSci use, target audiences are (1) entirely general audience of those interested in searching R packages using either text or code input, and (2) package developers, who can use this package to identify similar packages or functions to code they might be working on.
- Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
No, not at all. There are to my knowledge two other R packages for interfacing with LMs: tidyllm and elmer. Both of these are general interfaces to LM API endpoints, while this package specifically uses LM outputs to identify best-matching packages.
- (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
Not applicable.
-
If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
-
Explain reasons for any
pkgcheckitems which your package is unable to pass.
Technical checks
Confirm each of the following by checking the box.
- [x] I have read the rOpenSci packaging guide.
- [x] I have read the author guide and I expect to maintain this package for at least 2 years or to find a replacement.
This package:
- [x] does not violate the Terms of Service of any service it interacts with.
- [x] has a CRAN and OSI accepted license.
- [x] contains a README with instructions for installing the development version.
- [x] includes documentation with examples for all functions, created with roxygen2.
- [x] contains a vignette with examples of its essential functions and uses.
- [x] has a test suite.
- [x] has continuous integration, including reporting of test coverage.
Publication options
-
[x] Do you intend for this package to go on CRAN?
-
[ ] Do you intend for this package to go on Bioconductor?
-
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
- (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
- (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
- (Please do not submit your package separately to Methods in Ecology and Evolution)
Code of conduct
- [x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.
Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.
:rocket:
Editor check started
:wave:
Checks for pkgmatch (v0.4.2)
git hash: f12ad732
- :heavy_check_mark: Package name is available
- :heavy_check_mark: has a 'codemeta.json' file.
- :heavy_check_mark: has a 'contributing' file.
- :heavy_check_mark: uses 'roxygen2'.
- :heavy_check_mark: 'DESCRIPTION' has a URL field.
- :heavy_check_mark: 'DESCRIPTION' has a BugReports field.
- :heavy_check_mark: Package has at least one HTML vignette
- :heavy_check_mark: All functions have examples.
- :heavy_check_mark: Package has continuous integration checks.
- :heavy_check_mark: Package coverage is 79.9%.
- :heavy_check_mark: R CMD check found no errors.
- :heavy_check_mark: R CMD check found no warnings.
Package License: MIT + file LICENSE
1. Package Dependencies
Details of Package Dependency Usage (click to open)
The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.
| type | package | ncalls |
|---|---|---|
| internal | base | 545 |
| internal | pkgmatch | 204 |
| internal | utils | 25 |
| internal | stats | 12 |
| internal | tools | 5 |
| imports | fs | 47 |
| imports | checkmate | 20 |
| imports | dplyr | 17 |
| imports | memoise | 13 |
| imports | treesitter | 8 |
| imports | httr2 | 7 |
| imports | pbapply | 5 |
| imports | brio | 2 |
| imports | rvest | 2 |
| imports | tibble | 2 |
| imports | tokenizers | 2 |
| imports | treesitter.r | 1 |
| imports | cli | NA |
| imports | curl | NA |
| imports | Rcpp | NA |
| imports | tidyr | NA |
| imports | vctrs | NA |
| suggests | gert | 2 |
| suggests | jsonlite | 2 |
| suggests | hms | 1 |
| suggests | piggyback | 1 |
| suggests | httptest2 | NA |
| suggests | pkgbuild | NA |
| suggests | rappdirs | NA |
| suggests | roxygen2 | NA |
| suggests | testthat | NA |
| suggests | withr | NA |
| suggests | knitr | NA |
| suggests | rmarkdown | NA |
| linking_to | Rcpp | NA |
Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.
base
lapply (46), data.frame (31), which (27), names (24), vapply (23), length (19), nrow (17), grep (16), c (15), list (15), paste0 (14), seq_len (12), character (11), gsub (11), as.integer (10), by (10), ncol (10), tryCatch (10), unname (10), url (10), order (9), grepl (7), colnames (6), for (6), integer (6), readRDS (6), unlist (6), version (6), basename (5), colSums (5), format (5), ifelse (5), raw (5), seq (5), seq_along (5), tempdir (5), all (4), as.Date (4), asNamespace (4), do.call (4), is.null (4), strsplit (4), system (4), attr (3), difftime (3), getOption (3), log (3), matrix (3), proc.time (3), read.dcf (3), sqrt (3), table (3), unique (3), as.character (2), cbind (2), floor (2), is.na (2), ls (2), match (2), min (2), nzchar (2), options (2), regmatches (2), rowSums (2), sum (2), units (2), any (1), apply (1), as.matrix (1), class (1), cut (1), drop (1), file (1), gregexpr (1), list.files (1), logical (1), mean (1), new.env (1), parseNamespaceFile (1), paste (1), rank (1), rbind (1), readline (1), regexpr (1), rep (1), sort (1), switch (1), Sys.Date (1), Sys.getenv (1), Sys.time (1), system.file (1), tolower (1), unclass (1), vector (1)
pkgmatch
bm25_tokens_list (8), get_embeddings (7), not_null_index (7), bm25_idf (6), get_pkg_text (6), get_pkg_code (5), pkgmatch_bm25 (5), cosine_similarity (4), dl_prev_data (4), pkgmatch_embeddings_from_pkgs (4), rm_fns_from_pkg_txt (4), bm25_tokens (3), get_all_fn_descs (3), get_cache_file_name (3), get_embeddings_from_ollama (3), jina_model (3), pkgmatch_bm25_from_idf (3), pkgmatch_load_data (3), pkgmatch_treesitter_fn_tags (3), append_cols (2), attach_ns (2), bm25_idf_internal (2), bm25_tokens_internal (2), bm25_tokens_list_internal (2), days_in_this_month (2), dl_one_tarball (2), extract_tarball (2), get_calls (2), get_calls_in_functions (2), get_embeddings_intern (2), get_fn_defs_namespace (2), get_fn_descs_from_ns (2), get_local_pkg_dep_fns (2), get_local_pkg_deps (2), get_pkg_readme (2), get_pkg_text_internal (2), get_pkg_text_namespace (2), input_is_pkg (2), is_docker_sudo (2), is_windows (2), list_new_cran_updates (2), load_data_internal (2), m_list_remote_files (2), ollama_dl_jina_model (2), opt_is_quiet (2), pkg_fns_from_r_search (2), pkg_fns_from_r_search_internal (2), pkg_is_installed (2), pkg_name_from_path (2), pkgmatch_bm25_fn_calls (2), pkgmatch_bm25_fn_calls_internal (2), pkgmatch_bm25_from_idf_internal (2), pkgmatch_bm25_internal (2), pkgmatch_cache_path (2), pkgmatch_update_cran (2), append_data_to_bm25 (1), append_data_to_embeddings (1), append_data_to_fn_calls (1), apply_col_names (1), attach_base_rcmd_ns (1), attach_local_dep_namespaces (1), attach_this_pkg_namespace (1), convert_paths_to_pkgs (1), desc_template (1), extract_data_from_local_dir (1), fn_names_base (1), fn_names_rcmd (1), get_fn_defs_local (1), get_pkg_exported_fns (1), get_pkg_text_local (1), has_ollama (1), has_ollama_docker (1), has_ollama_local (1), head.pkgmatch (1), input_is_path (1), input_mentions_functions (1), make_cran_version_column (1), modify_by_lm_prop (1), ollama_check (1), ollama_has_jina_model (1), ollama_is_running (1), ollama_models (1), order_output (1), pkg_install_path (1), pkgmatch_browse (1), pkgmatch_cache_update_interval (1), pkgmatch_dl_data (1), pkgmatch_embeddings_from_text (1), pkgmatch_rerank (1), pkgmatch_similar_fns (1), pkgmatch_similar_pkgs (1), pkgmatch_update_data (1), pkgmatch_update_ropensci (1), rcmd_pkgs (1), rcpp_bm25 (1), registry_daily_chunk (1), rename_files_in_r (1), ros_registry (1), similar_pkgs_from_pkg (1), similar_pkgs_from_pkg_internal (1), similarity_embeddings (1), tok_lists_to_idfs (1), tressitter_calls_in_package (1)
fs
path (19), dir_ls (9), path_temp (7), dir_create (5), path_ext (3), file_exists (1), file_info (1), path_ext_set (1), path_real (1)
utils
installed.packages (4), lsf.str (4), data (3), packageDescription (3), prompt (3), tar (2), untar (2), browseURL (1), getFromNamespace (1), tail (1), timestamp (1)
checkmate
assert_character (7), assert_integerish (3), assert_matrix (2), assert_names (2), assert_numeric (2), check_file_exists (2), assert_list (1), assert_logical (1)
dplyr
left_join (8), rename (3), mutate (2), last_col (1), n (1), relocate (1), summarise (1)
memoise
memoise (13)
stats
dt (5), start (3), end (2), line (2)
treesitter
query_captures (3), node_text (2), parser (1), parser_parse (1), tree_root_node (1)
httr2
req_headers (2), request (2), resp_body_json (2), req_perform (1)
pbapply
pblapply (5)
tools
parse_Rd (2), Rd_db (2), CRAN_package_db (1)
brio
read_lines (2)
gert
git_clone (2)
jsonlite
read_json (2)
rvest
html_table (1), read_html (1)
tibble
new_tibble (2)
tokenizers
count_words (1), tokenize_words (1)
hms
hms (1)
piggyback
pb_download (1)
treesitter.r
language (1)
NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.
2. Statistical Properties
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
Details of statistical properties (click to open)
The package has:
- code in C++ (5% in 2 files) and R (95% in 20 files)
- 1 authors
- 6 vignettes
- no internal data file
- 17 imported packages
- 14 exported functions (median 14 lines of code)
- 218 non-exported functions in R (median 12 lines of code)
- 4 R functions (median 12 lines of code)
Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages The following terminology is used:
loc= "Lines of Code"fn= "function"exp/not_exp= exported / not exported
All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function
The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.
| measure | value | percentile | noteworthy |
|---|---|---|---|
| files_R | 20 | 79.8 | |
| files_src | 2 | 79.5 | |
| files_vignettes | 6 | 96.8 | |
| files_tests | 10 | 87.4 | |
| loc_R | 1905 | 81.1 | |
| loc_src | 91 | 13.6 | |
| loc_vignettes | 463 | 74.2 | |
| loc_tests | 694 | 77.2 | |
| num_vignettes | 6 | 97.6 | TRUE |
| n_fns_r | 232 | 90.5 | |
| n_fns_r_exported | 14 | 56.0 | |
| n_fns_r_not_exported | 218 | 93.0 | |
| n_fns_src | 4 | 21.1 | |
| n_fns_per_file_r | 7 | 79.5 | |
| n_fns_per_file_src | 2 | 27.8 | |
| num_params_per_fn | 2 | 8.2 | |
| loc_per_fn_r | 12 | 36.8 | |
| loc_per_fn_r_exp | 14 | 33.6 | |
| loc_per_fn_r_not_exp | 12 | 39.8 | |
| loc_per_fn_src | 12 | 38.9 | |
| rel_whitespace_R | 24 | 85.6 | |
| rel_whitespace_src | 26 | 21.8 | |
| rel_whitespace_vignettes | 20 | 57.6 | |
| rel_whitespace_tests | 21 | 77.1 | |
| doclines_per_fn_exp | 28 | 29.2 | |
| doclines_per_fn_not_exp | 0 | 0.0 | TRUE |
| fn_call_network_size | 187 | 87.0 |
2a. Network visualisation
Click to see the interactive network visualisation of calls between objects in package
3. goodpractice and other checks
Details of goodpractice checks (click to open)
3a. Continuous Integration Badges
GitHub Workflow Results
| id | name | conclusion | sha | run_number | date |
|---|---|---|---|---|---|
| 11727438106 | docker | skipped | f12ad7 | 23 | 2024-11-07 |
| 11727438100 | pkgcheck | NA | f12ad7 | 96 | 2024-11-07 |
| 11727438103 | R-CMD-check | success | f12ad7 | 292 | 2024-11-07 |
| 11727438110 | test-coverage | success | f12ad7 | 292 | 2024-11-07 |
| 11727438101 | Update pkgmatch data | NA | f12ad7 | 66 | 2024-11-07 |
3b. goodpractice results
R CMD check with rcmdcheck
rcmdcheck found no errors, warnings, or notes
Test coverage with covr
Package coverage: 79.93
Cyclocomplexity with cyclocomp
The following function have cyclocomplexity >= 15:
| function | cyclocomplexity |
|---|---|
| get_pkg_readme | 17 |
Static code analyses with lintr
lintr found no issues with this package!
Package Versions
| package | version |
|---|---|
| pkgstats | 0.2.0.47 |
| pkgcheck | 0.1.2.63 |
Editor-in-Chief Instructions:
This package is in top shape and may be passed on to a handling editor
@ropensci-review-bot assign @MargaretSiple-NOAA as editor
Assigned! @MargaretSiple-NOAA is now the editor
@ropensci-review-bot seeking reviewers
Please add this badge to the README of your package repository:
[](https://github.com/ropensci/software-review/issues/671)
Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news
(just a note: I put 'seeking reviewers' before putting in my editor checks but I have not forgotten them. I started them today and will continue tomorrow.)
Editor checks:
- [X] Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
- [X] Is the case for the package well made?
- [X] Is the reference index page clear (grouped by topic if necessary)?
- [X] Are vignettes readable, sufficiently detailed and not just perfunctory?
- [X] Fit: The package meets criteria for fit and overlap.
- [x] Installation instructions: Are installation instructions clear enough for human users? - Yes but see notes below.
- [x] Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling? Yes but see notes below. I think I just can't fully evaluate this until I can install docker and ollama on my personal computer and test the functions
- [X] Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
- [X] License: The package has a CRAN or OSI accepted license.
- [X] Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?
Editor comments
Nice package, @mpadge ! This will be useful for anyone working on package development or review. I wrote a few notes on what I saw during the editor check process-- they're longer than I usually write, but I figure these will probably make the reviewers' lives easier so it doesn't hurt to mention them early.
A few notes of mine from the editor check process:
-
The pkgdown page seems to indicate that the user can set the corpus to rOpenSci or to CRAN, but some functions only provide search results for rOpenSci, e.g.,
pkgmatch_bm25()requires the corpus to be rOpenSci-related. Some clarity about which functions apply to which package corpi (?) might be helpful. -
I got 1 note from running
devtools::check(): "Package suggested but not available for checking: 'piggyback'" -- I don't worry about these too much just wanted to flag it just in case. -
The ollama connection gave me some trouble at the start, so I imagine future users might struggle on this. I recommend revising the documentation a little bit to make it more obvious what needs to happen with ollama before users can get started: a) Consider adding a sentence to the top of the "Get started" article that indicates that docker and ollama installs need to happen first and foremost. Here, I would link to the "ollama" article. b) Change the title of the "ollama" article to say something like "Before you begin: ollama" or "Before you begin: ollama installation", so that when people are browsing, it's very clear to them that they'll need to read that article first.
-
The beginning of the Docker section in the ollama article should indicate that users need to have Docker installed as well. I learned from this process that my institution actually doesn't have a Docker license! This may impede editors/reviewers who work at NOAA. It sounds like they're working on getting a license but for now I don't have access to it.
-
I had some trouble getting test coverage my usual way (
covr::package_coverage()fails with an error about a temp file)... butdevtools::test()works great). Some tests are failing but I'm pretty sure that's because I don't have ollama properly installed on my machine. It might be nice to have a one-line check in one of the early vignettes that can show people whether they are missing any of the components need to run the fns.
Thanks @MargaretSiple-NOAA for really useful feedback! The issue linked above has details of changes made in response. I'd say the most important of those for reviewers of this package is that I've added tests/README explaining that tests can be run without having ollama installed or running anywhere.
@ropensci-review-bot Add @agricolamz as reviewer
@agricolamz added to the reviewers list. Review due date is 2025-01-02. Thanks @agricolamz for accepting to review! Please refer to our reviewer guide.
rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.
@agricolamz: If you haven't done so, please fill this form for us to update our reviewers records.
@ropensci-review-bot set due date for @agricolamz to 2025-01-31
Review due date for @agricolamz is now 31-January-2025
Hi @mpadge -- I was just revising my editor checks above after your revisions, and everything is looking good except some issues I had with devtools::test() - these are documented in the issue I logged on the pkgmatch github page (above). I don't know what they mean but I think we should have a resolution before reviewers are running similar checks.
Thanks @MargaretSiple-NOAA, I've fixed the issue linked above, so devtools::test() should now work for everybody.
Thank for for addressing that , @mpadge ! The tests now run without issues.
Two more quick things:
pkgmatch_similar_pkgs(..., corpus = "cran")always gives a warningError in matrix(emb[[what]], nrow = nrow, ncol = npkgs) : non-numeric matrix extent. This renders fine on the pkgdown page here so I suspect it is a computer-specific issue. Everything works as expected whencorpus = "ropensci".- Several of the tests fail and trace back to
parseNamespaceFile(pkg_name, package.lib = lp). I think it's fine to fix this when reviews are in but wanted to give you a heads-up while I'm still searching for a Reviewer 2.
@ropensci-review-bot add @Selbosh as reviewer
@Selbosh added to the reviewers list. Review due date is 2025-02-17. Thanks @Selbosh for accepting to review! Please refer to our reviewer guide.
rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.
@Selbosh: If you haven't done so, please fill this form for us to update our reviewers records.
@ropensci-review-bot Set due date for @Selbosh to 2025-02-21
Review due date for @Selbosh is now 21-February-2025
Dear @mpadge this is to mark the start of my EiC rotation. I'm reviewing all open issues and noting what I see.
I see one review is due about now and the second one on Feb 21st. It all looks good to me. I'll step back. Thanks!
@ropensci-review-bot Set due date for @agricolamz to 2025-02-21
Review due date for @agricolamz is now 21-February-2025
@mpadge Having some issues getting ollama to cooperate on Ubuntu; unfortunately this also causes the package to give a cryptic error message.
pkgmatch_similar_pkgs(input)
Error in names(out) <- nms :
'names' attribute [1] must be the same length as the vector [0]
Found some documentation here so will try to figure it out and get the review done soon.
I think I found one half of the the problem: set_ollama_url does not modify the system environment variables, so the system call to ollama list within ollama_check() will fail depending on how you have set up your shell environment variables (e.g. if you export OLLAMA_HOST=localhost:8888 but then open R in another session, then the exported variable is lost).
In addition, ollama_check() fails and gives the error above if ollama is running, but without any models downloaded (it's trying to iterate over a list of length zero, I guess?)
So the current sequence, after setting ollama serve running on localhost:8888 in another shell and manually pulling the two embeddings in the vignette, is:
Sys.getenv('OLLAMA_URL')
## [1] ""
library(pkgmatch)
ollama_check()
## Error: could not connect to ollama app, is it running?
## Error in out[[1]] : subscript out of bounds
## In addition: Warning message:
## In system("ollama list", intern = TRUE) :
## running command 'ollama list' had status 1
set_ollama_url('127.0.0.1:8888')
ollama_check()
## Error: could not connect to ollama app, is it running?
## Error in out[[1]] : subscript out of bounds
## In addition: Warning message:
## In system("ollama list", intern = TRUE) :
## running command 'ollama list' had status 1
Sys.setenv(OLLAMA_HOST = '127.0.0.1:8888')
ollama_check()
## [1] TRUE
But then I still do not get much farther, because
pkgmatch_similar_pkgs(input)
## Error in `httr2::req_perform()`:
## ! Failed to perform HTTP request.
## Caused by error in `curl::curl_fetch_memory()`:
## ! Couldn't connect to server [localhost]: Failed to connect to localhost port 11434 after 1 ms: Connection refused
## Run `rlang::last_trace()` to see where the error occurred.
which implies that there is least one other place that 11434 is hard-coded. A quick search found
https://github.com/ropensci-review-tools/pkgmatch/blob/21c1e138535f86de6501ec8549f7c8a6f6ac4bf9/R/embeddings.R#L240
which unfortunately means I can't proceed on Ubuntu without fixing this bug first.
Package Review
Hello @mpadge and @MargaretSiple-NOAA! Thank you giving me the opportunity to review the package. It is my first package review, and I am still learning the process, so I hope that my feedback is useful. As mentioned above, I had a couple of issues getting started, but have found some fixes that should with any luck make it easier for others who encounter similar problems.
Please let me know if I can clarify anything. Here are my comments:
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- Briefly describe any working relationship you have (had) with the package authors.
- [x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).
Documentation
The package includes all the following forms of documentation:
- [ ] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
- [x] Installation instructions: for the development version of package and any non-standard dependencies in README
- [x] Vignette(s): demonstrating major functionality that runs successfully locally
- [x] Function Documentation: for all exported functions
- [x] Examples: (that run successfully locally) for all exported functions
- [x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with
URL,BugReportsandMaintainer(which may be autogenerated viaAuthors@R).
Functionality
- [x] Installation: Installation succeeds as documented.
- [x] Functionality: Any functional claims of the software have been confirmed.
- [x] Performance: Any performance claims of the software have been confirmed.
- [x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- [x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.
Estimated hours spent reviewing: 5
- [x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.
Review Comments
Documentation
In the README, the statement of need should be more prominent; it's not obvious who the target audience of the package is or why the package is needed. It would be great if you could explain early on, clearly, what problem pkgmatch solves, maybe giving examples of use cases from the start.
I think that you could improve the documentation by making it more self-contained and making it easier for the user to find information quickly. Some information that should be in the README is only in a vignette, and some information that should be at the start of one vignette is instead left as a link to another vignette. The dependence on LLMs is also somewhat buried: this could be a make-or-break feature for many users, but it's only mentioned at the bottom of the Installation section. I can imagine it could be a selling point that pkgmatch exploits LLMs, but it could also be a deal-breaker for some users either due to technical limitations or ethical concerns.
The package relies on ollama but it's not explained in the README what this is or why the package should depend on it, or whether you can use some functions without it. There is a function called ollama_check() that ostensibly checks for and installs ollama: add this to the README, as it seems like the first thing you would want to run after loading the package! "Procedures for setting that up are described in a separate vignette" is not super helpful without a direct link to or giving the name of said vignette! Ideally I'd put much of this information in the README, because it could be a decider on whether somebody can/will install the package in the first place. Given we are talking about local LLMs here, not everybody will know what their computational demands are, so it would be helpful to give some general indication of system requirements in the 'Before you begin' vignette and maybe in the README.
Generally speaking, I think the documentation assumes a bit too much knowledge on the part of the user, by not motivating the task clearly enough, not explaining what some functions do or what some terms mean. There are one or two functions documented in terms of what they are not rather than what they are, which can be a bit confusing (e.g. pkgmatch_bm25_fn_calls()). Some function descriptions (e.g. pkgmatch_bm25()) simply comprise an external Web link, which is not ideal for offline use, nor for users who want to know what a function does without having to click around. For the same functions, the current Titles in the documentation are too long or specific and would be better off as Descriptions, and the current Descriptions as Details or Notes.
The purpose of some of the vignettes is unclear. For example, "Good-enough practices for language model packages": who is this for, exactly? Why does it need to be a vignette? Why do "Data caching and updating" or "Why local LMs?" need their own vignettes, instead of being included as sections in a larger general guide? This information is surely useful, but it could be merged into a single vignette and/or the README, maybe with a dedicated vignette for developers, to better signpost the user to the information they need (or don't need).
Functionality
I was able to install the package (v0.4.3.018) first try on WSL2 (Ubuntu on Windows 11) and didn't face any issues with building vignettes or R CMD CHECK. All tests passed successfully.
Not really sure where to start, I decided to run the example in the README:
pkgmatch_similar_pkgs(input)
This threw an error:
Error in names(out) <- nms :
'names' attribute [1] must be the same length as the vector [0]
I then remembered that I hadn't started the ollama server, so I go to the command line and run ollama serve, to get
Error: listen tcp 127.0.0.1:11434: bind: address already in use
This might be quite esoteric but could be worth mentioning in the vignette. A bit of detective work suggests that the error message above is in fact coming from ollama_check(), which when run throws exactly the same error. This left me a bit stuck, due to a couple of issues in the error handling, and due to a bug in the embeddings.R file. I made a local patch to fix it on my end and was able to proceed from here.
Once I got the package working, the benefits of it seem clear! I can see this being immensely useful to package developers and advanced users who are looking for existing functionality, either for specialist tasks or trying not to reinvent the wheel while developing their own packages. However, as described above, the documentation comes across as being aimed at people who are already familiar with the package, rather than to new users who are trying to get it to work for the first time.
Another interface issue I note is that with most function calls, there is a slight delay and we see that data is being downloaded in the background:
pkgmatch_similar_pkgs (".", corpus = "cran")
## [100%] Downloaded 231213455 bytes...
## [100%] Downloaded 36834164 bytes...
## [100%] Downloaded 3948804 bytes...
## [100%] Downloaded 13016426 bytes...
but what data is being downloaded, why and where to? Is it the corpus? Is it the embeddings? Maybe we should ask the user if they'd like to use up their bandwidth and disk space on this, or reassure them that it's going to a temporary directory and/or can easily be cleaned out at the end of the session? Which functionality will work offline, if any? This could be made a bit clearer on the first use.
The browse = TRUE function threw a further error for me, possibly related to using VS Code.
pkgmatch_similar_fns (input, browse = TRUE)
# VSCode WebView only supports showing local http content.
# Opening in external browser...
# Browsing https://docs.ropensci.org/charlatan/reference/SequenceProvider
# Error in vapply(urls, utils::browseURL, integer(1L)) :
# values must be length 1,
# but FUN(X[[1]]) result is length 0
However, the help page successfully opened in the browser, so I'm not sure what the issue is here. I was able to run other examples successfully locally, though nearly all of them are 'Don't run' because of the external dependencies; I wonder if there is the possibility to embed some toy embeddings and system.files in the package?
Other comments
In the examples and README, you have spaces between function names and the brackets, e.g. fun () instead of fun(), which is not recommended in most style guides.
Overall I think this is a good idea for a package with real potential, but the main thing holding it back is selling to the user what it's for and giving an easy way to get started. To do this, I'd suggest filling in the gaps and maybe focus on having a vignette with an extended use case that shows off the package's capabilities. There are also one or two bugs or edge cases that thwarted me personally, but might not come up for other users, and so this might have coloured my impressions of the package.
In the long run, I'd suggest clearly marking out which functionality is dependent on which external tools, as then a new user might be able to explore some features even if they have issues with others or do not want to set them up yet.
Thank you very much @Selbosh for your review. You've really immediately picked up the big weakness of package design thus far: It was pulled together for a very specific use case, with documentation mostly hobbled together around that narrow vision, rather than me having thought nearly as much as I should have about general usage. In short: I agree with all of your concerns, and will address them hopefully within the coming fortnight. You should see several new issues in the repo linked immediately below here.