submission: pangoling: Access to word predictability using large language (transformer) models
Submitting Author Name: Bruno Nicenboim
Submitting Author Github Handle: @bnicenboim
Repository: https://github.com/bnicenboim/pangoling
Version submitted: 0.0.0.9005
Submission type: Standard
Editor: @karthik
Reviewers: @lisalevinson, @utkuturk
Due date for @lisalevinson: 2023-05-24
Due date for @utkuturk: 2023-05-29
Archive: TBD
Version accepted: TBD
Language: en
- Paste the full DESCRIPTION file inside a code block below:
Package: pangoling
Type: Package
Title: Access to Large Language Model Predictions
Version: 0.0.0.9005
Authors@R: c(
person("Bruno", "Nicenboim",
email = "[email protected]",
role = c( "aut","cre"),
comment = c(ORCID = "0000-0002-5176-3943")),
person("Chris", "Emmerly", role = "ctb"),
person("Giovanni", "Cassani", role = "ctb"))
Description: Access to word predictability using large language (transformer) models.
URL: https://bruno.nicenboim.me/pangoling, https://github.com/bnicenboim/pangoling
BugReports: https://github.com/bnicenboim/pangoling/issues
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: false
Config/reticulate:
list(
packages = list(
list(package = "torch"),
list(package = "transformers")
)
)
Imports:
data.table,
memoise,
reticulate,
tidyselect,
tidytable (>= 0.7.2),
utils,
cachem
Suggests:
rmarkdown,
knitr,
testthat (>= 3.0.0),
tictoc,
covr,
spelling
Config/testthat/edition: 3
RoxygenNote: 7.2.3
Roxygen: list(markdown = TRUE)
Depends:
R (>= 4.1.0)
VignetteBuilder: knitr
StagedInstall: yes
Language: en-US
Scope
- Please indicate which category or categories from our package fit policies this package falls under (please check an appropriate box below; if you are unsure, we suggest you make a pre-submission inquiry):
- [ ] data retrieval
- [ ] data extraction
- [ ] data munging
- [ ] data deposition
- [ ] data validation and testing
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [x] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [x] text analysis
- Explain how and why the package falls under these categories (briefly, 1-2 sentences):
The package is built on top of the Python package `transformers`, and it offers some basic functionality for text analysis, including tokenization and perplexity calculation. Crucially, `pangoling` also offers word predictability, which is widely used as a predictor in psycho- and neurolinguistics and is not trivial to obtain. Also, `transformers` works with "tokens" rather than "words", so `pangoling` takes care of the mapping between the tokens and the target words (or even phrases).
- Who is the target audience and what are scientific applications of this package?
This is mostly for psycho- and neurolinguists who use word predictability as a predictor in their research, such as in ERP/EEG and reading studies.
- Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
Another R package that acts as a wrapper for `transformers` is `text`. However, `text` is more general, and its focus is on Natural Language Processing and Machine Learning. `pangoling` is much more specific, and its focus is on measures used as predictors in analyses of data from experiments, rather than NLP. `text` doesn't allow generating `pangoling`'s output in a straightforward way and, in fact, I'm not sure whether it's even possible to get token probabilities from `text`, since it seems more limited than the Python package `transformers`.
- (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
NA
- If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
#573
- Explain reasons for any `pkgcheck` items which your package is unable to pass.
`pkgcheck` fails only because of the use of `<<-`. But this is done in `.onLoad` as recommended by reticulate. Also see this issue.
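For context, this is roughly the delayed-import pattern described in reticulate's vignette on using Python in R packages (a sketch of that generic pattern, not pangoling's exact code), which is where the `<<-` assignment comes from:

```r
# Module objects are created as NULL at the package level and filled in
# inside .onLoad(); hence the use of `<<-`.
transformers <- NULL
torch <- NULL

.onLoad <- function(libname, pkgname) {
  # delay_load = TRUE postpones initializing Python until the module is first used
  transformers <<- reticulate::import("transformers", delay_load = TRUE)
  torch <<- reticulate::import("torch", delay_load = TRUE)
}
```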
Technical checks
Confirm each of the following by checking the box.
- [x] I have read the rOpenSci packaging guide.
- [x] I have read the author guide and I expect to maintain this package for at least 2 years or to find a replacement.
This package:
- [x] does not violate the Terms of Service of any service it interacts with.
- [x] has a CRAN and OSI accepted license. -> I think so, it has an MIT license
- [x] contains a README with instructions for installing the development version.
- [x] includes documentation with examples for all functions, created with roxygen2.
- [x] contains a vignette with examples of its essential functions and uses.
- [x] has a test suite.
- [x] has continuous integration, including reporting of test coverage.
Publication options
- [x] Do you intend for this package to go on CRAN?
- [ ] Do you intend for this package to go on Bioconductor?
- [ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
- (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
- (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
- (Please do not submit your package separately to Methods in Ecology and Evolution)
Code of conduct
- [x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.
Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type `@ropensci-review-bot help` for help.
:rocket:
Editor check started
:wave:
Checks for pangoling (v0.0.0.9005)
git hash: 543c11bd
- :heavy_check_mark: Package name is available
- :heavy_check_mark: has a 'codemeta.json' file.
- :heavy_check_mark: has a 'contributing' file.
- :heavy_check_mark: uses 'roxygen2'.
- :heavy_check_mark: 'DESCRIPTION' has a URL field.
- :heavy_check_mark: 'DESCRIPTION' has a BugReports field.
- :heavy_check_mark: Package has at least one HTML vignette
- :heavy_check_mark: All functions have examples.
- :heavy_check_mark: Package has continuous integration checks.
- :heavy_multiplication_x: Package coverage is 0.9% (should be at least 75%).
- :heavy_check_mark: R CMD check found no errors.
- :heavy_check_mark: R CMD check found no warnings.
Important: All failing checks above must be addressed prior to proceeding
Package License: MIT + file LICENSE
1. Package Dependencies
Details of Package Dependency Usage (click to open)
The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.
type | package | ncalls |
---|---|---|
internal | base | 97 |
internal | pangoling | 41 |
internal | graphics | 2 |
internal | stats | 1 |
imports | tidytable | 20 |
imports | reticulate | 5 |
imports | memoise | 3 |
imports | cachem | 2 |
imports | data.table | 1 |
imports | tidyselect | 1 |
imports | utils | NA |
suggests | rmarkdown | NA |
suggests | knitr | NA |
suggests | testthat | NA |
suggests | tictoc | NA |
suggests | covr | NA |
suggests | spelling | NA |
linking_to | NA | NA |
Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.
base
lapply (19), length (7), c (6), dim (6), paste0 (6), t (6), unlist (4), by (3), list (3), names (3), seq_len (3), which (3), do.call (2), getOption (2), matrix (2), ncol (2), rep (2), seq_along (2), sum (2), unname (2), as.list (1), floor (1), for (1), grepl (1), lengths (1), mode (1), new.env (1), options (1), rownames (1), split (1), switch (1), vector (1)
pangoling
create_tensor_lst (5), lst_to_kwargs (5), char_to_token (4), encode (4), get_id (4), get_vocab (4), get_word_by_word_texts (2), masked_lp_mat (2), causal_config (1), causal_lp (1), causal_lp_mats (1), causal_mat (1), causal_next_tokens_tbl (1), causal_preload (1), causal_tokens_lp_tbl (1), chr_detect (1), masked_config (1), num_to_token (1), word_lp (1)
tidytable
map_chr. (4), map2 (3), map. (2), pmap. (2), arrange. (1), map (1), map_dbl. (1), map_dfr (1), map_dfr. (1), map2_dbl. (1), pmap_chr (1), relocate (1), tidytable (1)
reticulate
py_to_r (5)
memoise
memoise (3)
cachem
cache_mem (2)
graphics
text (2)
data.table
chmatch (1)
stats
lm (1)
tidyselect
everything (1)
NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.
2. Statistical Properties
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
Details of statistical properties (click to open)
The package has:
- code in R (100% in 7 files) and
- 1 authors
- 2 vignettes
- no internal data file
- 7 imported packages
- 14 exported functions (median 9 lines of code)
- 57 non-exported functions in R (median 9 lines of code)
Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:
- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported
All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the `checks_to_markdown()` function.
The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.
measure | value | percentile | noteworthy |
---|---|---|---|
files_R | 7 | 45.7 | |
files_vignettes | 2 | 85.7 | |
files_tests | 7 | 86.4 | |
loc_R | 733 | 59.3 | |
loc_vignettes | 154 | 39.9 | |
loc_tests | 293 | 63.6 | |
num_vignettes | 2 | 89.2 | |
n_fns_r | 71 | 67.2 | |
n_fns_r_exported | 14 | 56.3 | |
n_fns_r_not_exported | 57 | 71.1 | |
n_fns_per_file_r | 6 | 75.6 | |
num_params_per_fn | 4 | 54.6 | |
loc_per_fn_r | 9 | 24.3 | |
loc_per_fn_r_exp | 10 | 21.6 | |
loc_per_fn_r_not_exp | 9 | 27.1 | |
rel_whitespace_R | 10 | 45.0 | |
rel_whitespace_vignettes | 42 | 47.5 | |
rel_whitespace_tests | 12 | 48.3 | |
doclines_per_fn_exp | 44 | 55.2 | |
doclines_per_fn_not_exp | 0 | 0.0 | TRUE |
fn_call_network_size | 72 | 73.5 |
2a. Network visualisation
Click to see the interactive network visualisation of calls between objects in package
3. goodpractice and other checks
Details of goodpractice checks (click to open)
3a. Continuous Integration Badges
GitHub Workflow Results
id | name | conclusion | sha | run_number | date |
---|---|---|---|---|---|
4232217801 | pages build and deployment | success | c6ce46 | 16 | 2023-02-21 |
4232181659 | pkgdown | success | 543c11 | 44 | 2023-02-21 |
4232181660 | R-CMD-check | success | 543c11 | 37 | 2023-02-21 |
4232181654 | test-coverage | success | 543c11 | 39 | 2023-02-21 |
3b. goodpractice results
R CMD check with rcmdcheck
rcmdcheck found no errors, warnings, or notes
Test coverage with covr
Package coverage: 0.89
The following files are not completely covered by tests:
file | coverage |
---|---|
R/tr_causal.R | 0% |
R/tr_masked.R | 0% |
R/tr_utils.R | 0% |
R/utils.R | 0% |
R/zzz.R | 0% |
Cyclocomplexity with cyclocomp
The following function has cyclocomplexity >= 15:
function | cyclocomplexity |
---|---|
word_lp | 16 |
Static code analyses with lintr
lintr found the following 39 potential issues:
message | number of times |
---|---|
Avoid library() and require() calls in packages | 5 |
Lines should not be more than 80 characters. | 34 |
Package Versions
package | version |
---|---|
pkgstats | 0.1.3 |
pkgcheck | 0.1.1.11 |
Editor-in-Chief Instructions:
Processing may not proceed until the items marked with :heavy_multiplication_x: have been resolved.
Hi, " Package coverage is 0.9% (should be at least 75%)." -> this seems to be an error. See that the coverage is 94%. Maybe it is because the tests skip almost everything, if there is no python installed? (As recommended by reticulate).
Also when I run pkgcheck::pkgcheck()
I get the following:
✔ Package coverage is 94.8%.
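(For readers following along: the skipping pattern reticulate recommends for package tests looks roughly like the helper below; this is a sketch of that generic pattern with a hypothetical helper name, not pangoling's actual test code.)

```r
# Tests that need Python modules are skipped when they are unavailable,
# e.g. on CRAN or on a machine without Python, which deflates coverage.
skip_if_no_transformers <- function() {
  if (!reticulate::py_module_available("transformers")) {
    testthat::skip("transformers not available for testing")
  }
}
```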
Thanks @bnicenboim for your full submission and for explaining the issue with test coverage. Your explanation makes sense, so we can move forward. I'll start searching for a handling editor.
In the meantime you may want to start thinking of potential reviewers to suggest to the handling editor.
Is there a pool of potential reviewers that I can have access to?
I guess authors are mostly guided by their knowledge of their intended audience. But for inspiration see how editors look for reviewers. Editors have access to a private airtable database, but often we look elsewhere.
Dear @bnicenboim, I'm sorry for the extraordinary delay in finding a handling editor. Most editors are busy, some are handling more than one package, and the few who are available are not yet due to handle another submission. Please hold on a bit longer.
ok, thanks for letting me know, no problem.
@ropensci-review-bot assign @karthik as editor
Assigned! @karthik is now the editor
:wave: @bnicenboim I'll follow up with next steps in a few days. In the meantime I'll start pinging a few reviewers.
Hi, any news about the next steps?
Hi @bnicenboim Apologies for the delay, I've been out sick this past week. I'll update this thread in the next few days.
Editor checks:
- [x] Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
- [x] Is the case for the package well made?
- [x] Is the reference index page clear (grouped by topic if necessary)?
- [x] Are vignettes readable, sufficiently detailed and not just perfunctory?
- [x] Fit: The package meets criteria for fit and overlap.
- [x] Installation instructions: Are installation instructions clear enough for human users?
- [x] Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
- [x] Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
- [x] License: The package has a CRAN or OSI accepted license.
- [x] Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?
Editor comments
No additional comments at this time. I'm looking for reviewers at the moment, but if you've got any suggestions for people with expertise but no conflict, please suggest names.
@ropensci-review-bot seeking reviewers
Please add this badge to the README of your package repository:
[](https://github.com/ropensci/software-review/issues/575)
Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news
I really don't know about reviewers, I guess someone involved in the packages named here: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Or maybe someone based on the reverse imports of reticulate: https://cloud.r-project.org/web/packages/reticulate/index.html
@ropensci-review-bot assign @lisalevinson as reviewer
@lisalevinson added to the reviewers list. Review due date is 2023-05-24. Thanks @lisalevinson for accepting to review! Please refer to our reviewer guide.
rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.
@lisalevinson: If you haven't done so, please fill this form for us to update our reviewers records.
@ropensci-review-bot assign @utkuturk as reviewer
@utkuturk added to the reviewers list. Review due date is 2023-05-29. Thanks @utkuturk for accepting to review! Please refer to our reviewer guide.
rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.
@utkuturk: If you haven't done so, please fill this form for us to update our reviewers records.
:calendar: @lisalevinson you have 2 days left before the due date for your review (2023-05-24).
:calendar: @utkuturk you have 2 days left before the due date for your review (2023-05-29).
Hi @bnicenboim, @karthik, @lisalevinson,
Sorry, my review took more time than expected. Tremendous thanks to @bnicenboim for his efforts writing this package.
My review is just some notes from a regular user. I had already used this package before this review, but I used another laptop to test things out. I plan on using this package for the foreseeable future as well.
I am happy to continue testing things as the package develops. :) I am also happy to help with writing a community vignette on a minimal use of reticulate to get everything started with miniconda and conda environments.
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- Briefly describe any working relationship you have (had) with the package authors.
- [x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).
Documentation
The package includes all the following forms of documentation:
- [x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
- [x] Installation instructions: for the development version of package and any non-standard dependencies in README
Even though the installation of the package is straightforward, I do not think installing the package itself makes it immediately usable. I have some notes on this. See below.
- [x] Vignette(s): demonstrating major functionality that runs successfully locally
- [x] Function Documentation: for all exported functions
- [x] Examples: (that run successfully locally) for all exported functions
- [ ] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).
Functionality
- [x] Installation: Installation succeeds as documented.
- [x] Functionality: Any functional claims of the software have been confirmed.
- [x] Performance: Any performance claims of the software have been confirmed.
- [ ] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
I am not sure what to look for here. I did not check this as complete mainly because the tests are skipped for transformers, and the author did not include additional unit tests for his functions (at least I was not able to find them). I am happy to edit this part if I am corrected on this issue. This is simply my ignorance.
- [ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.
The sole reason it does not conform to the packaging guidelines is that the package does not include a CONTRIBUTING file or contribution guidelines in the README. It includes the necessary `URL`, `BugReports` and `Maintainer` fields in the DESCRIPTION.
Estimated hours spent reviewing: 10
- [x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.
Review Comments
My main comments will be around two topics: (1) installation of the dependencies, (2) use of other arguments for the exported functions.
Installation:
Clarity Problem wrt Python
The process of setting up the package from GitHub and installing it locally went smoothly, without encountering any problems.
However, there is a little hiccup that occurs with both installation methods: the package does not automatically install the necessary Python modules that it uses. It only does so when you try to run R code that requires those modules. The good news is that it takes care of this installation automatically once you have successfully set up your Python in R.
This behavior could be highly valuable for users who are already advanced and will have no problem using Python in R, as it grants them control over their Python modules. But I do not think this is true of many psycholinguists or linguists, who are the target audience for the package. There seems to be a lack of clarity regarding the overall process with respect to Python: it is not evident whether the package utilizes its own Python environment, relies on an API, or requires any adjustments to the existing Python installation.
Suggestions
I know both of these suggestions go somewhat against the rOpenSci guidelines, since they advise against unnecessary start-up messages. However, I think these are not as unnecessary as the examples given there.
- A warning message
To enhance transparency, it would be helpful to have a concise warning message displayed when the pangoling library is initially attached, similar to the cmdstanr package. This warning message could include the following components:
  - Verification and declaration of the python/conda/miniconda installation status (using `reticulate::py_discover_config()$python_versions`, or maybe simpler code using the base `system2()` function).
  - Confirmation and specification of the conda/miniconda/python environment being utilized (`reticulate::use_python(PATH)`).
  - A recommendation to create a new python/conda environment using the reticulate package (`conda_create("maze_item_check"); use_condaenv("maze_item_check")`).
  - Verification of the availability of the torch and transformers modules within the environment, prompting users to install them if they are not found.
It is reasonable to assume that users of this package are already familiar with using Python via R. However, it would indeed be beneficial to explicitly state this assumption.
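To make the suggestion concrete, here is a minimal sketch of what such a start-up check could look like (the message wording and structure are hypothetical, not a proposal for pangoling's actual code):

```r
# Hypothetical .onAttach() sketch: report whether Python and the required
# modules can be found, and point the user to reticulate otherwise.
.onAttach <- function(libname, pkgname) {
  if (!reticulate::py_available(initialize = FALSE)) {
    packageStartupMessage(
      "No Python installation found yet. ",
      "See ?reticulate::install_miniconda to set one up."
    )
    return(invisible(NULL))
  }
  for (mod in c("torch", "transformers")) {
    if (!reticulate::py_module_available(mod)) {
      packageStartupMessage(
        "Python module '", mod, "' was not found; you can install it with ",
        "reticulate::py_install(\"", mod, "\")."
      )
    }
  }
}
```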
- A vignette
Given that the package aims to serve as a wrapper for these python tools, targeting individuals like myself who may not possess advanced technical knowledge, it would be advantageous to provide a vignette that offers guidance on getting started with pangoling. This vignette could include step-by-step instructions on the following fundamental aspects:
- Basic installation of miniconda.
- Creation of a new environment within miniconda.
- Module installation within that environment using R code.
By including such a vignette, users with limited technical background would be able to grasp the essential procedures involved in setting up and utilizing pangoling effectively.
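For instance, the vignette could walk through something like the following sketch (assuming the user starts from scratch; the environment name "pangoling-env" is just an example):

```r
# One-time setup sketch: install miniconda, create a dedicated conda
# environment, and install the Python modules pangoling relies on.
install.packages("reticulate")
reticulate::install_miniconda()      # skip if conda/miniconda is already installed
reticulate::conda_create("pangoling-env")
reticulate::conda_install("pangoling-env",
                          packages = c("torch", "transformers"),
                          pip = TRUE)

# In each new session, point reticulate at that environment before using pangoling:
reticulate::use_condaenv("pangoling-env", required = TRUE)
library(pangoling)
```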
Another clarity issue: Downloading packages
It is essential to ensure clarity regarding the download of the models to the local system and to emphasize that the package does not operate through an API or an online instantiation. While this may be common knowledge for regular users of transformers or Python, it is important to remember that not all psycholinguists possess the same level of familiarity with these tools.
To address this, it would be beneficial to explicitly state some of the following points in either README:
- The models included in the package will be downloaded and stored locally on the user's system.
- The package does not rely on an API for model usage.
Advanced Topics in Vignettes
One aspect that I noticed was the absence of information regarding the usage of NULL arguments in the exported functions. It would be greatly appreciated if the authors could provide examples illustrating the use of NULL arguments. Although it is understood that each pretrained model may have unique configurations, having basic examples directly from the authors demonstrating how these arguments are implemented in the code would be highly beneficial.
Currently, the function description includes a link for more details, directing us to the "from_pretrained" configs. However, supplementing it with concrete examples would assist users in understanding the practical application of non-NULL values for those arguments in the context of this package. If the transformers or text package has vignettes covering this, links to those might be useful. This additional guidance from the authors would be a valuable addition to the documentation.
Thanks so much for your work on this package @bnicenboim! Sorry this review is a bit later than expected. I have made a lot of comments in my review on my personal user interface/workflow preferences, but the package is already very useful “as is”. Let me know if you have any questions, or would like any help with some of the vignette/documentation suggestions I have made - I’m happy to help, though I can’t promise a speedy turnaround.
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- [x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).
I have no prior personal or working relationship with the package creator.
Documentation
The package includes all the following forms of documentation:
- [x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
- [x] Installation instructions: for the development version of package and any non-standard dependencies in README
- [ ] Vignette(s): demonstrating major functionality that runs successfully locally
The Rmd vignettes exist (and are viewable as rendered on the package website), but they did not seem to install properly, either from the local package folder or from GitHub, even when building vignettes is set to true. I believe this may be related to an issue that occurs due to the inst/docs
folder being deleted on build (there isn’t such a folder in the repo). Running vignette(package = "pangoling")
yields “no vignettes found” for me.
- [x] Function Documentation: for all exported functions
- [x] Examples: (that run successfully locally) for all exported functions
Examples run, but one has a typo, described below.
- [x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).
Functionality
- [x] Installation: Installation succeeds as documented.
- [x] Functionality: Any functional claims of the software have been confirmed.
- [x] Performance: Any performance claims of the software have been confirmed.
- [ ] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
I’m not experienced with testing, but there were some failures and warnings noted below. There did not seem to be many tests.
- [x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.
Estimated hours spent reviewing: 11
- [x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer (“rev” role) in the package DESCRIPTION file.
Review Comments
Thanks so much for creating this package - I demo’d it to my students when it first came out to show them ways to do “Python” things within the R environment, and it was a great example for that. I also am very grateful to have a tool that would allow me to more easily carry out these kinds of analyses from R, rather than passing data over from Python projects (as I had previously been doing). It seems similar in spirit to the Python package minicons.
Further Comments on Documentation
I did not have a "fresh" system without Python and `reticulate` already set up to test on. I'm not sure how automated it is in current workflows to set that up, and there isn't any specific guidance within the package. If the target population is (psycho)linguists who aren't familiar with Python, and might not even have Python installed, then they would probably need more guidance through this process.
Given the variety of inputs and outputs, it would be very helpful in the documentation/vignettes to have a table summarizing these for each function so it’s easier to tell which one you need for which purpose.
Workflow
One general comment is that the functions could be a little easier to use for the output that I usually need to generate for analyses. In my work, I'm typically gathering probabilities/surprisals for all of the words in a sentence or passage, not just one target word. Thus the Python scripts I have been using are based on specifying a complete sentence ("The apple fell far from the tree.") and either specifying a target word position or getting probabilities across all of the word positions. This is awkward to do with the `pangoling` functions. It is a bit more straightforward for the causal model functions, but since the masked model functions require arguments for "pre", "target", and "post", it is more cumbersome. I would prefer an argument structure more like the following (starting counts at 1 as "native" in R):
causal_lp(s = "The apple doesn't fall far from the tree.", target = 8)
masked_lp(s = "The apple doesn't fall far from the tree.", mask = 2, fill = c("apple", "pear"))
The default "fill" for `masked_lp` could be the word present in the original sentence, so that the following would use the mask "banana", and would be very similar to running the `causal_lp` function:
masked_lp(s = "The banana doesn't fall far from the tree.", mask = 2)
Perhaps functions like these could be separate ones from those already provided.
Reporting and Documentation of Probabilities
Based on some independent testing, it seems that the log probabilities are natural log, but I couldn't find that documented anywhere. It is fairly common to use base-2 log for surprisals (in bits), so which log is used should at least be indicated clearly. It would be great to make this an argument which can be specified, e.g. `base =`, for the various functions.
On this note, it seems that most folks use surprisals in the psycholinguistic literature, and they are a bit easier to interpret than negative log probabilities. It would be nice to offer a "surprisal" option, which would usually be negative log2 probability.
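For reference, if the returned log probabilities are indeed natural-log (as my testing suggests), converting them to base-2 surprisal is a one-line change of base:

```r
# lp: natural-log probability as returned by causal_lp(); the value is taken
# from the "tree." example shown further below.
lp <- c("tree." = -1.581758)
surprisal_bits <- -lp / log(2)   # log2(p) = ln(p) / ln(2)
surprisal_bits
#>    tree. 
#> 2.281994 
```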
Output Formats
The outputs of the functions are in a variety of structures, which is a bit confusing. It would be nice to have more consistency, though I can see to some extent why in some cases there is a named vector and in others a tidytable. I think it would be helpful for users to have a vignette with some more examples of working with the output to get it into more common shapes for next steps in an analysis, such as using `enframe` to turn a named vector into a tibble/tidytable if desired.
Non-English Usage and Examples
As a linguistics-oriented package, it would be helpful to include some vignette examples that cover languages other than English, and to mention any potential limitations for languages with different orthographic/tokenization issues. For example, Mandarin Chinese does not use spaces, but it might be possible to use the functions if spaces (or another delimiter) are indicated where word boundaries are desired. There are other issues that arise for the tokenization of these languages, however, as most (all?) of the models will be based on single-character tokenization, which may not provide the best probabilities for words that are predominantly multi-character. These things are a little clearer if one is coding the Python for transformers directly, but they are obscured by the wrapper functions, so it seems relevant to at least point towards other resources on them.
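As a purely illustrative sketch of this point (the Chinese causal model name below, "uer/gpt2-chinese-cluecorpussmall", is my own example and is not discussed in the package), one could mark the desired word boundaries manually and pass the resulting "words" as usual:

```r
# Hypothetical sketch: Mandarin has no spaces, so word boundaries are decided
# by the analyst; each element of `x` is then treated as one "word".
causal_lp(
  x = c("他", "喜欢", "读", "书。"),  # "He likes reading books."
  model = "uer/gpt2-chinese-cluecorpussmall"
)
```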
library(pangoling)
library(tidyverse)
Specific Function Comments
causal_lp
The `causal_lp()` function extracts log probabilities from a language model, by default from GPT-2.
I find it confusing that the `x` argument can be used in two different ways, either as a full sentence where each word is a target, or as targets with a separate `l_contexts` prefix. The examples don't make it clear how a non-atomic `x` vector would work with `l_contexts`:
causal_lp(
x = "tree.",
l_contexts = "The apple doesn't fall far from the",
.by = NULL, # it's ignored anyways
model = "gpt2"
)
## Processing using causal model ''...
## Processing a batch of size 1 with 10 tokens.
## Text id: 1
## `The apple doesn't fall far from the tree.`
## tree.
## -1.581758
causal_lp(
x = c("branch.", "tree."),
l_contexts = "The apple doesn't fall far from the",
.by = NULL, # it's ignored anyways
model = "gpt2"
)
## Processing using causal model ''...
## Processing a batch of size 1 with 10 tokens.
## Processing a batch of size 1 with 10 tokens.
## Text id: 1
## `The apple doesn't fall far from the branch.`
## Text id: 2
## `The apple doesn't fall far from the tree.`
## branch. tree.
## -11.221713 -1.581758
causal_lp(
x = c("branch.", "tree."),
model = "gpt2"
)
## Processing using causal model ''...
## Processing a batch of size 1 with 5 tokens.
## Text id: 1
## `branch. tree.`
## branch. tree.
## NA -13.32788
The last one doesn’t make much sense, but it seems odd to me that it looks very similar to the one above it, yet what it’s actually checking is the probability of “tree.” following “branch.”.
I don't really understand the `.by` documentation. The documentation seems to suggest it indicates the text separator, but I think it is only used as a grouping variable if one has multiple sentences in a dataframe.
The documentation is also not clear about what "batch size" is and how it should/could be used.
There is a typo in the 'run examples' where 'tree.' is given twice - `l_contexts` should just be "The apple doesn't fall far from the", as follows:
output_causal_lp <- causal_lp(
x = "tree.",
l_contexts = "The apple doesn't fall far from the",
.by = NULL, # it's ignored anyways
model = "gpt2"
)
## Processing using causal model ''...
## Processing a batch of size 1 with 10 tokens.
## Text id: 1
## `The apple doesn't fall far from the tree.`
Some ways to get long dataframe output from named vectors (recommended for documentation):
output_causal_lp <- causal_lp(
x = c("The", "apple", "doesn't", "fall", "far", "from", "the", "tree."),
model = "gpt2"
)
## Processing using causal model ''...
## Processing a batch of size 1 with 10 tokens.
## Text id: 1
## `The apple doesn't fall far from the tree.`
tibble::enframe(output_causal_lp)
## # A tibble: 8 × 2
## name value
## <chr> <dbl>
## 1 The NA
## 2 apple -10.9
## 3 doesn't -5.50
## 4 fall -3.60
## 5 far -2.91
## 6 from -0.745
## 7 the -0.207
## 8 tree. -1.58
data.frame(token = names(output_causal_lp), lp = output_causal_lp, row.names = NULL)
## token lp
## 1 The NA
## 2 apple -10.9004831
## 3 doesn't -5.4999362
## 4 fall -3.5977294
## 5 far -2.9119723
## 6 from -0.7454861
## 7 the -0.2066592
## 8 tree. -1.5817576
causal_lp_mats
From what I can tell, this function outputs a matrix with probabilities of the full vocabulary at each word position in the `x` argument vector. This isn't very clear from the help description though. I can't tell from the example provided why this is output as a matrix rather than a dataframe.
The columns are based on the BPE tokens, not words (for example, there are separate columns for "doesn" and "'t"), so the title/description should be about tokens rather than "possible word". This should also be more clearly documented, since the `causal_lp` output adds the relevant log probs to give the total word probability (I can see why this function is different, but it could be unexpected for the user).
It would be useful to have an example in the help or vignettes of how to run this on multiple sentences, and how to wrangle the output.
causal_tokens_lp_tbl
Returns log probabilities for tokens instead of words.
It is confusing that the `texts` argument seems to be the same as the `x` argument in other functions - is there a reason for the difference? This comes back to my comment above about the `x` argument - perhaps it should generally be separated into two arguments for the `causal_lp` functions, one which is like `texts` here and one which is just for `targets`. Why is list input an option here for `texts` but not for `x`? I don't see an example of how it would be used.
I find the `.id` argument name to be unintuitive. For whatever reason `.id` seems to me like it is requiring an existing ID input, whereas it is asking for an output column name to be set.
masked_lp
Log probs using masked bidirectional models.
See my general comments above about preferring a different argument structure.
masked_tokens_tbl
This function returns the probabilities of all vocabulary tokens for the masked positions. Very useful, but a bit confusing how different this is from the similarly named `causal_tokens_lp_tbl`, which gets probabilities for all of the tokens in the sentence. This one is more similar to `causal_lp_mats` but doesn't return matrices (why the distinction?).
Maybe a naming distinction could be made for when a function is across the vocabulary vs. the "text" tokens?
Test Failures
Failure output from `devtools::test()`:
Failure (test-tr_causal.R:152:3): can handle extra parameters
token_lp$token (`actual`) not equal to token_lp2$token[-1] (`expected`).
`actual[1:4]`: "This" "Ġisn" "'t" "Ġit"
`expected[1:3]`: "Ġisn" "'t" "Ġit"
Failure (test-tr_causal.R:153:3): can handle extra parameters
`token_lp2` (`actual`) not equal to `token_lp3` (`expected`).
`attr(actual, 'row.names')[3:5]`: 3 4 5
`attr(expected, 'row.names')[3:6]`: 3 4 5 6
actual vs expected
token lp
- actual[1, ] This NA
- actual[2, ] Ġisn -6.7738981247
- actual[3, ] 't -0.0009036748
- actual[4, ] Ġit -7.1741533279
- actual[5, ] . -1.0767078400
+ expected[1, ] <|endoftext|> NA
+ expected[2, ] This -4.8580427170
+ expected[3, ] Ġisn -5.5722875595
+ expected[4, ] 't -0.0003263418
+ expected[5, ] Ġit -6.7451515198
+ expected[6, ] . -0.7511805296
`actual$token[1:3]`: "This" "Ġisn" "'t"
`expected$token[1:4]`: "<|endoftext|>" "This" "Ġisn" "'t"
actual$lp | expected$lp
[1] NA | NA [1]
[2] -6.77389812469482 - -4.85804271697998 [2]
[3] -0.000903674808796495 - -5.57228755950928 [3]
[4] -7.17415332794189 - -0.000326341774780303 [4]
[5] -1.07670783996582 - -6.74515151977539 [5]
- -0.751180529594421 [6]
Also a warning:
Warning (test-tr_causal.R:164:3): can handle extra parameters
longer argument not a multiple of length of shorter
Backtrace:
1. pangoling::causal_lp(x = c("This", "is", "it"), add_special_tokens = TRUE)
at test-tr_causal.R:164:2
2. tidytable::pmap(...)
at pangoling/R/tr_causal.R:217:2
4. base::mapply(...)
5. pangoling (local) `<fn>`(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]])
6. pangoling:::word_lp(...)
at pangoling/R/tr_causal.R:233:6
7. tidytable::map2_dbl(...)
at pangoling/R/tr_utils.R:324:2
8. tidytable::map2(.x, .y, .f, ...)
9. base::mapply(.f, .x, .y, MoreArgs = list(...), SIMPLIFY = FALSE)
Thank you both! Very useful! I'm right now in the middle of grading exams and theses. I think the guidelines mention two weeks; I'm not entirely sure if I'll finish in time, but I'll do my best.
Ok, I'm still struggling with Docker to answer the main concern of @utkuturk and check how the installation goes. Hopefully, I'll be able to figure out what an installation from scratch looks like.
In the meantime, I have a question for @lisalevinson (but Utku, feel free to comment).
Lisa was confused by the two uses of `causal_lp`.
Ok, first load the packages.
library(pangoling)
library(tidytable)
There are two main formats that `causal_lp` addresses. One is when you have data word-by-word (or phrase-by-phrase), which is a common stimulus format for many self-paced reading, eye-tracking, and ERP experiments that deal with naturalistic texts (`other_word_level_stuff` represents some other information that one has about the words).
df_psychl <- tidytable(
sent_n =
c(1, 1, 1, 1, 1, 1, 1, 1, 2,2, 2, 2, 2, 2, 2),
word = c("The", "apple", "doesn't","fall", "far", "from", "the", "tree.",
"Don't", "judge", "a", "book", "by", "its", "cover."), other_word_level_stuff = NA
)
df_psychl
#> # A tidytable: 15 × 3
#> sent_n word other_word_level_stuff
#> <dbl> <chr> <lgl>
#> 1 1 The NA
#> 2 1 apple NA
#> 3 1 doesn't NA
#> 4 1 fall NA
#> 5 1 far NA
#> 6 1 from NA
#> 7 1 the NA
#> 8 1 tree. NA
#> 9 2 Don't NA
#> 10 2 judge NA
#> 11 2 a NA
#> 12 2 book NA
#> 13 2 by NA
#> 14 2 its NA
#> 15 2 cover. NA
If the two sentences are not part of one single text, that's when you must use `.by`, otherwise it will think that the two sentences are related.
df_psychl <- df_psychl |>
mutate(lp = causal_lp(x = word, .by = sent_n))
#> Processing using causal model ''...
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Text id: 1
#> `The apple doesn't fall far from the tree.`
#> Text id: 2
#> `Don't judge a book by its cover.`
df_psychl
#> # A tidytable: 15 × 4
#> sent_n word other_word_level_stuff lp
#> <dbl> <chr> <lgl> <dbl>
#> 1 1 The NA NA
#> 2 1 apple NA -10.9
#> 3 1 doesn't NA -5.50
#> 4 1 fall NA -3.60
#> 5 1 far NA -2.91
#> 6 1 from NA -0.745
#> 7 1 the NA -0.207
#> 8 1 tree. NA -1.58
#> 9 2 Don't NA NA
#> 10 2 judge NA -6.27
#> 11 2 a NA -2.33
#> 12 2 book NA -1.97
#> 13 2 by NA -0.409
#> 14 2 its NA -0.257
#> 15 2 cover. NA -1.38
The other use case is an experiment where the stimuli are also sentences; regardless of how you present the context, you only care about the critical region:
df_psychl2 <- tidytable(
item_n =
c(1, 2, 3, 4),
context = c("The apple doesn't fall far from the",
"The apple doesn't fall far from the",
"Don't judge a book by its",
"Don't judge a book by its"),
critical = c("tree.","floor.","cover.","back."),
other_word_level_stuff = NA
)
df_psychl2
#> # A tidytable: 4 × 4
#> item_n context critical other_word_level_stuff
#> <dbl> <chr> <chr> <lgl>
#> 1 1 The apple doesn't fall far from the tree. NA
#> 2 2 The apple doesn't fall far from the floor. NA
#> 3 3 Don't judge a book by its cover. NA
#> 4 4 Don't judge a book by its back. NA
In that case each row is an item, and then `.by` is not needed, but you need to use the `l_contexts` argument.
df_psychl2 <- df_psychl2 |>
mutate(lp = causal_lp(x = critical, l_contexts = context))
#> Processing using causal model ''...
#> Ignoring `.by` argument
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 10 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Processing a batch of size 1 with 9 tokens.
#> Text id: 1
#> `The apple doesn't fall far from the tree.`
#> Text id: 2
#> `The apple doesn't fall far from the floor.`
#> Text id: 3
#> `Don't judge a book by its cover.`
#> Text id: 4
#> `Don't judge a book by its back.`
df_psychl2
#> # A tidytable: 4 × 5
#> item_n context critical other_word_level_stuff lp
#> <dbl> <chr> <chr> <lgl> <dbl>
#> 1 1 The apple doesn't fall far from… tree. NA -1.58
#> 2 2 The apple doesn't fall far from… floor. NA -10.2
#> 3 3 Don't judge a book by its cover. NA -1.38
#> 4 4 Don't judge a book by its back. NA -17.2
Created on 2023-06-07 with reprex v2.0.2
Do you think that having better examples would help? Or that I should name the arguments differently or that I should have two different functions?
Also, for both of you: do you think I should change `x` to `targets` to match the masked models, or change the masked models' `targets` to `x`?
Also, `.by` is the only argument with a dot; I have it like this to match dplyr/tidytable. Should I keep it as it is, or add dots everywhere, or remove dots everywhere? I think tidyverse functions tend to have dots everywhere, so that would make sense, right?