Text search complexity vs dependencies
I'm not sure how lightweight you want stellar to be, but quanteda is a fairly heavy dependency, itself bringing in ggplot2, various Rcpp packages, half the tidyverse, parallel packages, and many, many more.
I tend to find it somewhat imposing on a user's package library to install a lot of new versions of things, any of which may break their workflow. I mean... yes, we're all up-to-date with everything, right?
What about doing just a simple string match search by default, and having the option to do a full corpus/wordstem search on request, with quanteda in Suggests?
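For illustration only, the usual Suggests pattern might look something like this (stem_search is just a hypothetical name for the full wordstem search):

if (requireNamespace ("quanteda", quietly = TRUE)) {
    res <- stem_search (txt, query) # full corpus/wordstem search on request
} else {
    res <- grepl (query, txt, ignore.case = TRUE) # simple string match default
}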
Yeah, that is definitely something I'm aware of, and that would be one workable solution. The big advantage of "proper" text processing is (in this context) really just the stemming, and quanteda is arguably an overly inflated dep just for that. One easily workable alternative might be hunspell, which internally bundles all the core libs, so is impressively lightweight yet still offers stemming.
These approaches are what enable general descriptive phrases to be entered; processing these kinds of things via simple string matching is much harder, and I do think it's nifty to be able to:
stars ("oh, i dunno, it was something about text and graphs or visualisation, and had some circles")
String-matching that would just generate garbage. How about dropping quanteda in favour of hunspell? A much lighter dependency burden that should still enable full functionality.
... then I'd mostly just have to recode the single quanteda::kwic call. That'd be do-able ...
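For a rough idea only, the stemming part of a hunspell-based replacement might look something like this (a sketch, not the actual stellar code; hunspell_stem returns a list of candidate stems for each word):

library (hunspell)

stem1 <- function (words) {
    s <- hunspell_stem (words) # list of candidate stems for each word
    vapply (seq_along (words), function (i)
            if (length (s [[i]]) > 0) s [[i]] [1] else words [i],
            character (1)) # first stem, falling back to the word itself
}

phrase <- c ("graphs", "visualisation", "circles")
descr <- c ("graph", "visualisations", "circle", "packing")
sum (stem1 (phrase) %in% stem1 (descr)) # number of matching stems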
I've been pondering the scale of this issue. In short: it needs another package, because there's been so much amazing development on text analysis in R that has ultimately led to the creation of enormously powerful beasts like quanteda and tm. What I quite often need, and what this kind of package requires, is a straight-up text matching function without the additional overhead of the mega-packages. The closest thing I know of at the mo is stringdist, which is really cool, but it would be relatively straightforward and perhaps even more useful to have a single, isolated, and intentionally as-lightweight-as-possible package for string matching only.
res <- stringmatch (dat, mode = "preferred_mode")
dat could be either a rectangular object or a list, and res would be the same object sorted by match scores according to the specified mode of matching, with those match scores appended. That would just need stemming, done (in my admittedly limited view of the current state of things) in the lightest possible way via hunspell.
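Purely as a sketch of the idea (the function, its arguments, and the description column are all hypothetical, re-using the stem1() helper sketched above), a single "stem" mode over a data.frame might look like:

stringmatch <- function (dat, query, mode = "stem") {
    q <- stem1 (strsplit (tolower (query), "[[:space:]]+") [[1]])
    dat$match_score <- vapply (dat$description, function (d) {
        tk <- stem1 (strsplit (tolower (d), "[[:space:]]+") [[1]])
        sum (q %in% tk) / length (q) # proportion of query stems matched
    }, numeric (1))
    dat [order (dat$match_score, decreasing = TRUE), ] # sorted, score appended
}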
I nevertheless suspect that it will be better to implement the matching in Rcpp because
- It will ultimately be more efficient; and
- The code will ultimately end up more readable than the alternative of one huge purrr::map call.
Each matching step is quite complicated, requiring iterative scanning of all permutations of tokens as demonstrated in current code here.
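To illustrate the combinatorics only (this is not the actual matching code), generating all permutations of a set of query tokens in base R might look like:

perms <- function (x) {
    if (length (x) <= 1) return (list (x))
    do.call (c, lapply (seq_along (x), function (i)
        lapply (perms (x [-i]), function (p) c (x [i], p))))
}
length (perms (c ("text", "graph", "circle", "visualise"))) # 4! = 24 orderings

Each of those orderings would then need to be scanned against the target tokens, which is why compiled Rcpp code looks appealing here.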
That would then replace the quanteda dependency with hunspell, Rcpp, and whatever that package might be called. The package could also potentially be considered for onboarding according to current policies.
The Problem
- I ain't got no time to do this, and it's way beyond scope of what I oughta be doing, making it unlikely to happen
- Is that really such a huge saving, when one admittedly huge package is replaced by three which add up to something potentially only marginally smaller anyway?
The Workaround for the Moment
I could potentially find sufficient time to start a branch of this repo and sketch a plausible version with a single mode of matching for stellar purposes. Doing that would likely give us/me/whomever a clearer idea of what might be required. The time issue would still affect that, but might be able to squeeze together a minimal working demo relatively quickly.
See also rOpenSci's own tokenizers package, which uses the SnowballC package for the hard work.
@jonocarroll thoughts here please. This code tokenizes the Description texts of all R packages using 3 different packages for the task:
db <- tools::CRAN_package_db ()
txt <- db$Description
sw <- stopwords::stopwords ("en")
library (magrittr) # provides the %>% pipe used in rqnt() below
# -------- tokenizers -------
ftok <- function (txt)
{
lapply (txt, function (i) {
d <- strsplit (i, split = " ") [[1]]
tk <- unlist (tokenizers::tokenize_word_stems (d,
stopwords = sw))
tk [!tk == "NA"]
})
}
# -------- hunspell -------
fhun <- function (txt)
{
lapply (txt, function (i) {
d <- strsplit (i, split = " ") [[1]]
tk <- unlist (hunspell::hunspell_stem (d))
tk [!tk %in% sw]
})
}
# -------- quanteda -------
rqnt <- function (txt)
{
txt %>%
quanteda::char_tolower () %>%
quanteda::corpus () %>%
quanteda::texts () %>%
quanteda::tokens () %>%
quanteda::tokens_wordstem ()
}
rbenchmark::benchmark (
ftok (txt),
fhun (txt),
rqnt (txt),
replications = 2)
and the results ...
test replications elapsed relative user.self sys.self user.child sys.child
2 fhun(txt) 2 17.957 7.504 17.093 0.832 0 0
1 ftok(txt) 2 12.439 5.198 12.372 0.046 0 0
3 rqnt(txt) 2 2.393 1.000 2.474 0.013 0 0
quanteda is big for a reason. My main thought: 2.5s is likely an acceptable time for people to unexpectedly have to wait on a first call to stellar or flipper; but once such unexpected waits extend beyond 10s, most folk will likely have already hit CTRL-C.
The further development of both this package and flipper now depends on some kind of opinionated decision regarding size versus speed. See the equivalent issue in stellar.
(I finally had a moment to think about this)
What is the scope of this function? Do you intend to search for any package matching that phrase in some way, or just the user's starred repos? If it's the latter, then that speeds things up significantly:
s <- stellar:::getstars (whoami::gh_username(), NULL, NULL, TRUE) # 391 elements
txt <- s$description
sw <- stopwords::stopwords ("en")
# <snip>
bench::mark (
ftok (txt),
fhun (txt),
rqnt (txt),
iterations = 2,
check = FALSE)
#> # A tibble: 3 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc n_itr
#> <chr> <bch:t> <bch> <bch:> <bch> <dbl> <bch:byt> <dbl> <int>
#> 1 ftok(txt) 107.1ms 109ms 109ms 111ms 9.19 4.86MB 2 2
#> 2 fhun(txt) 411.2ms 510ms 510ms 609ms 1.96 7.98MB 2 2
#> 3 rqnt(txt) 43.3ms 155ms 155ms 266ms 6.47 1.34MB 2 2
#> # ... with 1 more variable: total_time <bch:tm>
Created on 2018-08-15 by the reprex package (v0.2.0).
Admittedly, I have only starred 391 repos, but tokenizing these should be pretty quick. Searching across all packages seems to extend beyond the scope of stellar (that task is entirely suitable elsewhere, i.e. flipper).
Personally, I'd go with tokenizers since it's (a) fast, and (b) rOpenSci.
Yeah, I agree, and actually realised that flipper could simply pre-store/cache the tokenized versions of package descriptions anyway, entirely avoiding the speed issue. I'll sketch out a tokenizers solution here first, then apply it to flipper. Thanks!
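As a rough sketch of that caching idea (the file location and function name are illustrative assumptions only):

cache_file <- file.path (rappdirs::user_cache_dir ("flipper"), "tokens.Rds")

get_tokens <- function (txt) {
    if (file.exists (cache_file))
        return (readRDS (cache_file)) # re-use previously cached tokens
    tk <- tokenizers::tokenize_word_stems (txt,
            stopwords = stopwords::stopwords ("en"))
    dir.create (dirname (cache_file), recursive = TRUE, showWarnings = FALSE)
    saveRDS (tk, cache_file) # tokenize once, store for subsequent calls
    tk
}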