unconf18

Tools for discovering new packages (again)

Open mpadge opened this issue 7 years ago • 7 comments

Direct follow-on from last year's two related issues, thanks to @sfirke. The flipper package is kinda developed, kinda stalled, but I personally would love to get it a bit more developed. It currently does full heavyweight text analysis of the DESCRIPTION files of all CRAN packages and produces a document similarity matrix that is used to connect one package to another.

The original vision of @njtierney was a standard swipe interface, which we re-branded "flip", to enable quick and easy package browsing. In its current state, one can simply run:

flipper::flip ("package about a bunch of interesting stuff")

And it'll find a starting point in the matrix and then traverse strongest connections. We think that alone is kinda nifty, so please try! Required/desired refinements include:

  1. Refining methods of traversing the matrix, including incorporating user stats, with all the associated concerns raised in the previous issue. Extension to an ML framework would be very straightforward, because the whole thing works on fixed-size binary vectors (like/dislike next jump along vector).
  2. As @jimhester pointed out in the original issue, trawling man files is likely to be even more informative. The infrastructure for this is all there, but it might push the limits of text similarity matrix processing?
  3. Extension to all non-CRAN packages on GitHub (I know there's a list somewhere, and @maelle has her excellent is_package function for repo enquiry.)
  4. Slick flippable interface

That's all it would take to have most of the infrastructure there for one to type some text and start flipping through R packages until one discovered something desirable, interesting, or at least unexpected.
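The "find a starting point, then traverse strongest connections" idea can be sketched roughly as below. This is a minimal illustration in Python rather than flipper's actual R implementation: the `flip` function here, the package names, and the similarity values are all hypothetical, and the starting point is given by name instead of being matched from free text.

```python
def flip(similarity, names, start, n_steps=5):
    """Greedy walk: from `start`, keep jumping to the most similar unvisited package."""
    current = names.index(start)
    visited = [current]
    for _ in range(n_steps):
        candidates = [j for j in range(len(names)) if j not in visited]
        if not candidates:
            break  # every package has already been shown
        # Follow the strongest remaining connection from the current package.
        current = max(candidates, key=lambda j: similarity[current][j])
        visited.append(current)
    return [names[i] for i in visited]


# Toy symmetric similarity matrix; names and values are invented for illustration.
names = ["pkg_a", "pkg_b", "pkg_c", "pkg_d"]
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]

path = flip(similarity, names, "pkg_a", n_steps=3)
print(path)
```

A like/dislike signal from the user could replace the plain `max` step, for example by skipping candidates that resemble previously disliked packages.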

mpadge avatar Apr 15 '18 16:04 mpadge

What about incorporating something like https://github.com/ropenscilabs/packagemetrics to the information returned, so that when searching for a package you get not only a description, but indicators of popularity and quality?

noamross avatar Apr 25 '18 03:04 noamross

That's actually a great example. Incorporating new data means either a new network weighting matrix (for edges, or relationships between packages), or a new vector of nodal properties. I've been mostly concentrating on the former, but packagemetrics is a great example of the latter. These are computationally much cheaper, and equally important. Nodal vectors then need to be translated into edge matrices to guide traversal algorithms, and I also haven't explored that translation yet at all. packagemetrics would provide a fine opportunity to develop that too.
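One possible way to translate a nodal vector into an edge matrix, sketched in Python purely as an illustration (this translation has not been explored in flipper, and the weighting rule, function name, and numbers below are assumptions): scale each edge i -> j by the nodal score of the destination j, so traversal prefers neighbours that are both similar and highly rated.

```python
def node_scores_to_edges(similarity, node_score):
    """Weight each edge i -> j by the nodal score of the destination package j."""
    n = len(similarity)
    return [
        [0.0 if i == j else similarity[i][j] * node_score[j] for j in range(n)]
        for i in range(n)
    ]


# Toy inputs, invented for illustration: package 0 is equally similar to 1 and 2.
similarity = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.2],
    [0.5, 0.2, 1.0],
]
quality = [0.9, 0.2, 0.8]  # hypothetical nodal property, e.g. a packagemetrics-style score

edges = node_scores_to_edges(similarity, quality)
# The nodal score breaks the tie: package 2 now beats package 1 as the next jump from 0.
best_from_0 = max(range(len(edges)), key=lambda j: edges[0][j])
print(best_from_0)
```

Note that the resulting matrix is no longer symmetric, which may or may not be desirable for a traversal algorithm.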

mpadge avatar Apr 25 '18 06:04 mpadge

Possibly a project of its own, but one of the ways in which I discover packages is through use cases, often in the form of blog posts. I have yet to come up with an idea that wouldn't be "hackable" in some capacity (i.e. there'd be nothing to stop someone from loading a bunch of packages just for the pings, or whatever), but I'd be curious to think about a way of somehow highlighting package (or even function) usage in a way that ties back to the package itself: package usage in the wild, if you will.

batpigandme avatar Apr 25 '18 09:04 batpigandme

Ah I love this @batpigandme! I've thought about adding this to the guidelines for rOpenSci packages (https://github.com/ropensci/onboarding-meta/issues/39). In practice it'd be a list inside the README, which is maybe not optimal since the README is often .Rbuildignored, although the README does live in the pkgdown website.

maelle avatar Apr 25 '18 09:04 maelle

Obviously I'm open to suggestions of better ways to save this information in the package docs!

maelle avatar Apr 25 '18 09:04 maelle

It occurs to me that some of this work would be applicable to the editorial need of finding authors. For instance, given a package submission, could we identify packages with similar uses, dependencies, or even code patterns? If so, that package's author might make a good reviewer for the submission, and the same analysis could also alert us to overlap between the submission and another package.

That said, I'm painfully aware that journals have such systems for finding potential reviewers for manuscripts and the results are rarely really helpful. Not sure if they are just based on keyword matching or something like that, though.

noamross avatar May 02 '18 11:05 noamross

About a year ago I made a Shiny app to do package recommendation based on an initial selection of package(s) that you are interested in using -- http://recommendr.info/ -- but I never really shared it publicly because the code is a total mess... The app code is here and a helper package here, but the data ingest code is not currently on GitHub and is a total disaster. The app uses two different approaches that lead to different types of recommendations:

  • Package co-use in R scripts found on GitHub (using the GitHub BigQuery data set) -- uses matrix factorization (Ben Frederickson's blog has a nice overview of the technique)
  • Documentation similarity -- uses man pages for packages as input, applies tf-idf, and uses cosine similarity to pick similar packages. This required some tuning of the "cleaning" stage of the NLP processing, as some words in code documentation are not informative, but overly aggressive cleaning of things like punctuation removes highly informative words like "c++".
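The documentation-similarity approach can be sketched as follows. This is a small standard-library Python illustration, not the app's actual code: the tokenizer rule, the plain `log(N / df)` weighting, and the three toy descriptions are all assumptions. The tokenizer keeps trailing `+` characters so that "c++" survives cleaning.

```python
import math
import re
from collections import Counter

# Letters/digits optionally followed by '+' signs, so "c++" is kept as one token.
TOKEN = re.compile(r"[a-z][a-z0-9]*\+*")


def tokenize(text):
    return TOKEN.findall(text.lower())


def tfidf_vectors(docs):
    """Return {name: {term: tf * idf}} using raw counts and idf = log(N / df)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(counts)
    doc_freq = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return {name: {t: tf * idf[t] for t, tf in c.items()} for name, c in counts.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical documentation snippets, invented for illustration.
docs = {
    "pkg_a": "Fast matrix algebra with a C++ backend.",
    "pkg_b": "Bindings to a C++ library for matrix decompositions.",
    "pkg_c": "Tools for reading and plotting weather data.",
}

vectors = tfidf_vectors(docs)
sim_ab = cosine(vectors["pkg_a"], vectors["pkg_b"])
sim_ac = cosine(vectors["pkg_a"], vectors["pkg_c"])
print(sim_ab > sim_ac)
```

A real pipeline would also need stop-word handling and a decision on which other punctuation-bearing tokens deserve the same treatment as "c++".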

The app is now a year out of date, as the data ingest was a one-time thing back in June 2017... I had wanted to update the data ingest to be an actually reproducible process before sharing this, but that has not happened... Thought I'd share now in case folks at the unconf decide to pursue this, as some of the ideas at least might be re-usable! I think the matrix factorization approach could work quite nicely, especially if the dataset of R scripts was expanded to include gists and blog posts...

AliciaSchep avatar May 21 '18 15:05 AliciaSchep