Speed up Jekyll related posts functionality (--lsi, classifier-reborn, gsl, nmatrix, narray, Numo::NArray, Numo::GSL)
(See also: https://github.com/0xdevalias/devalias.net/issues/1)
Jekyll can "create an index for related posts" using the --lsi build command option, which uses the classifier-reborn gem to create a site variable of related posts:
- https://jekyllrb.com/docs/configuration/options/#build-command-options
- https://jekyllrb.com/docs/variables/#site-variables
  > `site.related_posts`: If the page being processed is a Post, this contains a list of up to ten related Posts. By default, these are the ten most recent posts. For high quality but slow to compute results, run the jekyll command with the `--lsi` (latent semantic indexing) option. Also note GitHub Pages does not support the lsi option when generating sites.
- https://jekyll.github.io/classifier-reborn/
- https://jekyll.github.io/classifier-reborn/#dependencies
  > To speed up LSI classification by at least 10x consider installing following libraries.
  >
  > - http://www.gnu.org/software/gsl
  > - https://rubygems.org/gems/gsl
  >
  > Note that LSI will work without these libraries, but as soon as they are installed, classifier will make use of them.
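
The default behaviour quoted above (the ten most recent posts when `--lsi` is not passed) can be modelled in a few lines of plain Ruby. This is only an illustrative sketch: `Post` and `recent_related_posts` are hypothetical names, not Jekyll's API, and Jekyll's real implementation works on its own document objects.

```ruby
require "date"

# Hypothetical stand-in for a Jekyll post; not Jekyll's own class.
Post = Struct.new(:title, :date)

# Model of the non-LSI default: up to `limit` most recent posts,
# newest first, excluding the post currently being rendered.
def recent_related_posts(all_posts, current, limit: 10)
  all_posts
    .reject { |post| post.equal?(current) }
    .sort_by(&:date)
    .reverse
    .first(limit)
end

# Twelve sample posts, one per day; the newest is the "current" page.
posts = (1..12).map { |day| Post.new("Post #{day}", Date.new(2020, 7, day)) }
current = posts.last

related = recent_related_posts(posts, current)
puts related.map(&:title).join(", ")
```

With `--lsi`, the list would instead be ranked by content similarity, which is where the cost (and the GSL speed-up) comes in.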
More info on Jekyll's usage of LSI:
- https://www.aravindiyer.com/tech/yet-another-solution-related-posts-jekyll/
- http://sangsoonam.github.io/2019/01/15/random-related-posts-in-jekyll.html
The gsl gem can make use of nmatrix and narray:
- https://github.com/SciRuby/rb-gsl#nmatrix-and-narray-usage
- https://github.com/SciRuby/nmatrix
- https://github.com/masa16/narray
narray is in maintenance mode, and directs to numo-narray:
- https://github.com/masa16/narray#new-version-is-under-development---rubynumonarray
- https://github.com/ruby-numo/numo-narray
numo-gsl provides a GSL interface for Ruby with Numo::NArray:
- https://github.com/ruby-numo/numo-gsl
I'm unsure whether the numo gems can be used with classifier-reborn, or which of nmatrix/narray is faster, so I opened an issue asking:
- https://github.com/jekyll/classifier-reborn/issues/192
As noted in https://github.com/jekyll/classifier-reborn/issues/193, I'm not sure if classifier-reborn is actively updated/maintained.
nmatrix was last updated in 2018, and at least one issue claims that Numo::NArray outperforms NMatrix:
> Several years have passed since the new version of NArray came out.
>
> It appeared that NMatrix was not being maintained well. And I think Numo::NArray now outperforms NMatrix in almost every way. (benchmark needed)
>
> Newcomers try NMatrix first. After a while, they notice that NArray is far better in performance. And they begin to make libraries dependent on NArray.
rb-gsl was last updated in 2017, and claims compatibility only with GSL versions up to v2.1:
> Ruby/GSL is compatible with GSL versions upto 2.1.
I've asked if it is still maintained, but my guess is probably not:
- https://github.com/SciRuby/rb-gsl/issues/63
My comment in reply to the following StackOverflow question:
- https://stackoverflow.com/questions/51439500/jekyll-build-lsi-make-blog-very-very-slow/63006543#63006543
The `--lsi` option comes from the [`classifier-reborn`][1] gem, which includes the following note about increasing speed under the [dependencies][2] heading:
> To speed up LSI classification by at least 10x consider installing
> following libraries.
>
> [GSL - GNU Scientific Library][3]
>
> [Ruby/GSL Gem][4]
>
> Note that LSI will work without these libraries, but as soon as they
> are installed, classifier will make use of them. No configuration
> changes are needed, we like to keep things ridiculously easy for you.
The [`gsl` gem's installation docs][5] mention:
> the GSL libraries must already be installed before Ruby/GSL can be installed:
>
> - Debian/Ubuntu: +libgsl0-dev+
> - Fedora/SuSE: +gsl-devel+
> - Gentoo: +sci-libs/gsl+
> - OS X: `brew install gsl`
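
Once the native library and the gem are installed, a quick check like the following can confirm that Ruby is able to load `gsl`. It is a minimal sketch: `gsl_available?` is a hypothetical helper name, and it only tests whether `require "gsl"` succeeds, which I'm assuming is roughly what classifier-reborn relies on when deciding whether to use GSL.

```ruby
# Hypothetical helper: returns true if the gsl gem can be loaded,
# false if it (or the native GSL library behind it) is missing.
def gsl_available?
  require "gsl"
  true
rescue LoadError
  false
end

if gsl_available?
  puts "gsl gem loaded; LSI should be able to use it"
else
  puts "gsl gem not available; LSI will still work, just more slowly"
end
```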
The [`gsl` gem can also make use of `nmatrix` or `narray`][6], which I believe may further increase the speed/efficiency:
> In order to use rb-gsl with NMatrix you must first set the NMATRIX
> environment variable and then install rb-gsl:
> - `gem install nmatrix`
> - `export NMATRIX=1`
> - `gem install rb-gsl`
>
> This will compile rb-gsl with NMatrix specific functions.
>
> For using rb-gsl with NArray:
> - `gem install narray`
> - `export NARRAY=1`
> - `gem install rb-gsl`
>
> Note that setting both `NMATRIX` and `NARRAY` variables will lead to
> undefined behaviour. Only one can be used at a time.
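
Since setting both variables leads to undefined behaviour, a small pre-install guard can catch the mistake. This is a hedged sketch and not part of rb-gsl: `gsl_backend` is a hypothetical helper, and it assumes any non-empty value counts as "set".

```ruby
# Hypothetical guard: report which optional backend the environment
# selects for an rb-gsl install, refusing the unsupported combination.
def gsl_backend(env = ENV)
  nmatrix = !env.fetch("NMATRIX", "").empty?
  narray  = !env.fetch("NARRAY", "").empty?

  raise ArgumentError, "set only one of NMATRIX or NARRAY" if nmatrix && narray

  return :nmatrix if nmatrix
  return :narray if narray

  :none
end

puts gsl_backend({ "NMATRIX" => "1" }) # prints "nmatrix"
puts gsl_backend({ "NARRAY" => "1" })  # prints "narray"
puts gsl_backend({})                   # prints "none"
```

Called with no argument, it inspects the real environment, so it could be run just before `gem install`.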
I'm not sure whether `nmatrix` or `narray` is the better/faster choice, though I did open `https://github.com/jekyll/classifier-reborn/issues/192` on the `classifier-reborn` repo.
I did notice that the old [narray GitHub repo][7] mentions that the package is no longer maintained, and instead links to a new version: [Ruby/Numo::NArray][8]
> Numo::NArray is an Numerical N-dimensional Array class for fast processing and easy manipulation of multi-dimensional numerical data, similar to numpy.ndaray. This project is the successor to Ruby/NArray.
Numo::NArray also links to [`numo-gsl`][9], which appears to provide the corresponding GSL bindings:
> GSL interface for Ruby/Numo::NArray
At this stage I'm not sure if `classifier-reborn` is able to make use of any of these numo dependencies, but if it can, my guess is that they are going to be faster/more actively maintained.
[1]: https://jekyll.github.io/classifier-reborn/
[2]: https://jekyll.github.io/classifier-reborn/#dependencies
[3]: http://www.gnu.org/software/gsl
[4]: https://rubygems.org/gems/gsl
[5]: https://github.com/SciRuby/rb-gsl#installation
[6]: https://github.com/SciRuby/rb-gsl#nmatrix-and-narray-usage
[7]: https://github.com/masa16/narray#new-version-is-under-development---rubynumonarray
[8]: https://github.com/ruby-numo/numo-narray
[9]: https://github.com/ruby-numo/numo-gsl
See Also
- https://github.com/0xdevalias/devalias.net/issues/87
Found some benchmarks comparing various underlying options (though they seem rather outdated):
- https://gist.github.com/colstrom/1bc7ea694286c0a96119295a8866a857
  > Ruby Vector Benchmarks (matrix vs nmatrix vs numo/narray)
Looking at the site build times in https://github.com/0xdevalias/devalias.net/pull/86#issuecomment-663932727 with and without --lsi, gsl, nmatrix, etc., the choice seemed to have negligible impact on build time.
In light of that, this thread about optimisations may not even be relevant anymore.
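
For anyone repeating the comparison, build times can be measured along these lines. This is a rough sketch under assumptions: `time_command` is a hypothetical helper, and the `bundle exec jekyll build` invocations in the comments assume a Bundler-managed site, so adjust them to however the site is actually built.

```ruby
require "benchmark"
require "rbconfig"

# Run a command (given as an argv array) and return the elapsed
# wall-clock seconds; raises if the command exits unsuccessfully.
def time_command(argv)
  Benchmark.realtime do
    system(*argv, out: File::NULL) or raise "command failed: #{argv.join(' ')}"
  end
end

# Cheap smoke test using the current Ruby interpreter.
elapsed = time_command([RbConfig.ruby, "-e", "exit"])
puts format("no-op ruby process: %.2fs", elapsed)

# Assumed usage from a Jekyll site directory (not run here):
#   plain = time_command(%w[bundle exec jekyll build])
#   lsi   = time_command(%w[bundle exec jekyll build --lsi])
#   puts format("plain: %.1fs, --lsi: %.1fs", plain, lsi)
```

Running each variant a few times and comparing the spread gives a fairer picture than a single run.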
:wave: Hi,
I stumbled onto this thread from https://github.com/jekyll/classifier-reborn/issues/193.
A few notes that you might find helpful:
- You're not noticing any difference in build times with the `--lsi` option because your site (as it is today in this repo) doesn't use related posts (so the `--lsi` option does nothing). To use LSI, you need to call `site.related_posts` somewhere in a Liquid template. For example, you might add something like the following to `_layouts/post.html`: `{% for post in site.related_posts limit:3 %} <p>{{ post.title }}</p> {% endfor %}`
- When you call `site.related_posts`, if you don't pass the `--lsi` option, it's just recent posts.
- If you are using `site.related_posts` and you pass the `--lsi` option, you'll see `Populating LSI...` in your `jekyll build --lsi` output. The build will be slow unless you have the gsl gem and native gsl library installed. I haven't experimented with nmatrix or narray at all, but simply using the gsl gem results in a ~500x speed increase for my use.
Hope that helps. I appreciated some of your comments on some of the libraries so I thought I'd share some notes with you!
@mkasberg Thanks for the notes and insights :) Much appreciated.
I’d have to look deeper at things (has been a long time since I did), but if the related_posts part isn’t there anymore then I guess I must have removed it from my templates at some stage. I know I had it at one point. Maybe the speed thing was why I removed it.
If/when I get back to looking at this I’ll make sure to check that out first!