Speed up jekyll related posts functionality (--lsi, classifier-reborn, gsl, nmatrix, narray, Numo::NArray, Numo::GSL)

Open 0xdevalias opened this issue 5 years ago • 6 comments

(See also: https://github.com/0xdevalias/devalias.net/issues/1)

Jekyll can "create an index for related posts" using the --lsi build command option, which uses the classifier-reborn gem to create a site variable of related posts:

  • https://jekyllrb.com/docs/configuration/options/#build-command-options
  • https://jekyllrb.com/docs/variables/#site-variables
    • site.related_posts: If the page being processed is a Post, this contains a list of up to ten related Posts. By default, these are the ten most recent posts. For high quality but slow to compute results, run the jekyll command with the --lsi (latent semantic indexing) option. Also note GitHub Pages does not support the lsi option when generating sites.

  • https://jekyll.github.io/classifier-reborn/
  • https://jekyll.github.io/classifier-reborn/#dependencies
    • To speed up LSI classification by at least 10x consider installing following libraries.

      • http://www.gnu.org/software/gsl
      • https://rubygems.org/gems/gsl
    • Note that LSI will work without these libraries, but as soon as they are installed, classifier will make use of them.
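
Since the gem is picked up automatically once installed, it can be handy to confirm whether it is actually loadable in the environment that runs the build. The following is only a minimal sketch: `library_loadable?` is a hypothetical helper written for this note, not something provided by classifier-reborn or Jekyll.

```ruby
# Hypothetical helper (not part of classifier-reborn or Jekyll): reports
# whether a library can be required in the current Ruby environment.
def library_loadable?(name)
  require name
  true
rescue LoadError
  false
end

if library_loadable?('gsl')
  puts 'gsl gem found: LSI should be able to make use of it'
else
  puts 'gsl gem not found: LSI should still work, just more slowly'
end
```

Running this with the same Ruby and Bundler setup as the site build (for example via `bundle exec ruby`) gives the most meaningful answer, since a gem installed globally may not be visible to the bundle.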

More info on Jekyll's usage of LSI:

  • https://www.aravindiyer.com/tech/yet-another-solution-related-posts-jekyll/
  • http://sangsoonam.github.io/2019/01/15/random-related-posts-in-jekyll.html

The gsl gem can make use of nmatrix and narray:

  • https://github.com/SciRuby/rb-gsl#nmatrix-and-narray-usage
  • https://github.com/SciRuby/nmatrix
  • https://github.com/masa16/narray

narray is in maintenance mode, and directs to numo-narray:

  • https://github.com/masa16/narray#new-version-is-under-development---rubynumonarray
  • https://github.com/ruby-numo/numo-narray

numo-gsl provides a GSL interface for Ruby with Numo::NArray:

  • https://github.com/ruby-numo/numo-gsl

I'm unsure if the numo gems can be used with classifier-reborn, and which of nmatrix/narray provide better speed; but I created an issue asking:

  • https://github.com/jekyll/classifier-reborn/issues/192

As noted in https://github.com/jekyll/classifier-reborn/issues/193, I'm not sure if classifier-reborn is actively updated/maintained.


nmatrix was last updated in 2018, and at least one issue claims that Numo::NArray outperforms NMatrix:

Several years have passed since the new version of NArray came out.

It appeared that NMatrix was not being maintained well. And I think Numo::NArray now outperforms NMatrix in almost every way. (benchmark needed)

Newcomers try NMatrix first. After a while, they notice that NArray is far better in performance. And they begin to make libraries dependent on NArray.


rb-gsl was last updated in 2017, and claims compatibility only with GSL versions up to v2.1:

Ruby/GSL is compatible with GSL versions upto 2.1.

I've asked if it is still maintained, but my guess is probably not:

  • https://github.com/SciRuby/rb-gsl/issues/63

My comment in reply to the following StackOverflow question:

  • https://stackoverflow.com/questions/51439500/jekyll-build-lsi-make-blog-very-very-slow/63006543#63006543

The `--lsi` option comes from the [`classifier-reborn`][1] gem, which includes the following note about increasing speed under the [dependencies][2] heading:

> To speed up LSI classification by at least 10x consider installing
> following libraries.
> 
> [GSL - GNU Scientific Library][3]
>
> [Ruby/GSL Gem][4]
> 
> Note that LSI will work without these libraries, but as soon as they
> are installed, classifier will make use of them. No configuration
> changes are needed, we like to keep things ridiculously easy for you.

The [`gsl` gem's installation docs][5] mention:

> the GSL libraries must already be installed before Ruby/GSL can be installed:
>
> - Debian/Ubuntu: `libgsl0-dev`
> - Fedora/SuSE: `gsl-devel`
> - Gentoo: `sci-libs/gsl`
> - OS X: `brew install gsl`

The [`gsl` gem can also make use of `nmatrix` or `narray`][6], which I believe may further increase the speed/efficiency:

> In order to use rb-gsl with NMatrix you must first set the NMATRIX
> environment variable and then install rb-gsl:
> - `gem install nmatrix`
> - `export NMATRIX=1`
> - `gem install rb-gsl`
> 
> This will compile rb-gsl with NMatrix specific functions.
> 
> For using rb-gsl with NArray:
> - `gem install narray`
> - `export NARRAY=1`
> - `gem install rb-gsl`
> 
> Note that setting both `NMATRIX` and `NARRAY` variables will lead to
> undefined behaviour. Only one can be used at a time.

I'm not sure whether `nmatrix` or `narray` is the better/faster choice, though I did open `https://github.com/jekyll/classifier-reborn/issues/192` on the `classifier-reborn` repo.
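
Because the README quoted above warns that setting both `NMATRIX` and `NARRAY` leads to undefined behaviour, a small check before running `gem install` can catch that mistake. This is a sketch only: `requested_gsl_backend` is a hypothetical helper, and treating any non-empty value as "set" is my assumption rather than something the rb-gsl docs state.

```ruby
# Hypothetical pre-install check (not part of rb-gsl): works out which
# optional array backend the NMATRIX / NARRAY environment variables request.
# Assumption: any non-empty value counts as "set".
def requested_gsl_backend(env = ENV)
  nmatrix = !env['NMATRIX'].to_s.empty?
  narray  = !env['NARRAY'].to_s.empty?

  if nmatrix && narray
    raise ArgumentError, 'Set only one of NMATRIX or NARRAY before installing rb-gsl'
  end

  return :nmatrix if nmatrix
  return :narray if narray

  :none
end

puts "Requested backend: #{requested_gsl_backend}"
```

The `env` parameter defaults to the real environment but accepts a plain Hash, which makes the check easy to try out without exporting anything.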

I did notice that the old [narray GitHub repo][7] mentions that the package is no longer maintained, and instead links to a new version: [Ruby/Numo::NArray][8]

> Numo::NArray is an Numerical N-dimensional Array class for fast processing and easy manipulation of multi-dimensional numerical data, similar to numpy.ndaray. This project is the successor to Ruby/NArray.

Numo::NArray also links to [`numo-gsl`][9], which appears to provide the corresponding GSL bindings:

> GSL interface for Ruby/Numo::NArray

At this stage I'm not sure if `classifier-reborn` is able to make use of any of these numo dependencies, but if it can, my guess is that they are going to be faster/more actively maintained.

  [1]: https://jekyll.github.io/classifier-reborn/
  [2]: https://jekyll.github.io/classifier-reborn/#dependencies
  [3]: http://www.gnu.org/software/gsl
  [4]: https://rubygems.org/gems/gsl
  [5]: https://github.com/SciRuby/rb-gsl#installation
  [6]: https://github.com/SciRuby/rb-gsl#nmatrix-and-narray-usage
  [7]: https://github.com/masa16/narray#new-version-is-under-development---rubynumonarray
  [8]: https://github.com/ruby-numo/numo-narray
  [9]: https://github.com/ruby-numo/numo-gsl

See Also

  • https://github.com/0xdevalias/devalias.net/issues/87

0xdevalias avatar Jul 21 '20 02:07 0xdevalias

Found some benchmarks comparing the various underlying libraries (though they seem rather outdated):

  • https://gist.github.com/colstrom/1bc7ea694286c0a96119295a8866a857
    • Ruby Vector Benchmarks (matrix vs nmatrix vs numo/narray)

0xdevalias avatar Jul 22 '20 00:07 0xdevalias

Looking at the site build times in https://github.com/0xdevalias/devalias.net/pull/86#issuecomment-663932727 with and without --lsi, gsl, nmatrix, etc., none of the combinations seemed to have more than a negligible impact.

In light of that, this thread about optimisations may not even be relevant anymore.

0xdevalias avatar Jul 26 '20 04:07 0xdevalias

:wave: Hi,

I stumbled onto this thread from https://github.com/jekyll/classifier-reborn/issues/193.

A few notes that you might find helpful:

  • You're not noticing any difference in build times with the --lsi option because your site (as it is today in this repo) doesn't use related posts (so the --lsi option does nothing). To use LSI, you need to call site.related_posts somewhere in a Liquid template. For example, you might add something like the following to _layouts/post.html:
    {% for post in site.related_posts limit:3 %}
      <p>{{ post.title }}</p>
    {% endfor %}
    
  • When you call site.related_posts, if you don't pass the --lsi option, it's just recent posts.
  • If you are using site.related_posts and you pass the --lsi option, you'll see Populating LSI... in your jekyll build --lsi output. The build will be slow unless you have the gsl gem and the native GSL library installed. I haven't experimented with nmatrix or narray at all, but simply using the gsl gem results in a ~500x speed increase for my use.
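
To make the non-LSI fallback concrete, here is a simplified sketch of "just recent posts". It is not Jekyll's actual code: `Post` and `recent_related_posts` are hypothetical stand-ins, and it assumes the posts are already sorted oldest to newest.

```ruby
# Stand-in for a Jekyll post; a real site would use Jekyll's own objects.
Post = Struct.new(:title, :date)

# Simplified sketch of the non-LSI fallback: the most recent posts, newest
# first, excluding the post currently being rendered.
# Assumes `posts` is sorted oldest to newest.
def recent_related_posts(posts, current, limit = 10)
  (posts.last(limit + 1).reverse - [current]).first(limit)
end

posts = (1..12).map { |day| Post.new("Post #{day}", Time.utc(2020, 7, day)) }
current = posts.last

related = recent_related_posts(posts, current, 3)
puts related.map(&:title).join(', ')
# prints "Post 11, Post 10, Post 9"
```

Note that nothing here looks at post content, which is why this path is cheap and why its results are the same for every post apart from the exclusion of the current one.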

Hope that helps. I appreciated some of your comments on some of the libraries so I thought I'd share some notes with you!

mkasberg avatar May 22 '21 21:05 mkasberg

@mkasberg Thanks for the notes and insights :) Much appreciated.

I’d have to look deeper at things (it has been a long time since I did), but if the related_posts part isn’t there anymore then I guess I must have removed it from my templates at some stage. I know I had it at one point. Maybe the speed thing was why I removed it.

If/when I get back to looking at this I’ll make sure to check that out first!

0xdevalias avatar May 22 '21 23:05 0xdevalias