feat(lyra): Add WebAssembly support
## Context
This PR introduces Rust + WebAssembly support in Lyra, as requested privately by @micheleriva. I'm currently targeting Node.js only, but you can extend this PR as needed to support Deno, browsers, and other JS runtimes.
As a motivating example, we were asked to write a skeleton that replaces the `intersectTokenScores` function originally defined here in favor of the `intersect_token_scores` function defined in the new `lyra-utils` crate here (and exposed to TypeScript via the `lyra-utils-wasm` crate here).
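For reference, here is a minimal sketch of what such an intersection does; the exact signature and scoring behavior live in the linked crates, so treat the types and the score-summing below as assumptions, not the PR's actual code:

```ts
// Hypothetical shape: each token's postings are [documentId, score] pairs.
type TokenScore = [string, number];

// Keep only documents that appear in every token's postings, summing their
// scores. The real lyra-utils implementation may differ in details such as
// score combination and output ordering.
export function intersectTokenScoresJs(arrays: TokenScore[][]): TokenScore[] {
  if (arrays.length === 0) return [];

  // Seed with the first token's postings.
  let acc = new Map<string, number>(arrays[0]);

  for (let i = 1; i < arrays.length; i++) {
    const next = new Map<string, number>();
    for (const [id, score] of arrays[i]) {
      const prev = acc.get(id);
      if (prev !== undefined) next.set(id, prev + score);
    }
    acc = next;
  }

  return [...acc.entries()];
}
```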
## Tests
It should be noted that tests are currently failing, but we believe that is only due to a different ordering strategy used by Rust for `intersect_token_scores`. However, we invite the Lyra authors to check carefully that this is the case.
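If the ordering hypothesis is right, a quick way to verify it is to compare the two outputs order-insensitively before diffing them; a sketch (`a`/`b` stand for the JS and Rust results):

```ts
type TokenScore = [string, number];

// Sort both result sets by document id before comparing, so that a pure
// ordering difference between the JS and Rust implementations is not
// reported as a mismatch.
function sameUpToOrder(a: TokenScore[], b: TokenScore[]): boolean {
  const left = [...a].sort(([x], [y]) => x.localeCompare(y));
  const right = [...b].sort(([x], [y]) => x.localeCompare(y));
  return (
    left.length === right.length &&
    left.every(([id, score], i) => id === right[i][0] && score === right[i][1])
  );
}
```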
## How to build Rust → Wasm artifacts
### With Rust and Node
- Install Rust v1.65
- Install Node v16.15.1 or later
- `cargo update -p wasm-bindgen`
- `cargo install -f [email protected]`
- `cd ./rust`
- `(cd ./scripts && npm i)`
- Optional: `export LYRA_WASM_PROFILE="release"`
- Optional: `export LYRA_WASM_TARGET="nodejs"`
- `node ./scripts/wasmAll.mjs`
### With Docker (used by the CI)
- Install Docker
- Run the following commands:

```sh
docker buildx build --load \
  -f Dockerfile --build-context rust=rust \
  . -t lyrasearch/lyra-wasm \
  --progress plain
docker create --name tmp lyrasearch/lyra-wasm
docker cp tmp:/opt/app/src/wasm ./src/wasm
docker rm -f tmp
```
In both cases, you should observe the following artifacts in `./src/wasm/`:

- `lyra_utils_wasm_bg.wasm`
- `lyra_utils_wasm_bg.wasm.d.ts`
- `lyra_utils_wasm.d.ts`
- `lyra_utils_wasm.js`
These artifacts will need to be included in the bundler of your choice. Moreover, you likely do not wish to store them in the repo, but would rather generate them on the fly in the CI. Feel free to change this as you see fit.
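If you skip the bundler on Node.js, importing the generated binding directly should also work; a minimal sketch, assuming the artifacts stay under `./src/wasm/` (double-check the exported names against `lyra_utils_wasm.d.ts`):

```ts
// The wasm-bindgen-generated .js glue instantiates lyra_utils_wasm_bg.wasm
// itself, so no manual WebAssembly.instantiate() plumbing is needed here.
const { intersectTokenScores } = await import('./src/wasm/lyra_utils_wasm.js');

// intersectTokenScores can now replace the pure-JS implementation.
```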
The PR is working fine on Node.js via ESM and CJS. The `TextEncoder` class breaks on browsers; trying to fix it right now.
Life is pain, but this PR is suffering folks
Note: in the CI, I see warnings about `require`s in the automatically generated Wasm bindings.
When bundling "nodejs esm" modules, `wasm-bindgen` should be called with the `--target=bundler` option, as documented here.
WASM support is done on browsers and servers. Now it's time to fix the GH CI.
CI is fixed, but don't merge yet
Ok so @ShogunPanda, @jkomyno, this PR is ready to get merged.
My main concern is that it will force us to introduce a breaking change by making the `search` function async, as we need to use dynamic imports to load the correct WASM file for every runtime.
The problem is not with the breaking change per se, but with the fact that using a promise could reduce the throughput when running multiple search queries in a row.
My idea would be to either:

- Create a `searchAsync` function (or one with a similar name) that returns a promise and uses WASM underneath (see the sketch after this list)
- Move `search` to a promise-based approach to support WASM without breaking on different runtimes
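A minimal sketch of the first option; the types, the module path, and the caching strategy are all placeholders, not this PR's actual code:

```ts
// Placeholder types standing in for Lyra's real exported types.
type Lyra = Record<string, unknown>;
type SearchParams = { term: string };
type SearchResult = { hits: unknown[]; count: number };

// Cache the dynamic import so only the first search pays the loading cost;
// the path is a stand-in for whatever per-runtime resolution we end up with.
let wasmModule: Promise<any> | undefined;

export async function searchAsync(db: Lyra, params: SearchParams): Promise<SearchResult> {
  wasmModule ??= import('./wasm/lyra_utils_wasm.js');
  const { intersectTokenScores } = await wasmModule;

  // ...run the usual search pipeline, delegating the hot token-score
  // intersection step to the WASM binding...
  void intersectTokenScores;
  return { hits: [], count: 0 }; // placeholder result
}
```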
A nice fact is that, given how Lyra is built, functions are totally isolated and tree-shakable, so it's up to the user to decide whether they want to import the WASM implementation or use the default JS fallback.
Pros of making `search` async by default:

- The WASM implementation is a LOT faster (official benchmarks will come soon), so the throughput might not be a real problem IMHO

Cons of making `search` async:

- It would force the user to download a ~100 kB WASM file; not a problem on the server, but I can see why people might not like this in a browser.
I think maintaining two separate search functions could also be a problem if we don't split up the function's internals, but by doing that we would likely slow down the overall execution by introducing more context switching and function calls.
Let me know what you think folks 🙏
> The problem is not with the breaking change per se, but with the fact that using a promise could reduce the throughput when running multiple search queries in a row.
How would it reduce the throughput?
Also, my 2 cents: I'd introduce the breaking change by making `search` return a promise. However, that doesn't mean it will be 100% async: even when returning a promise, I think it will be effectively synchronous when no WASM is involved.
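A sketch of what I mean, with made-up helper names (`usesWasm`, `searchSync`, `searchWasm`): an `async` function that never awaits on the JS-only path returns an already-resolved promise, so the body itself still runs synchronously.

```ts
type Lyra = Record<string, unknown>;
type SearchParams = { term: string };
type SearchResult = { hits: unknown[] };

declare function usesWasm(db: Lyra): boolean;
declare function searchSync(db: Lyra, params: SearchParams): SearchResult;
declare function searchWasm(db: Lyra, params: SearchParams): Promise<SearchResult>;

export async function search(db: Lyra, params: SearchParams): Promise<SearchResult> {
  if (!usesWasm(db)) {
    // No await on this path: the work happens synchronously and the caller
    // merely receives an already-resolved promise.
    return searchSync(db, params);
  }
  return searchWasm(db, params);
}
```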
I'd go for the async search; I don't see why performance would be inferior. When I changed the tree I was considering proposing it, because I wanted to make `findAllWords` async. Let's benchmark the current implementation against an async one and see the difference.
@RafaelGSS
> How would it reduce the throughput?
There's a lot of literature regarding promise performance, and I am wondering whether that is a legitimate concern for Lyra; just to name a couple of great articles (there are also benchmarks in there):
- https://madelinemiller.dev/blog/javascript-promise-overhead
- https://fusebit.io/blog/promise-performance-node
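For context, the kind of per-call overhead those posts measure can be reproduced with a toy loop like this (a sketch; run it as an ESM module, and expect absolute numbers to vary a lot across Node.js versions):

```ts
// Awaiting even an already-resolved promise schedules a microtask, which is
// where the per-call overhead discussed in the linked articles comes from.
const N = 1_000_000;

const syncInc = (x: number) => x + 1;
const asyncInc = async (x: number) => x + 1;

let t = performance.now();
let acc = 0;
for (let i = 0; i < N; i++) acc = syncInc(acc);
console.log('sync :', (performance.now() - t).toFixed(1), 'ms');

t = performance.now();
acc = 0;
for (let i = 0; i < N; i++) acc = await asyncInc(acc);
console.log('async:', (performance.now() - t).toFixed(1), 'ms');
```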
But you're the expert here Rafael, I trust you 🙂
I know @ShogunPanda has a different opinion on promises though, so let's wait for him too.
One other concern I have is about DX and consistency: Lyra will expose the following fundamental functions:
- `create` (sync)
- `insert` (sync)
- `delete` (sync)
- `search` (async)
- `insertBatch` (async)
While the reason why `insertBatch` is async would be clear to the user (as stated in the docs, it avoids event-loop freezes), it's not immediately clear why we can create a new db and insert and delete data synchronously, but need to search asynchronously. I think having a consistent approach across all the "key" functions would be better.
About @marco-ippolito's comment: I think we could easily make a couple of `search` internals async as well. So yes, that's another good point for an async `search`.
That said, I'd also personally prefer making search async, these are just my concerns.
I already anticipated this to @micheleriva directly, but I'm leaving my opinion here just in case.
I think we should go with an approach similar to fastify and other packages: the `search` function might accept `intersectTokenScores` as a function argument and, based on it, decide whether it should return a promise.
On the caller side, the developer can either choose to always `await` or to detect it.
In other words, the function will become something like this (pseudocode):
```ts
function search<T>(/* */): T | Promise<T> {
  // Do something
  const intersected = intersectTokenScores(/* ... */); // Note that intersectTokenScores comes from arguments

  if (typeof intersected.then === 'function') {
    return intersected.then(sets => finalizeSearch(sets))
  } else {
    return finalizeSearch(intersected)
  }
}
```
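On the caller side, the two options mentioned above would look like this (a sketch; the declarations just mirror the pseudocode):

```ts
declare function search<T>(db: unknown, params: { term: string }): T | Promise<T>;
declare const db: unknown;

// Option 1 - detect: only touch the promise machinery when the result is
// actually asynchronous (i.e. when an async intersecter was plugged in).
const maybe = search<string[]>(db, { term: 'foo' });
const results = maybe instanceof Promise ? await maybe : maybe;

// Option 2 - always await: awaiting a plain value resolves immediately.
const results2 = await search<string[]>(db, { term: 'foo' });
```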
@ShogunPanda my main concern with this approach is that it can be confusing for some developers... here's my proposal.
I'd ship WASM as an experimental feature for now, that can be enabled in the following way:
```js
import { create } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})
```
The `optimizations` property will contain all the code optimizations that we will perform on Lyra. While in an experimental status, the `wasm` property will be `false` by default.
Given that we always pass a Lyra instance to the `search` function, we can easily detect whether the `wasm` option is set to `true` or `false`:
```js
import { create, search } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})

search(db, { term: 'foo' }) // <--- search knows if wasm interop is enabled
```
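Internally, the branch could then be as small as this (a sketch, assuming the flag lives at `db.optimizations.wasm` and that `searchJs`/`searchWasm` are hypothetical implementations):

```ts
type Lyra = { optimizations?: { wasm?: boolean } };
type SearchParams = { term: string };

declare function searchJs(db: Lyra, params: SearchParams): unknown;
declare function searchWasm(db: Lyra, params: SearchParams): Promise<unknown>;

// The Lyra instance travels with every call, so the flag chosen at create()
// time is always available at search time.
function search(db: Lyra, params: SearchParams) {
  return db.optimizations?.wasm ? searchWasm(db, params) : searchJs(db, params);
}
```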
Given that any Lyra instance is essentially an object, we can override this value with ease on demand:
```js
import { create } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})

db.optimizations.wasm = false
```
I'm not sure I like it, but might be useful for benchmarks and tests while in an experimental stage.
With that being said, we could easily change the `search` type signature from `T` to `Promise<T>` depending on the Lyra instance passed as the first argument, but I see a bad DX pattern there: by mutating the original `optimizations.wasm` property, we would make the signature uncertain and possibly non-deterministic.
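To make that concrete, the signature would need a conditional type along these lines (a sketch with made-up type names); the return type is fixed at compile time from the instance's type, so a runtime mutation of `optimizations.wasm` silently contradicts it:

```ts
type Optimizations = { wasm: boolean };
type Lyra<O extends Optimizations = Optimizations> = { optimizations: O };
type SearchResult = { hits: unknown[] };

// The return type is decided from the *static* type of the instance.
declare function search<O extends Optimizations>(
  db: Lyra<O>,
  params: { term: string }
): O['wasm'] extends true ? Promise<SearchResult> : SearchResult;

declare const db: Lyra<{ wasm: true }>;

// The type system cannot see this runtime mutation...
(db.optimizations as Optimizations).wasm = false;

// ...so `r` is still typed Promise<SearchResult>, while the call would now
// actually behave synchronously.
const r = search(db, { term: 'foo' });
```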
I'd propose then to either move on with a Promise by default or create an `asyncSearch` function to take care of this.
> I already anticipated this to @micheleriva directly, but I'm leaving my opinion here just in case.
>
> I think we should go with an approach similar to fastify and other packages: the `search` function might accept `intersectTokenScores` as a function argument and, based on it, decide whether it should return a promise. On the caller side, the developer can either choose to always `await` or to detect it.
>
> In other words, the function will become something like this (pseudocode):
>
> ```ts
> function search<T>(/* */): T | Promise<T> {
>   // Do something
>   const intersected = intersectTokenScores(/* ... */); // Note that intersectTokenScores comes from arguments
>
>   if (typeof intersected.then === 'function') {
>     return intersected.then(sets => finalizeSearch(sets))
>   } else {
>     return finalizeSearch(intersected)
>   }
> }
> ```
I'm usually not a big fan of this approach: it's kinda hard to maintain (fast-jwt `verify` 💀) and to build new components on top of, because you always have to think about whether it will return a promise or not. Most of the time you will put an `await` in front regardless, just to avoid dealing with the dual return type.
> I'd propose then to either move on with a Promise by default or create an `asyncSearch` function to take care of this.
I'd choose whether to go promise by default or `asyncSearch` based on the benchmark results.
@micheleriva @marco-ippolito I see both your points. Let's try something different.
What about exposing two different interfaces and forbidding passing optimizations to the regular one?
Something like allowing both of the following snippets to be valid:
```js
import { create, search } from '@lyrasearch/lyra'

/*
  This create does not allow for optimizations.
  We specifically validate this in JavaScript (instead of relying on TS types) so that we
  can guide the user in picking the right one.
*/
const db = create(/* ... */)
const results = search(db)
```
and
```js
import { create, search } from '@lyrasearch/lyra/async'

/*
  Technically create should not be async (yet).
  But since we are establishing the new interface we declare it as async now so
  that we don't need a breaking change later.
  Create here accepts optimizations.
*/
const db = await create(/* ... */)
const results = await search(db)
```
At the beginning of each method, in both versions, I would add a quick check to avoid mixing the APIs. This way the DX should be fine, and we will have room for future expansion. Moreover, it will encourage us to write composable code.
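The quick check could be as small as tagging each instance with the entrypoint that created it; a sketch with a made-up symbol name:

```ts
// Hidden tag recording which entrypoint created the instance.
const kAsyncApi = Symbol('lyra.asyncApi');

type Lyra = { [kAsyncApi]?: boolean };

// Called at the top of every method of the sync entrypoint
// (the async entrypoint performs the mirror-image check).
function assertSyncInstance(db: Lyra): void {
  if (db[kAsyncApi]) {
    throw new Error(
      "This instance was created via '@lyrasearch/lyra/async'; import search/insert from there as well."
    );
  }
}
```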
@ShogunPanda not a bad idea. So we'd basically alias the methods, right?
Pretty much. In the async version we allow most parts (intersecter, tokenizer, stemmer and so forth) to be pluggable and async.
@ShogunPanda I love this idea. It's like `require('fs').promises`.
> @ShogunPanda I love this idea. It's like `require('fs').promises`.
How do you know where I copied the idea from? 😁
@jkomyno I keep on getting the following error when importing the compiled JS binding:
```
1) tests/lyra.dataset.test.ts
   lyra.dataset should correctly populate the database with a large dataset

`unwrap_throw` failed:
Error: `unwrap_throw` failed
    at module.exports.__wbindgen_throw (src/wasm/artifacts/nodejs/lyra_utils_wasm.js:178:9)
    at wasm://wasm/000671c6:wasm-function[126]:0x13c5f
    at wasm://wasm/000671c6:wasm-function[8]:0x51f5
    at Object.module.exports.intersectTokenScores (src/wasm/artifacts/nodejs/lyra_utils_wasm.js:128:20)
    at intersectTokenScores (src/wasm/loader.ts:8:24)
    at search (src/lyra.ts:673:57)
    at Test.<anonymous> (tests/lyra.dataset.test.ts:71:28)
```
Here is the full CI log: https://github.com/LyraSearch/lyra/actions/runs/3628897502/jobs/6120466268#step:5:627
any idea?
Update: we decided that ALL the Lyra functions will be async. Currently working on this.
All Lyra functions are now async. WASM support will be experimental, and users will have to opt-in via the following interface:
```js
import { create } from '@lyrasearch/lyra'
import { intersectTokenScores } from '@lyrasearch/lyra/dist/esm/wasm/intersectTokenScores' // placeholder

await create({
  schema: {
    foo: 'string'
  },
  components: {
    algorithms: {
      intersectTokenScores
    }
  }
})
```
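With that interface, call sites stay uniform regardless of which implementation is plugged in; a usage sketch (`db` stands for the instance returned by the `create` call above, and the declarations are placeholders):

```ts
declare const db: unknown;
declare function search(db: unknown, params: { term: string }): Promise<{ hits: unknown[] }>;

// Whether intersectTokenScores is the WASM build or the default JS one is
// invisible here: every call is awaited the same way.
const results = await search(db, { term: 'foo' });
```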
This will provide a unified interface that will allow people to bring their own optimizations, and optionally opt-in for the built-in ones.
Gonna merge this and provide a separate build system for our WASM optimizations.
Thank you all folks, what a ride!