feat(lyra): Add WebAssembly support
## Context
This PR introduces Rust + WebAssembly support in Lyra, as requested privately by @micheleriva. I'm currently targeting Node.js only, but you can extend this PR as needed to support Deno, browsers, and other JS runtimes.
As a motivating example, we were asked to write a skeleton that replaces the `intersectTokenScores` function originally defined here in favor of the `intersect_token_scores` function defined in the new `lyra-utils` crate here (and exposed to TypeScript via the `lyra-utils-wasm` crate here).
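For reference, here is a minimal sketch of what such an intersection does; the exact signature and scoring behavior live in the linked crates, so treat the types and the score-summing below as assumptions, not the PR's actual code:

```ts
// Hypothetical shape: each token's postings are [documentId, score] pairs.
type TokenScore = [string, number];

// Keep only documents that appear in every token's postings, summing their
// scores. The real lyra-utils implementation may differ in details such as
// score combination and output ordering.
export function intersectTokenScoresJs(arrays: TokenScore[][]): TokenScore[] {
  if (arrays.length === 0) return [];

  // Seed with the first token's postings.
  let acc = new Map<string, number>(arrays[0]);

  for (let i = 1; i < arrays.length; i++) {
    const next = new Map<string, number>();
    for (const [id, score] of arrays[i]) {
      const prev = acc.get(id);
      if (prev !== undefined) next.set(id, prev + score);
    }
    acc = next;
  }

  return [...acc.entries()];
}
```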
## Tests
It should be noted that tests are currently failing, but we believe that is only due to a different ordering strategy used by Rust for `intersect_token_scores`. However, we invite the Lyra authors to check carefully that this is the case.
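If the ordering hypothesis is right, a quick way to verify it is to compare the two outputs order-insensitively before diffing them; a sketch (`a`/`b` stand for the JS and Rust results):

```ts
type TokenScore = [string, number];

// Sort both result sets by document id before comparing, so that a pure
// ordering difference between the JS and Rust implementations is not
// reported as a mismatch.
function sameUpToOrder(a: TokenScore[], b: TokenScore[]): boolean {
  const left = [...a].sort(([x], [y]) => x.localeCompare(y));
  const right = [...b].sort(([x], [y]) => x.localeCompare(y));
  return (
    left.length === right.length &&
    left.every(([id, score], i) => id === right[i][0] && score === right[i][1])
  );
}
```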
## How to build Rust → Wasm artifacts
### With Rust and Node
- Install Rust v1.65
- Install Node v16.15.1 or later
- `cargo update -p wasm-bindgen`
- `cargo install -f [email protected]`
- `cd ./rust`
- `(cd ./scripts && npm i)`
- Optional: `export LYRA_WASM_PROFILE="release"`
- Optional: `export LYRA_WASM_TARGET="nodejs"`
- `node ./scripts/wasmAll.mjs`
### With Docker (used by the CI)
- Install Docker
- Run the following commands:

```sh
docker buildx build --load \
  -f Dockerfile --build-context rust=rust \
  . -t lyrasearch/lyra-wasm \
  --progress plain
docker create --name tmp lyrasearch/lyra-wasm
docker cp tmp:/opt/app/src/wasm ./src/wasm
docker rm -f tmp
```
In both cases, you should observe the following artifacts in `./src/wasm/`:

- `lyra_utils_wasm_bg.wasm`
- `lyra_utils_wasm_bg.wasm.d.ts`
- `lyra_utils_wasm.d.ts`
- `lyra_utils_wasm.js`
These artifacts will need to be included in the bundler of your choice. Moreover, you likely do not wish to store them in the repo, but would rather generate them on the fly in the CI. Feel free to change this as you see fit.
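If you skip the bundler on Node.js, importing the generated binding directly should also work; a minimal sketch, assuming the artifacts stay under `./src/wasm/` (double-check the exported names against `lyra_utils_wasm.d.ts`):

```ts
// The wasm-bindgen-generated .js glue instantiates lyra_utils_wasm_bg.wasm
// itself, so no manual WebAssembly.instantiate() plumbing is needed here.
const { intersectTokenScores } = await import('./src/wasm/lyra_utils_wasm.js');

// intersectTokenScores can now replace the pure-JS implementation.
```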
The PR is working fine on Node.js via ESM and CJS. The `TextEncoder` class breaks on browsers; trying to fix it right now.
Life is pain, but this PR is suffering folks
Note: in the CI, I see warnings about `require`s in the automatically generated Wasm bindings.
When bundling "nodejs esm" modules, `wasm-bindgen` should be called with the `--target=bundler` option, as documented here.
WASM support is done on browsers and servers. Now it's time to fix the GH CI.
CI is fixed, but don't merge yet
Ok so @ShogunPanda, @jkomyno, this PR is ready to get merged.
My main concern is that it will force us to introduce a breaking change by making the `search` function async, as we need to use dynamic imports to load the correct WASM file for every runtime.
The problem is not with the breaking change per se, but with the fact that using a promise could reduce the throughput when running multiple search queries in a row.
My idea would be to either:

- Create a `searchAsync` function (or one with a similar name) that returns a promise and uses WASM underneath (see the sketch after this list)
- Move `search` to a promise-based approach to support WASM without breaking on different runtimes
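A minimal sketch of the first option; the types, the module path, and the caching strategy are all placeholders, not this PR's actual code:

```ts
// Placeholder types standing in for Lyra's real exported types.
type Lyra = Record<string, unknown>;
type SearchParams = { term: string };
type SearchResult = { hits: unknown[]; count: number };

// Cache the dynamic import so only the first search pays the loading cost;
// the path is a stand-in for whatever per-runtime resolution we end up with.
let wasmModule: Promise<any> | undefined;

export async function searchAsync(db: Lyra, params: SearchParams): Promise<SearchResult> {
  wasmModule ??= import('./wasm/lyra_utils_wasm.js');
  const { intersectTokenScores } = await wasmModule;

  // ...run the usual search pipeline, delegating the hot token-score
  // intersection step to the WASM binding...
  void intersectTokenScores;
  return { hits: [], count: 0 }; // placeholder result
}
```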
A nice fact is that, given how Lyra is built, functions are totally isolated and tree-shakable, so it's up to the user to decide whether they want to import the WASM implementation or use the default JS fallback.
Pros of making `search` async by default:

- The WASM implementation is a LOT faster (official benchmarks will come soon), so the throughput might not be a real problem IMHO

Cons of making `search` async:

- It would force the user to download a ~100 kB WASM file; not a problem on the server, but I can see why people might not like this in a browser.
I think maintaining two separate search functions could also be a problem if we don't split up the function's internals, but by doing that we would likely slow down the overall execution by introducing more context switching and function calls.
Let me know what you think folks 🙏
> The problem is not with the breaking change per se, but with the fact that using a promise could reduce the throughput when running multiple search queries in a row.
How would it reduce the throughput?
Also, my 2 cents: I'd introduce the breaking change by making `search` return a promise. However, that doesn't mean it will be 100% async: even when returning a promise, I think it will be effectively synchronous when no WASM is involved.
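A sketch of what I mean, with made-up helper names (`usesWasm`, `searchSync`, `searchWasm`): an `async` function that never awaits on the JS-only path returns an already-resolved promise, so the body itself still runs synchronously.

```ts
type Lyra = Record<string, unknown>;
type SearchParams = { term: string };
type SearchResult = { hits: unknown[] };

declare function usesWasm(db: Lyra): boolean;
declare function searchSync(db: Lyra, params: SearchParams): SearchResult;
declare function searchWasm(db: Lyra, params: SearchParams): Promise<SearchResult>;

export async function search(db: Lyra, params: SearchParams): Promise<SearchResult> {
  if (!usesWasm(db)) {
    // No await on this path: the work happens synchronously and the caller
    // merely receives an already-resolved promise.
    return searchSync(db, params);
  }
  return searchWasm(db, params);
}
```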
I'd go for the async search; I don't see why performance would be inferior. When I changed the tree I was considering proposing it, because I wanted to make `findAllWords` async. Let's benchmark the current implementation against an async one and see the difference.
@RafaelGSS
> How would it reduce the throughput?
There's a lot of literature regarding promise performance, and I am wondering whether that is a legitimate concern for Lyra; just to name a couple of great articles (there are also benchmarks in there):
- https://madelinemiller.dev/blog/javascript-promise-overhead
- https://fusebit.io/blog/promise-performance-node
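For context, the kind of per-call overhead those posts measure can be reproduced with a toy loop like this (a sketch; run it as an ESM module, and expect absolute numbers to vary a lot across Node.js versions):

```ts
// Awaiting even an already-resolved promise schedules a microtask, which is
// where the per-call overhead discussed in the linked articles comes from.
const N = 1_000_000;

const syncInc = (x: number) => x + 1;
const asyncInc = async (x: number) => x + 1;

let t = performance.now();
let acc = 0;
for (let i = 0; i < N; i++) acc = syncInc(acc);
console.log('sync :', (performance.now() - t).toFixed(1), 'ms');

t = performance.now();
acc = 0;
for (let i = 0; i < N; i++) acc = await asyncInc(acc);
console.log('async:', (performance.now() - t).toFixed(1), 'ms');
```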
But you're the expert here Rafael, I trust you 🙂
I know @ShogunPanda has a different opinion on promises though, so let's wait for him too.
One other concern I have is about DX and consistency: Lyra will expose the following fundamental functions:
- `create` (sync)
- `insert` (sync)
- `delete` (sync)
- `search` (async)
- `insertBatch` (async)
While the reason why `insertBatch` is async would be clear to the user (as stated in the docs, it avoids event-loop freezes), it's not immediately clear why we can create a new db and insert and delete data synchronously, but need to search asynchronously. I think having a consistent approach across all the "key" functions would be better.
About @marco-ippolito's comment: I think we could easily make a couple of `search` internals async as well. So yes, that's another good point for an async `search`.
That said, I'd also personally prefer making search async, these are just my concerns.
I already anticipated this to @micheleriva directly, but I'm leaving my opinion here just in case.
I think we should go with an approach similar to fastify and other packages: the `search` function might accept `intersectTokenScores` as a function argument and, based on it, decide whether it should return a promise.
On the caller side, the developer can either choose to always `await` or to detect it.
In other words, the function will become something like this (pseudocode):
```ts
function search<T>(/* */): T | Promise<T> {
  // Do something
  const intersected = intersectTokenScores(/* ... */); // Note that intersectTokenScores comes from arguments

  if (typeof intersected.then === 'function') {
    return intersected.then(sets => finalizeSearch(sets))
  } else {
    return finalizeSearch(intersected)
  }
}
```
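On the caller side, the two options mentioned above would look like this (a sketch; the declarations just mirror the pseudocode):

```ts
declare function search<T>(db: unknown, params: { term: string }): T | Promise<T>;
declare const db: unknown;

// Option 1 - detect: only touch the promise machinery when the result is
// actually asynchronous (i.e. when an async intersecter was plugged in).
const maybe = search<string[]>(db, { term: 'foo' });
const results = maybe instanceof Promise ? await maybe : maybe;

// Option 2 - always await: awaiting a plain value resolves immediately.
const results2 = await search<string[]>(db, { term: 'foo' });
```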
@ShogunPanda my main concern with this approach is that it can be confusing for some developers... here's my proposal.
I'd ship WASM as an experimental feature for now, that can be enabled in the following way:
```js
import { create } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})
```
The `optimizations` property will contain all the code optimizations that we will perform on Lyra. While in an experimental status, the `wasm` property will be `false` by default.
Given that we always pass a Lyra instance to the `search` function, we can easily detect whether the `wasm` option is set to `true` or `false`:
```js
import { create, search } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})

search(db, { term: 'foo' }) // <--- search knows if wasm interop is enabled
```
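Internally, the branch could then be as small as this (a sketch, assuming the flag lives at `db.optimizations.wasm` and that `searchJs`/`searchWasm` are hypothetical implementations):

```ts
type Lyra = { optimizations?: { wasm?: boolean } };
type SearchParams = { term: string };

declare function searchJs(db: Lyra, params: SearchParams): unknown;
declare function searchWasm(db: Lyra, params: SearchParams): Promise<unknown>;

// The Lyra instance travels with every call, so the flag chosen at create()
// time is always available at search time.
function search(db: Lyra, params: SearchParams) {
  return db.optimizations?.wasm ? searchWasm(db, params) : searchJs(db, params);
}
```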
Given that any Lyra instance is essentially an object, we can override this value with ease on demand:
```js
import { create } from '@lyrasearch/lyra'

const db = create({
  schema: { foo: 'string' },
  optimizations: {
    wasm: true
  }
})

db.optimizations.wasm = false
```
I'm not sure I like it, but might be useful for benchmarks and tests while in an experimental stage.
With that being said, we could easily change the `search` type signature from `T` to `Promise<T>` depending on the Lyra instance passed as the first argument, but I see a bad DX pattern there: by mutating the original `optimizations.wasm` property, we would make the signature uncertain and possibly non-deterministic.
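To make that concrete, the signature would need a conditional type along these lines (a sketch with made-up type names); the return type is fixed at compile time from the instance's type, so a runtime mutation of `optimizations.wasm` silently contradicts it:

```ts
type Optimizations = { wasm: boolean };
type Lyra<O extends Optimizations = Optimizations> = { optimizations: O };
type SearchResult = { hits: unknown[] };

// The return type is decided from the *static* type of the instance.
declare function search<O extends Optimizations>(
  db: Lyra<O>,
  params: { term: string }
): O['wasm'] extends true ? Promise<SearchResult> : SearchResult;

declare const db: Lyra<{ wasm: true }>;

// The type system cannot see this runtime mutation...
(db.optimizations as Optimizations).wasm = false;

// ...so `r` is still typed Promise<SearchResult>, while the call would now
// actually behave synchronously.
const r = search(db, { term: 'foo' });
```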
I'd propose then to either move on with a Promise by default or create an `asyncSearch` function to take care of this.
> I already anticipated this to @micheleriva directly, but I'm leaving my opinion here just in case.
>
> I think we should go with an approach similar to fastify and other packages: the `search` function might accept `intersectTokenScores` as a function argument and, based on it, decide whether it should return a promise. On the caller side, the developer can either choose to always `await` or to detect it.
>
> In other words, the function will become something like this (pseudocode):
>
> ```ts
> function search<T>(/* */): T | Promise<T> {
>   // Do something
>   const intersected = intersectTokenScores(/* ... */); // Note that intersectTokenScores comes from arguments
>
>   if (typeof intersected.then === 'function') {
>     return intersected.then(sets => finalizeSearch(sets))
>   } else {
>     return finalizeSearch(intersected)
>   }
> }
> ```
I'm usually not a big fan of this approach: it's kinda hard to maintain (fast-jwt `verify` 💀) and to build new components on top of, because you always have to think about whether it will return a promise or not. Most of the time you will put an `await` in front regardless, just to avoid dealing with the dual return type.
> I'd propose then to either move on with a Promise by default or create an `asyncSearch` function to take care of this.
I'd choose whether to go promise by default or `asyncSearch` based on the benchmark results.
@micheleriva @marco-ippolito I see both your points. Let's try something different.
What about exposing two different interfaces and forbidding passing optimizations to the regular one?
Something like allowing both of the following snippets to be valid:
```js
import { create, search } from '@lyrasearch/lyra'

/*
  This create does not allow for optimizations.
  We specifically validate this in JavaScript (instead of relying on TS types) so that we
  can guide the user in picking the right one.
*/
const db = create(/* ... */)
const results = search(db)
```
and
```js
import { create, search } from '@lyrasearch/lyra/async'

/*
  Technically create should not be async (yet).
  But since we are establishing the new interface we declare it as async now so
  that we don't need a breaking change later.
  Create here accepts optimizations.
*/
const db = await create(/* ... */)
const results = await search(db)
```
At the beginning of each method, in both versions, I would add a quick check to avoid mixing the APIs. This way the DX should be fine, and we will have room for future expansion. Moreover, it will encourage us to write composable code.
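The quick check could be as small as tagging each instance with the entrypoint that created it; a sketch with a made-up symbol name:

```ts
// Hidden tag recording which entrypoint created the instance.
const kAsyncApi = Symbol('lyra.asyncApi');

type Lyra = { [kAsyncApi]?: boolean };

// Called at the top of every method of the sync entrypoint
// (the async entrypoint performs the mirror-image check).
function assertSyncInstance(db: Lyra): void {
  if (db[kAsyncApi]) {
    throw new Error(
      "This instance was created via '@lyrasearch/lyra/async'; import search/insert from there as well."
    );
  }
}
```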
@ShogunPanda not a bad idea. So we'd basically alias the methods, right?
Pretty much. In the async version we allow most parts (intersecter, tokenizer, stemmer and so forth) to be pluggable and async.
@ShogunPanda I love this idea. It's like `require('fs').promises`.
> @ShogunPanda I love this idea. It's like `require('fs').promises`.
How do you know where I copied the idea from? 😁
@jkomyno I keep on getting the following error when importing the compiled JS binding:
```
1) tests/lyra.dataset.test.ts
   lyra.dataset should correctly populate the database with a large dataset

`unwrap_throw` failed:
Error: `unwrap_throw` failed
    at module.exports.__wbindgen_throw (src/wasm/artifacts/nodejs/lyra_utils_wasm.js:178:9)
    at wasm://wasm/000671c6:wasm-function[126]:0x13c5f
    at wasm://wasm/000671c6:wasm-function[8]:0x51f5
    at Object.module.exports.intersectTokenScores (src/wasm/artifacts/nodejs/lyra_utils_wasm.js:128:20)
    at intersectTokenScores (src/wasm/loader.ts:8:24)
    at search (src/lyra.ts:673:57)
    at Test.<anonymous> (tests/lyra.dataset.test.ts:71:28)
```
Here is the full CI log: https://github.com/LyraSearch/lyra/actions/runs/3628897502/jobs/6120466268#step:5:627
any idea?
Update: we decided that ALL the Lyra functions will be async. Currently working on this.
All Lyra functions are now async. WASM support will be experimental, and users will have to opt-in via the following interface:
```js
import { create } from '@lyrasearch/lyra'
import { intersectTokenScores } from '@lyrasearch/lyra/dist/esm/wasm/intersectTokenScores' // placeholder

await create({
  schema: {
    foo: 'string'
  },
  components: {
    algorithms: {
      intersectTokenScores
    }
  }
})
```
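With that interface, call sites stay uniform regardless of which implementation is plugged in; a usage sketch (`db` stands for the instance returned by the `create` call above, and the declarations are placeholders):

```ts
declare const db: unknown;
declare function search(db: unknown, params: { term: string }): Promise<{ hits: unknown[] }>;

// Whether intersectTokenScores is the WASM build or the default JS one is
// invisible here: every call is awaited the same way.
const results = await search(db, { term: 'foo' });
```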
This will provide a unified interface that will allow people to bring their own optimizations, and optionally opt-in for the built-in ones.
Gonna merge this and provide a separate build system for our WASM optimizations.
Thank you all folks, what a ride!