a more scalable full text search via SQLite

Open phiresky opened this issue 3 years ago • 1 comments

I implemented a POC for SQLite-based full text search.

You can try a demo here:

https://phiresky.github.io/replayweb.page/?source=https://phiresky.github.io/replayweb.page/examples/netpreserve-twitter-4k-minimal.wacz#view=pages&query=фізичної+юридичної

It's implemented in the service worker. The api is a GET to /w/api/c/<collectionid>/sqliteFtsSearch?matchString=foo&limit=10 which responds with a ndjson stream of rows and progress events:

This means you can start showing results as more are found, before the search is complete.

If there's an error, a {"type": "error", "message": "..."} object is returned in the stream.

Here's the changes made to the frontend (outside of this PR): https://github.com/phiresky/replayweb.page/commit/202673f6f544b2afa385efd4d4acb67a508520b3

Notes:

the script to convert a pages.jsonl to a pages-fts.sqlite3 file is in [src/sqlite-fts/create-sqlite-fts.ts].
i added the full text index for "dako-gov-ua.wacz" to the netpreserve-twitter wacz instead so I could push it to github as an example - the search works but the warc files are obviously missing.
i did not implement detection of presence of the sqlite index. right now it just assumes /pages/extraPages-fts.sqlite3 exists.
right now everything stays cached in ram, there needs to be some eviction (right now you need to evict by unloading the service worker manually)
right now I directly pass the query to the sqlite MATCH syntax. that means the following mean different things:
- "foo bar" vs foo bar
- there's more specific syntax for prefixes etc, see https://www.sqlite.org/fts5.html
- probably for actual use we want to always just search for a set of single words (not phrases)
I created two variants: a "minimal" FTS index and a "full" index. the minimal index only allows for searching for any number of words in the page content but can't filter by closeness of words, by occurence count, sort by BM25 relevance, or even display the page contents.
We'll need a larger pages.jsonl to really evaluate this. The ones you sent are too small. With the pages.jsonl from dako-gov-ua.wacz these are the stats:
- original extraPages.jsonl: 49MB
- extraPages.jsonl(zipped): 3.5MB
- 4kB pages, minimal index (search in page content but can only show id, title and url, not page content itself):
  - size of extraPages-fts.sqlite3: 2.2MB
  - search for це запити фізичної юридичної: 16 requests, 127kB
  - search for ВИДИ ЗАПИТІВ: 13 requests, 111 kB
  - search for Соціально правові after searching for ВИДИ ЗАПИТІВ: 3 requests, 20kB total. this shows that after the initial query a lot of needed data is already fetched, even when the text is completely different. for more similar queries it has to fetch even less (or nothing).
- 4kB pages, full index:
  - size of extraPages-fts.sqlite3: 30.5MB
  - link: https://phiresky.github.io/replayweb.page/?source=https://phiresky.github.io/replayweb.page/examples/netpreserve-twitter-4k-full.wacz#view=pages&query=%D1%84%D1%96%D0%B7%D0%B8%D1%87%D0%BD%D0%BE%D1%97+%D1%8E%D1%80%D0%B8%D0%B4%D0%B8%D1%87%D0%BD%D0%BE%D1%97
  - here you can actually see it dynamically loading more results because it's slower
  - search for це запити фізичної юридичної: 36 requests, 332kB
  - search for ВИДИ ЗАПИТІВ: 34 requests, 319 kB
  - search for "Соціально-правові" after searching for ВИДИ ЗАПИТІВ: 4 requests, 29kB total.
- 32kB pages, full index:
  - size of extraPages-fts.sqlite3: 30.5MB
  - search for це запити фізичної юридичної: 19 requests, 635kB
  - search for ВИДИ ЗАПИТІВ: 16 requests, 537kB
  - search for "Соціально-правові" after searching for ВИДИ ЗАПИТІВ: 4 requests, 130kB.
I had to modify the wa-sqlite library to add -DSQLITE_ENABLE_FTS5 to WASQLITE_DEFINES, that's why I included a wasm binary of that library.

Apr 10 '22 19:04 phiresky

By the way (just thought of this), if the minimal index is enough then another option for "medium-sized" archives is to simply compress the sqlite3 file, download it completely, and load it as an in-memory database. That would remove most of the complexity of this code.

The compressed sqlite3 file for dako.gov.ua is 400kB compared to 3.5MB for your extraPages.jsonl file. It can't show snippets, but it can still search with "ok" quality.

Apr 12 '22 09:04 phiresky