full-text-tabs-forever

Discussion: Semantic Search instead of Full-Text-Search

Open alexferrari88 opened this issue 1 year ago • 11 comments

Hi Ian, great project. It's in my todo list of projects to build but thank god you got this done :)

I was wondering: wouldn't it be better to use semantic search instead of full-text search?

At least this was my idea for creating a project similar to yours.

I'd be glad to give more details, if my question is not clear.

(also interested in contributing, if you want to go in this direction)

alexferrari88 avatar Dec 15 '23 08:12 alexferrari88

Hey @alexferrari88 thanks for the kind words. Agreed, semantic search would be great. I've thought about it as well, but unlike sqlite-in-wasm, I'm unaware of a solution to semantic search in JS.

Running a vector store remotely is an option but one of my goals with this project was to have it be useful as a standalone product.

What are your thoughts on how to implement semantic search?

iansinnott avatar Dec 16 '23 08:12 iansinnott

Right after submitting the issue, I started looking for a wasm vector search, since — I agree with you — it would be nicer to have this extension be sort of self-contained.

Unfortunately, the solutions are still few and far between. The best solutions I found so far:

  1. https://github.com/tantaraio/voy
  2. https://github.com/not-pizza/victor

Of the two, voy seems quite nice and there are also JS examples showing how to use it.

Curious to know your thoughts about this.

alexferrari88 avatar Dec 16 '23 15:12 alexferrari88

Awesome, thanks for the links. After a quick look I have some thoughts:

  • Voy is currently an in-memory store, which means we'd have to load everything into memory and initialize the index. This will work, but is not ideal since the amount of full-text data is unbounded and will presumably be measured in gigabytes once the user has browsed for long enough.
  • Victor looks pretty ideal in that it uses OPFS for storage in the browser. However, OPFS does not currently work in the background thread of web extensions. More details below.

I initially created this extension with WebSQL, which works for extensions using manifest v2. MV2 extensions are no longer allowed though, so while porting to MV3 I initially wanted to use OPFS and the official sqlite-wasm implementation.

I was unable to get OPFS to work in the web extension service worker. It works in browser tabs, and in normal web workers, but specifically in the background service worker that replaced background scripts in MV3 it would not work. At the time it seemed to be unintentional, i.e. a bug in the Chrome implementation. So perhaps it's now possible.

I ended up using IndexedDB as the backing filesystem via the excellent wa-sqlite implementation. That's the current state of things -- Using IndexedDB because it happens to work in service workers.

iansinnott avatar Dec 17 '23 03:12 iansinnott

thank you for looking more into it. That's unfortunate ☹️. Having a standalone extension that would take care of everything would be ideal, without the user having to install extra stuff but it seems like it is not feasible at the moment.

Ideally, one could proceed with an external (but local) vector store (e.g. Chromadb) and create a repository layer that would allow an easy swap for a wasm implementation in the future. I understand this is completely outside the scope of this extension.
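
A minimal sketch of what such a repository layer could look like, assuming a small async interface that both a local vector store and a future wasm implementation could satisfy. Every name here (`VectorStore`, `InMemoryVectorStore`, and so on) is hypothetical and not part of the extension; the brute-force class is only a stand-in for tests, not a proposal for production use.

```typescript
// Hypothetical repository layer: all names are illustrative, not from the extension.
export interface VectorRecord {
  id: string; // e.g. the URL of the indexed page
  embedding: number[]; // produced by some embedding model, not shown here
}

export interface SearchHit {
  id: string;
  score: number; // cosine similarity, higher means closer
}

export interface VectorStore {
  upsert(records: VectorRecord[]): Promise<void>;
  search(query: number[], limit: number): Promise<SearchHit[]>;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force stand-in. A class backed by a local server or a wasm library
// would implement the same interface and could be swapped in later.
export class InMemoryVectorStore implements VectorStore {
  private records = new Map<string, number[]>();

  async upsert(records: VectorRecord[]): Promise<void> {
    for (const r of records) this.records.set(r.id, r.embedding);
  }

  async search(query: number[], limit: number): Promise<SearchHit[]> {
    const hits: SearchHit[] = [];
    this.records.forEach((embedding, id) => {
      hits.push({ id, score: cosineSimilarity(query, embedding) });
    });
    return hits.sort((x, y) => y.score - x.score).slice(0, limit);
  }
}
```

Callers would depend only on `VectorStore`, so replacing the backing implementation should not touch the indexing or search UI code.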

I might fork it and start working on it but can't promise anything 😎

alexferrari88 avatar Dec 18 '23 09:12 alexferrari88

> I was unable to get OPFS to work in the web extension service worker. It works in browser tabs, and in normal web workers, but specifically in the background service worker that replaced background scripts in MV3 it would not work. At the time it seemed to be unintentional, i.e. a bug in the Chrome implementation. So perhaps it's now possible.

Technically (and pedantically) speaking, OPFS should work in any context, including service workers. The restriction is that the OPFS synchronous access handles, which make OPFS file operations fast, are only available in dedicated workers. That is a deliberate choice, not a bug: the rationale is that blocking calls should not be used anywhere else.
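
One way to check this at runtime is to test whether `createSyncAccessHandle` is exposed on `FileSystemFileHandle.prototype` in the current global scope. The sketch below assumes that the method is simply absent outside dedicated workers; the helper name `hasSyncAccessHandles` and the `ScopeLike` type are made up for illustration.

```typescript
// Minimal structural type for the part of the global scope we inspect.
interface ScopeLike {
  FileSystemFileHandle?: { prototype: object };
}

// Returns true when createSyncAccessHandle() is exposed in the given scope.
// Assumption: contexts that do not support it omit the method entirely.
export function hasSyncAccessHandles(scope: ScopeLike): boolean {
  const ctor = scope.FileSystemFileHandle;
  return ctor !== undefined && "createSyncAccessHandle" in ctor.prototype;
}

// Intended use inside a worker or service worker:
//   if (!hasSyncAccessHandles(self)) { /* fall back to IndexedDB */ }
```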

For Chrome extensions, although they are implemented as service workers, I think there is a workaround. An offscreen document can be attached to an extension, and this document can create a Worker where the entire OPFS API should be usable. Perhaps that path is worth exploring.
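
A rough sketch of that workaround, assuming the extension declares the `offscreen` permission and ships a page that spawns the worker. The `OffscreenHost` interface is a local abstraction so the example type-checks on its own: `createDocument` mirrors `chrome.offscreen.createDocument`, while `hasDocument` is an adapter the caller would supply (for example on top of `chrome.runtime.getContexts`). The file names `offscreen.html`, `db-worker.js` and `history.db` are hypothetical.

```typescript
// Local abstraction over the extension APIs used here.
interface OffscreenHost {
  hasDocument(): Promise<boolean>; // caller-supplied adapter (assumption)
  createDocument(params: {
    url: string;
    reasons: string[];
    justification: string;
  }): Promise<void>;
}

// Creates the offscreen document once; later calls are no-ops.
export async function ensureOffscreenDocument(
  host: OffscreenHost,
  url = "offscreen.html", // hypothetical page bundled with the extension
): Promise<void> {
  if (await host.hasDocument()) return;
  await host.createDocument({
    url,
    reasons: ["WORKERS"], // the document exists only to spawn a worker
    justification: "Run a dedicated worker that needs OPFS sync access handles",
  });
}

// In the script loaded by offscreen.html (assumed file name):
//   const worker = new Worker("db-worker.js");
// In db-worker.js, a dedicated worker, synchronous handles should be available:
//   const root = await navigator.storage.getDirectory();
//   const file = await root.getFileHandle("history.db", { create: true });
//   const handle = await file.createSyncAccessHandle();
```

The service worker would then talk to the offscreen document through extension messaging, which is the roundabout part of this approach.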

rhashimoto avatar Dec 19 '23 17:12 rhashimoto

Thanks for chiming in @rhashimoto. Interesting, I had looked at the offscreen document API for DOM parsing, but if it allows access to a normal worker that might be an option. A bit roundabout, but vector search for browsing history may well be worth it.

iansinnott avatar Feb 10 '24 04:02 iansinnott

There is a new, viable option: using pgvector via pglite (https://pglite.dev/extensions/#pgvector). I'm exploring this now.
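
To give an idea of what a pgvector-backed lookup might involve, here is a sketch that only builds the SQL and parameters, so it runs without any database. The table name `page_embedding`, the 384-dimension column, and the helper names are assumptions for illustration. pgvector accepts vectors in the text form `[1,2,3]`, and `<=>` is its cosine-distance operator.

```typescript
// pgvector accepts vectors written as '[1,2,3]'.
export function toVectorLiteral(embedding: number[]): string {
  if (embedding.length === 0) throw new Error("empty embedding");
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error("non-finite component");
  }
  return `[${embedding.join(",")}]`;
}

// Hypothetical schema: table name, column names and dimension are illustrative.
export const CREATE_SQL = `
  CREATE EXTENSION IF NOT EXISTS vector;
  CREATE TABLE IF NOT EXISTS page_embedding (
    url TEXT PRIMARY KEY,
    embedding vector(384)
  );`;

// Builds a nearest-neighbour query; smaller cosine distance means closer.
export function nearestPagesQuery(embedding: number[], limit: number) {
  return {
    sql:
      "SELECT url, embedding <=> $1::vector AS distance " +
      "FROM page_embedding ORDER BY distance LIMIT $2",
    params: [toVectorLiteral(embedding), limit] as [string, number],
  };
}

// Assumed usage with a PGlite instance `pg` created with the vector extension:
//   await pg.exec(CREATE_SQL);
//   const q = nearestPagesQuery(queryEmbedding, 10);
//   const result = await pg.query(q.sql, q.params);
```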

iansinnott avatar Aug 22 '24 00:08 iansinnott

Here is another much more mature semantic search library and extension, using the latest models available all in-browser (wasm or webgpu are both available): https://github.com/do-me/SemanticFinder

lrq3000 avatar Mar 15 '25 22:03 lrq3000

My intent was to move from sqlite in the browser to postgres in the browser, allowing the use of PGVector: https://pglite.dev/extensions/#pgvector.

Thanks for the link though, I'll check it out.

iansinnott avatar Mar 17 '25 05:03 iansinnott

The issue in the past was that whatever vector store we use has to support storing data outside of memory. The database will grow as you browse the web, and any in-memory store would eventually get overwhelmed by all the data.
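
A back-of-the-envelope estimate shows how quickly this adds up. The figures below (100,000 pages, 10 text chunks per page, 384-dimensional float32 embeddings) are illustrative assumptions, not measurements from the extension, and the calculation ignores index overhead and metadata.

```typescript
// Rough size of raw float32 embeddings, ignoring index overhead and metadata.
export function estimateEmbeddingBytes(
  pages: number,
  chunksPerPage: number,
  dimensions: number,
): number {
  const BYTES_PER_FLOAT32 = 4;
  return pages * chunksPerPage * dimensions * BYTES_PER_FLOAT32;
}

// Assumed workload: 100,000 pages, 10 chunks each, 384 dimensions.
const bytes = estimateEmbeddingBytes(100_000, 10, 384);
console.log(`${(bytes / 1024 ** 3).toFixed(2)} GiB`);
```

Under these assumptions the raw vectors alone come to roughly 1.43 GiB, which would all have to be resident for an in-memory store.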

iansinnott avatar Mar 17 '25 05:03 iansinnott

OK, I see how PGVector can make this possible, thanks a lot!

FYI there is another extension that implemented semantic search for bookmarks: https://github.com/oto-labs/librarian

BTW, they mentioned here that they would be interested in implementing support for history but ran into issues with optimization, so maybe you and their team could join forces?

lrq3000 avatar Mar 17 '25 18:03 lrq3000