full-text-tabs-forever
Discussion: Semantic Search instead of Full-Text-Search
Hi Ian, great project. It's on my todo list of projects to build, but thank god you got this done :)
I was wondering: wouldn't it be better to use semantic search instead of full text search?
At least this was my idea for creating a project similar to yours.
I'd be glad to give more details, if my question is not clear.
(also interested in contributing, if you want to go in this direction)
Hey @alexferrari88 thanks for the kind words. Agreed, semantic search would be great. I've thought about it as well, but unlike sqlite-in-wasm, I'm unaware of a solution for semantic search in JS.
Running a vector store remotely is an option but one of my goals with this project was to have it be useful as a standalone product.
What are your thoughts on how to implement semantic search?
Right after submitting the issue, I started looking for a wasm vector search, since — I agree with you — it would be nicer to have this extension be sort of self-contained.
Unfortunately, the solutions are still few and far between. The best I've found so far are Voy and Victor.
Of the two, voy seems quite nice and there are also JS examples showing how to use it.
Curious to know your thoughts about this.
Awesome, thanks for the links. After a quick look I have some thoughts:
- Voy is currently an in-memory store, which means we'd have to load everything into memory and initialize the index. This would work, but it's not ideal since the amount of full-text data is unbounded and will presumably be measured in gigabytes once the user has browsed for long enough (rough numbers sketched below).
- Victor looks pretty ideal in that it uses OPFS for storage in the browser. However, OPFS does not currently work in the background thread of web extensions. More details below.
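As a rough illustration of why an in-memory index worries me (the dimensions and counts here are assumptions for the sake of argument, not measurements):

```ts
// Back-of-envelope estimate; every number here is an assumption for illustration.
const EMBEDDING_DIM = 384;      // e.g. a small sentence-embedding model
const BYTES_PER_FLOAT = 4;      // float32 components
const CHUNKS = 1_000_000;       // text chunks accumulated from browsing history

const rawBytes = EMBEDDING_DIM * BYTES_PER_FLOAT * CHUNKS;
console.log(`${(rawBytes / 1e9).toFixed(1)} GB of raw vectors`); // ~1.5 GB, before any index overhead
```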
I originally created this extension with WebSQL, which works for extensions using manifest v2. MV2 extensions are no longer allowed though, so while porting to MV3 I first wanted to use OPFS and the official sqlite-wasm implementation.
I was unable to get OPFS to work in the web extension service worker. It works in browser tabs, and in normal web workers, but specifically in the background service worker that replaced background scripts in MV3 it would not work. At the time it seemed to be unintentional, i.e. a bug in the Chrome implementation. So perhaps it's now possible.
I ended up using IndexedDB as the backing filesystem via the excellent wa-sqlite implementation. That's the current state of things: using IndexedDB because it happens to work in service workers.
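For reference, the shape of that setup is roughly the following (import paths and the VFS registration call vary between wa-sqlite versions, so treat this as a sketch rather than the extension's exact code):

```ts
// Sketch: open an IndexedDB-backed SQLite database from the MV3 service worker.
// Exact import paths and VFS constructor arguments differ across wa-sqlite versions.
import SQLiteESMFactory from 'wa-sqlite/dist/wa-sqlite-async.mjs';
import * as SQLite from 'wa-sqlite';
import { IDBBatchAtomicVFS } from 'wa-sqlite/src/examples/IDBBatchAtomicVFS.js';

const module = await SQLiteESMFactory();
const sqlite3 = SQLite.Factory(module);

// Register a VFS that persists database pages to IndexedDB, then open through it.
sqlite3.vfs_register(new IDBBatchAtomicVFS('fttf-idb'), true);
const db = await sqlite3.open_v2('fttf');

// Full-text search works as usual on top of this.
await sqlite3.exec(db, `CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, title, body)`);
```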
Thank you for looking more into it. That's unfortunate ☹️. A standalone extension that takes care of everything, without the user having to install extra stuff, would be ideal, but it seems that isn't feasible at the moment.
Ideally, one could proceed with an external (but local) vector store (e.g. Chromadb) and create a repository layer that would allow an easy swap for a wasm implementation in the future. I understand this is completely outside the scope of this extension.
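Just to sketch the idea (the interface and names here are hypothetical, purely for illustration):

```ts
// Hypothetical repository layer: the extension talks only to this interface,
// so the backing store (a local ChromaDB server today, a wasm store later)
// can be swapped without touching callers.
interface VectorRepository {
  upsert(id: string, embedding: number[], metadata: { url: string; title: string }): Promise<void>;
  query(embedding: number[], k: number): Promise<Array<{ id: string; score: number }>>;
}
```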
I might fork it and start working on it but can't promise anything 😎
> I was unable to get OPFS to work in the web extension service worker. It works in browser tabs, and in normal web workers, but specifically in the background service worker that replaced background scripts in MV3 it would not work. At the time it seemed to be unintentional, i.e. a bug in the Chrome implementation. So perhaps it's now possible.
Technically (and pedantically) speaking, OPFS should work in any context, including service workers. The restriction is that the synchronous file access handles, which are what make OPFS file operations fast, are only available in dedicated workers. That is a deliberate choice, not a bug: the rationale is that blocking calls should not be used anywhere else.
For Chrome extensions, although they are implemented as service workers, I think there is a workaround. An offscreen document can be attached to an extension, and this document can create a Worker where the entire OPFS API should be usable. Perhaps that path is worth exploring.
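Roughly, the plumbing could look like the following (the file names, reason string, and wiring here are placeholders, not tested code):

```ts
// background.ts (MV3 service worker): create the offscreen document once.
await chrome.offscreen.createDocument({
  url: 'offscreen.html',
  reasons: ['WORKERS'],
  justification: 'Spawn a dedicated worker that can use OPFS synchronous access handles',
});

// offscreen.ts (loaded by offscreen.html): spawn a dedicated worker.
const worker = new Worker('opfs-worker.js');

// opfs-worker.ts (dedicated worker): the full OPFS API, including the fast
// synchronous access handles, is available in this context.
const root = await navigator.storage.getDirectory();
const fileHandle = await root.getFileHandle('db.bin', { create: true });
const access = await fileHandle.createSyncAccessHandle();
access.write(new Uint8Array([1, 2, 3]), { at: 0 });
access.flush();
access.close();
```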
Thanks for chiming in @rhashimoto. Interesting, I had looked at the offscreen document API for DOM parsing, but if it allows access to a normal worker that might be an option. A bit roundabout, but vector search for browsing history may well be worth it.
There is a new, viable option: using pgvector via pglite (https://pglite.dev/extensions/#pgvector). I'm exploring this now.
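If it works out, the setup could be as small as the following (adapted from the pglite docs; the schema and names are illustrative, not the extension's actual code):

```ts
// Sketch of IndexedDB-persisted Postgres with pgvector, following the pglite docs.
import { PGlite } from '@electric-sql/pglite';
import { vector } from '@electric-sql/pglite/vector';

// 'idb://...' persists the database to IndexedDB instead of keeping it in memory.
const pg = new PGlite('idb://fttf', { extensions: { vector } });

await pg.exec(`
  CREATE EXTENSION IF NOT EXISTS vector;
  CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    url text,
    content text,
    embedding vector(384)
  );
`);
```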
Here is another, much more mature semantic search library and extension that runs the latest models entirely in-browser (both wasm and webgpu backends are available): https://github.com/do-me/SemanticFinder
My intent was to move from sqlite in the browser to postgres in the browser, allowing the use of PGVector: https://pglite.dev/extensions/#pgvector.
Thanks for the link though, i'll check it out.
The issue in the past was that whatever vector store we use has to support storing data outside of memory. The database will grow as you browse the web, and any in-memory store would eventually be overwhelmed by all the data.
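With pgvector that requirement is handled by Postgres itself: a search just orders by vector distance and the data stays on disk. Roughly, reusing the `pg` instance and `chunks` table from the sketch above (`embed()` is a hypothetical stand-in for whatever in-browser model produces the vectors):

```ts
// Illustrative query: '<=>' is pgvector's cosine distance operator.
const queryEmbedding: number[] = await embed('pages about OPFS sync access handles');

const { rows } = await pg.query(
  `SELECT url, content, embedding <=> $1 AS distance
     FROM chunks
     ORDER BY distance
     LIMIT 10`,
  [JSON.stringify(queryEmbedding)],
);
```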
Ok, I see how PGVector would allow implementing this, thanks a lot!
FYI there is another extension that implemented semantic search for bookmarks: https://github.com/oto-labs/librarian
BTW, they mentioned here that they would be interested in implementing support for history but ran into optimization issues, so maybe you and their team could join forces?