Meta: Full text search support
Introduction
As users post messages and the room history grows over time, manually scrolling to find a past conversation becomes inefficient and frustrating.
In today's world, users expect to be able to find past conversations easily and quickly. Adding full text search would greatly improve usability by allowing users to search across messages with filters like sender, date ranges, and on a per-room basis.
Element Desktop has supported full-text search for quite a while. It uses Seshat, which combines a SQLite database with a full-text search index (backed by Tantivy) and provides a simple API to feed events into the database and index, and to search and retrieve them.
Plan
The Rust SDK already contains a persistent store for the events we encounter. That, together with Seshat being stuck on a quite old version of Tantivy, makes adopting Seshat directly infeasible.
Nevertheless, some parts of Seshat will be useful; for example, it contains a storage backend for Tantivy that encrypts the index.
This issue lays out the tasks that would be necessary to bring full-text search support into the Rust SDK.
Because we can't adopt Seshat directly, we will begin with the creation of a new crate in this very repository. After that we can add a simple API to index events and search the index.
Once a functioning API exists, we can experiment with the search functionality using benchmarks, small test clients and finally multiverse.
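To make the shape of such an API a bit more concrete, here is a minimal sketch of indexing events and searching them. It is not the matrix-sdk-search API and it does not use Tantivy: `ToyIndex`, `IndexedEvent`, `add_event`, and `search` are hypothetical names, and the index is a plain in-memory inverted index built with the standard library only.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a message event; not a type from the SDK.
pub struct IndexedEvent {
    pub event_id: String,
    pub body: String,
}

/// Toy in-memory inverted index: token -> ids of the events containing it.
#[derive(Default)]
pub struct ToyIndex {
    postings: HashMap<String, BTreeSet<String>>,
}

/// Lowercase the text and split it on anything that is not alphanumeric.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .collect()
}

impl ToyIndex {
    /// Feed one event into the index.
    pub fn add_event(&mut self, event: &IndexedEvent) {
        for token in tokenize(&event.body) {
            self.postings
                .entry(token)
                .or_default()
                .insert(event.event_id.clone());
        }
    }

    /// Return the ids (sorted) of events whose body contains every query term.
    pub fn search(&self, query: &str) -> Vec<String> {
        let mut result: Option<BTreeSet<String>> = None;
        for token in tokenize(query) {
            let ids = self.postings.get(&token).cloned().unwrap_or_default();
            result = Some(match result {
                None => ids,
                Some(acc) => acc.intersection(&ids).cloned().collect(),
            });
        }
        result.unwrap_or_default().into_iter().collect()
    }
}
```

A real implementation would delegate tokenization, ranking, and persistence to the search library; the sketch only illustrates the "add events, then query" flow that the benchmarks and test clients would exercise.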
MVP tasks for internship
The following list of tasks lays out the rough plan:
- [x] Create a new crate called matrix-sdk-search.
- [x] Implement a basic API to add events to an index using Tantivy.
- [x] Add support to search for events.
- [x] Extend multiverse so we can test out the search API and functionality.
- [x] Utilize the matrix-sdk-search crate in the EventCache: every time the EventCache inserts an event, index the event.
- [x] Add support for edits and redactions
- [x] Test out and benchmark if a per-room index is better than a global index.
- [x] Pagination
- [x] Implement room spidering so we index the whole room history.
- [x] Fix search on first viewing
- [x] Support encrypted index.
- [x] Update README with information and code examples.
- [x] Record a demo of search in Multiverse.
Nice to have
- [ ] Global search.
- [ ] Background spidering
- [x] Try bulk operations to reduce index commits.
- [ ] Make index writes/commits non-blocking.
- [ ] Add SearchQueryBuilder
- [ ] Add support for a per-room language setting, the setting should be part of the room state. MSC4334 has an unstable ruma impl
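The bulk-operations item in this list can be illustrated with a small sketch: buffer incoming documents and commit once per batch instead of once per event. `BatchingWriter` and its methods are hypothetical names, not part of matrix-sdk-search or Tantivy, and the "commit" here only moves buffered documents into a visible list and counts how often that happens. With a real index, the commit is the expensive step that batching tries to amortize.

```rust
/// Hypothetical batching wrapper: buffers documents and commits once per batch.
pub struct BatchingWriter {
    pending: Vec<String>,
    committed: Vec<String>,
    batch_size: usize,
    commits: usize,
}

impl BatchingWriter {
    /// Create a writer that commits after `batch_size` buffered documents.
    /// A batch size of zero is treated as one.
    pub fn new(batch_size: usize) -> Self {
        Self {
            pending: Vec::new(),
            committed: Vec::new(),
            batch_size: batch_size.max(1),
            commits: 0,
        }
    }

    /// Buffer a document; commit automatically when the batch is full.
    pub fn add(&mut self, doc: String) {
        self.pending.push(doc);
        if self.pending.len() >= self.batch_size {
            self.commit();
        }
    }

    /// Make all buffered documents visible. Does nothing if the buffer is empty.
    pub fn commit(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        self.committed.append(&mut self.pending);
        self.commits += 1;
    }

    /// Number of commits performed so far.
    pub fn commits(&self) -> usize {
        self.commits
    }

    /// Documents that have been committed and would be searchable.
    pub fn searchable(&self) -> &[String] {
        &self.committed
    }
}
```

The trade-off is that documents still sitting in the buffer are not searchable until the next commit, so a caller would typically issue a final explicit `commit()` after a burst of inserts.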
Implementation notes
- [x] Redaction must remove the redacted event from the index.
- [ ] Ignoring a user must remove all their sent events from the index.
- [ ] Banning a user could preemptively remove all their sent events from the index.
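The removal rules above can be sketched with a toy index keyed by event id. This is an illustration only, using the standard library: `RoomIndex`, `redact`, and `purge_sender` are hypothetical names and do not reflect how matrix-sdk-search or Tantivy implement deletion.

```rust
use std::collections::HashMap;

/// Hypothetical record kept per indexed event.
struct StoredEvent {
    sender: String,
    body: String,
}

/// Toy per-room index keyed by event id.
#[derive(Default)]
pub struct RoomIndex {
    events: HashMap<String, StoredEvent>,
}

impl RoomIndex {
    /// Add (or overwrite) an event in the index.
    pub fn insert(&mut self, event_id: &str, sender: &str, body: &str) {
        self.events.insert(
            event_id.to_owned(),
            StoredEvent { sender: sender.to_owned(), body: body.to_owned() },
        );
    }

    /// Redaction: drop the redacted event. Returns whether it was indexed.
    pub fn redact(&mut self, event_id: &str) -> bool {
        self.events.remove(event_id).is_some()
    }

    /// Ignore (or ban): drop every event sent by `sender`.
    /// Returns how many events were removed.
    pub fn purge_sender(&mut self, sender: &str) -> usize {
        let before = self.events.len();
        self.events.retain(|_, event| event.sender != sender);
        before - self.events.len()
    }

    /// Case-insensitive substring search; returns matching event ids, sorted.
    pub fn search(&self, term: &str) -> Vec<String> {
        let needle = term.to_lowercase();
        let mut ids: Vec<String> = self
            .events
            .iter()
            .filter(|(_, event)| event.body.to_lowercase().contains(needle.as_str()))
            .map(|(id, _)| id.clone())
            .collect();
        ids.sort();
        ids
    }
}
```

Purging by sender implies that the sender is stored alongside each indexed event, which is also what a sender filter in search would need.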
Notes for the future
- [ ] Indexes are stored in <index path>/<room id>/, but given that different clients could exist, this would have to be changed to <index path>/<user id>/<room id>/ so that each <index path>/<client id> can be encrypted independently.
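As a rough illustration of the proposed layout, the per-user, per-room directory could be derived as below. `room_index_dir` is a hypothetical helper, not an existing SDK function, and it joins the raw IDs as path components. That is an assumption: a real implementation would probably need to encode the IDs first, since characters such as `:` are not valid in file names on every platform.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: build `<index path>/<user id>/<room id>/`.
///
/// Assumption: the IDs are used verbatim as directory names. Matrix user IDs
/// start with `@` and room IDs with `!`, so neither is treated as an absolute
/// path by `Path::join`.
pub fn room_index_dir(index_root: &Path, user_id: &str, room_id: &str) -> PathBuf {
    index_root.join(user_id).join(room_id)
}
```

With this layout, two users sharing the same index root get separate directories for the same room, so each user's subtree can be handled (and encrypted) on its own.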
Please consider adding CJK full text search support (in encrypted/non-encrypted rooms).
As discussed in a team sync, I moved background spidering to the "Nice to have" section for now. For the first version, we are fine with the UI blocking while the history is indexed when logging in to Multiverse.
We want it for the final feature, but it requires a bit more thinking and product input on the how (for example, how to prioritise between rooms, whether to prioritise by recency, etc.) and the when (for example, whether we should run on a mobile connection, etc.).
We also have a problem on the tech side: we can have only a single instance of a room timeline at a time, so the background pagination for spidering may conflict with the back-pagination of the displayed timeline.