Lack of i18n support for full-text search in encrypted rooms
Steps to reproduce
- Change the language to English
- Click Settings -> Security and Privacy -> Message search -> Disable
- Click Settings -> Security and Privacy -> Message search -> Enable
- Create a new room and post a Japanese message
この部屋は全文検索用のテスト.
- Wait until all the rooms are indexed in Settings -> Security and Privacy -> Message search.
- Search for 部屋 in the room.
Outcome
What did you expect?
The word 部屋 is found.
What happened instead?
No result found.
My thoughts
This appears to stem from the lack of i18n support for full-text search in encrypted rooms. The full-text search is backed by Matrix-Seshat. I looked into its source code and found that a custom tokenizer is not supported for Japanese (or many other Asian languages). https://github.com/matrix-org/seshat/blob/main/src/config.rs#L171
Furthermore, the language is not passed from Element Desktop to Matrix-Seshat: https://github.com/element-hq/element-desktop/blob/develop/src/seshat.ts#L112
If my understanding is correct, there are two problems:
- No tokenizer for Asian languages
- Correct language is not passed to Matrix-Seshat
I use multiple languages (English and Japanese) in many rooms, so the language may differ from message to message, not just from room to room. To support this case, any of the following options seems reasonable:
- For each event/message, we detect the language and use the corresponding tokenizer.
- Use an NN-based multilingual tokenizer
- Use an advanced full-text search library with CJK support in JS, e.g., FlexSearch
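To make the first option concrete, here is a minimal sketch of per-message dispatch. It is only an illustration: `detectScript`, `tokenizers`, and `tokenize` are hypothetical names, not part of Seshat or Element, and the check detects the script rather than the language (Han-only text could be Chinese or Japanese).

```javascript
// Hypothetical helper: guess the dominant script of a message from its characters.
// Assumption: script detection is a good enough proxy for picking a tokenizer.
const SCRIPT_PATTERNS = {
    kana: /[\p{Script=Hiragana}\p{Script=Katakana}]/u,
    hangul: /\p{Script=Hangul}/u,
    han: /\p{Script=Han}/u,
};

function detectScript(text) {
    // Kana is checked first because Japanese text mixes kana with Han characters.
    if (SCRIPT_PATTERNS.kana.test(text)) return "japanese";
    if (SCRIPT_PATTERNS.hangul.test(text)) return "korean";
    if (SCRIPT_PATTERNS.han.test(text)) return "han";
    return "default";
}

// Hypothetical tokenizer table: whitespace splitting for space-delimited text,
// one token per character as a crude fallback for CJK text.
const tokenizers = {
    default: (text) => text.toLowerCase().split(/\s+/).filter(Boolean),
    cjk: (text) => Array.from(text).filter((ch) => /\S/u.test(ch)),
};

function tokenize(text) {
    const script = detectScript(text);
    const tokenizer = script === "default" ? tokenizers.default : tokenizers.cjk;
    return { script, tokens: tokenizer(text) };
}

console.log(tokenize("test用の部屋だよ"));
console.log(tokenize("cute cats"));
```

A real implementation would replace the per-character fallback with a proper tokenizer for each detected script.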
I am a computational physicist & scientific programmer, but I am not familiar with this kind of business. Any help or suggestions would be greatly appreciated.
Operating system
Mac
Application version
Element Desktop 1.11.101
How did you install the app?
From the official website
Homeserver
matrix.org
Will you send logs?
No
I have played around with FlexSearch. The results look OK.
const { Index } = require("flexsearch");

const index = new Index({
    tokenize: "full", // Index every substring, so text without spaces (CJK) can be matched
    encode: false, // No normalization (safe for mixed languages)
});

// some test data
const data = [
    'cute cats',
    'cats abcd efgh ijkl mnop qrst uvwx cute',
    '日本語は難しいよね', // Japanese (Japanese is difficult)
    'test用の部屋だよ', // Japanese + English (This is a test room)
    '안녕하세요 반갑습니다', // Korean (Hello, nice to meet you)
    '한국어 공부하기', // Korean (Studying Korean)
    '你好,世界', // Chinese (Hello, world)
    '学习中文', // Chinese (Learning Chinese)
];

// add data to the index, using the array position as the document id
data.forEach((item, id) => {
    index.add(id, item);
});

// perform query
const searchTerms = ["部屋", "日本語", "cats", "難", "안녕", "你好", "学习"];
searchTerms.forEach(term => {
    console.log(`\nSearching for: "${term}"`);
    const result = index.search(term);
    console.log("Results:");
    result.forEach(i => {
        const text = data[i];
        // position of the first occurrence of the term in the matched text
        const position = text.indexOf(term);
        console.log(`[${i}] "${text}" (position: ${position})`);
    });
});
Searching for: "部屋"
Results:
[3] "test用の部屋だよ" (position: 6)
Searching for: "日本語"
Results:
[2] "日本語は難しいよね" (position: 0)
Searching for: "cats"
Results:
[1] "cats abcd efgh ijkl mnop qrst uvwx cute" (position: 0)
[0] "cute cats" (position: 5)
Searching for: "難"
Results:
[2] "日本語は難しいよね" (position: 4)
Searching for: "안녕"
Results:
[4] "안녕하세요 반갑습니다" (position: 0)
Searching for: "你好"
Results:
[6] "你好,世界" (position: 0)
Searching for: "学习"
Results:
[7] "学习中文" (position: 0)
I have now found a related Seshat issue, which has been open for six years:
https://github.com/matrix-org/seshat/issues/7
Can we use a third-party full-text search JS library and save the index in a secure way?
It is a shame that even a simple text search does not work for Asian languages, which makes Element almost useless. I would be happy to contribute to the development or fix it. Please, anyone, review my report.
Seshat 4.0.0 explicitly removed support for Japanese (https://github.com/matrix-org/seshat/blob/main/CHANGELOG.md#400---2024-06-07), so the fix would need to be on that side of things.
@t3chguy
Thank you for the information. It seems that the removal of Japanese support was due to technical reasons. However, I’m not sure if simply reintroducing support would be sufficient for the broader Matrix community.
I’ve read the discussion on the other side (though I’m not sure why that issue page is currently inactive).
As far as I can see, we have two possible directions:
- Introduce language-specific tokenizers for Asian languages, or
- Switch to a search engine based on N-grams.
The second option has the advantage of not requiring language-specific tokenizers. (I’m not a specialist in search engine design, but I assume this may increase memory usage?)
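To illustrate the N-gram direction, here is a minimal sketch of a character-bigram inverted index. All names (`bigrams`, `buildIndex`, `search`) are hypothetical, and this is not how Seshat works internally. It only shows that no language-specific tokenizer is needed, and that the index grows with the number of distinct bigrams, which is where the extra memory would come from.

```javascript
// Split text into overlapping character bigrams, ignoring whitespace.
function bigrams(text) {
    const chars = Array.from(text.replace(/\s+/g, ""));
    const grams = [];
    for (let i = 0; i < chars.length - 1; i++) {
        grams.push(chars[i] + chars[i + 1]);
    }
    return grams;
}

// Inverted index: bigram -> set of document ids.
function buildIndex(docs) {
    const index = new Map();
    docs.forEach((doc, id) => {
        for (const gram of bigrams(doc)) {
            if (!index.has(gram)) index.set(gram, new Set());
            index.get(gram).add(id);
        }
    });
    return index;
}

// A document is a candidate if it contains every bigram of the query.
// The final substring check drops candidates where the bigrams are not contiguous.
function search(index, docs, query) {
    const grams = bigrams(query);
    if (grams.length === 0) {
        // Queries shorter than two characters have no bigram: fall back to a linear scan.
        return docs.flatMap((doc, id) => (query && doc.includes(query) ? [id] : []));
    }
    let candidates = null;
    for (const gram of grams) {
        const ids = index.get(gram) ?? new Set();
        candidates = candidates === null
            ? new Set(ids)
            : new Set([...candidates].filter((id) => ids.has(id)));
    }
    return [...candidates].filter((id) => docs[id].includes(query)).sort((a, b) => a - b);
}

const docs = ["cute cats", "日本語は難しいよね", "test用の部屋だよ"];
const index = buildIndex(docs);
console.log(search(index, docs, "部屋"));
console.log("distinct bigrams:", index.size);
```

Single-character queries such as 難 fall outside the bigram index, so a real engine would need unigrams as well or a fallback like the linear scan above.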
We could implement these improvements either by extending Matrix-Seshat, or by switching to a JavaScript-based search engine altogether.
Would it make sense to open a new issue on the Matrix-Seshat side? Or do you think it’s better to continue the discussion here, given that this ties into the broader design of Element?
> We could implement these improvements either by extending Matrix-Seshat, or by switching to a JavaScript-based search engine altogether.
One of the requirements is that the search index is encrypted at rest, which may make this approach a little harder.
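For context, the at-rest requirement could be approximated in a JS-based engine along the following lines. This is a sketch under assumptions, not Seshat's actual scheme: `encryptIndex` and `decryptIndex` are hypothetical helpers, the JSON string stands in for whatever a search library would export, and the key is generated on the spot instead of being loaded from a secure store.

```javascript
const crypto = require("crypto");

// Encrypt a serialized index with AES-256-GCM. `key` must be 32 bytes.
// Output layout: 12-byte IV | 16-byte auth tag | ciphertext.
function encryptIndex(serialized, key) {
    const iv = crypto.randomBytes(12);
    const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
    const ciphertext = Buffer.concat([cipher.update(serialized, "utf8"), cipher.final()]);
    return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
}

// Decrypt and authenticate a blob produced by encryptIndex.
// Throws if the blob was modified or the key is wrong.
function decryptIndex(blob, key) {
    const iv = blob.subarray(0, 12);
    const tag = blob.subarray(12, 28);
    const ciphertext = blob.subarray(28);
    const decipher = crypto.createDecipheriv("aes-256-gcm", key, iv);
    decipher.setAuthTag(tag);
    return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

// Assumption: a real app would load the key from a secure store, not generate it per run.
const key = crypto.randomBytes(32);
const serialized = JSON.stringify({ "部屋": [3], "日本": [2] });
const blob = encryptIndex(serialized, key);
console.log(decryptIndex(blob, key) === serialized);
```

Encrypting the index as one blob means the whole index has to be decrypted into memory before any search, which may be one reason this approach is harder than it first looks.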
We're not looking at switching away from Seshat at this moment in time, but improvements to it are always welcome. In theory the Element X mobile apps (via the Rust SDK) may also begin using Seshat.
> the Element X mobile apps (via the Rust SDK) may also begin using Seshat
That makes sense. In that case, do you think search based on morphological analysis would be preferable, or would a simple N-gram-based approach be sufficient?
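To compare the two approaches on the test message, here is a rough sketch. It uses the built-in `Intl.Segmenter` (dictionary-based word segmentation from ICU) as a stand-in for a real morphological analyzer, which it is not; `wordTokens` and `bigramTokens` are hypothetical helper names. The exact word boundaries depend on the ICU data shipped with the runtime.

```javascript
// Word-level tokens via Intl.Segmenter (Node 16+). This is only a stand-in for
// morphological analysis: it finds word boundaries but gives no part-of-speech
// or base-form information.
function wordTokens(text, locale = "ja") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    return Array.from(segmenter.segment(text))
        .filter((s) => s.isWordLike) // drop spaces and punctuation
        .map((s) => s.segment);
}

// Overlapping character bigrams, ignoring whitespace.
function bigramTokens(text) {
    const chars = Array.from(text.replace(/\s+/g, ""));
    return chars.slice(0, -1).map((ch, i) => ch + chars[i + 1]);
}

const message = "この部屋は全文検索用のテスト";
console.log("words:  ", wordTokens(message));
console.log("bigrams:", bigramTokens(message));
```

Word-level tokens keep the index small but only match queries that line up with the segmenter's boundaries, while bigrams match any substring of two or more characters at the cost of more index entries.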
That'd be a question for the Seshat project or element-meta
Thank you for the suggestion. Let me open an issue in element-meta.
We need a PR to fix this.
@t3chguy
My collaborators and I are considering creating a patch or pull request (PR) for Seshat. Before we proceed, we’d like to know who is responsible for reviewing PRs, because we want to avoid spending time on a contribution that may not be reviewed. The discussion on element-meta has not been active.
@shinaoka I suggest asking in the Matrix room mentioned in its README.