
Lack of i18n support for full-text search in encrypted rooms

Open shinaoka opened this issue 7 months ago • 12 comments

Steps to reproduce

  1. Change the language to English
  2. Click Settings -> Security and Privacy -> Message search -> Disable
  3. Click Settings -> Security and Privacy -> Message search -> Enable
  4. Create a new room and post a Japanese message この部屋は全文検索用のテスト.
  5. Wait until all the rooms are indexed in Settings -> Security and Privacy -> Message search.
  6. Search 部屋 in the room.

Outcome

What did you expect?

The word 部屋 is found.

What happened instead?

No result found.

My thoughts

This appears to stem from the lack of i18n support for full-text search in encrypted rooms. The full-text search is backed by Matrix-Seshat. I looked into its source code and found that a custom tokenizer is not supported for Japanese (or many other Asian languages). https://github.com/matrix-org/seshat/blob/main/src/config.rs#L171

Furthermore, the language is not passed from Element desktop to Matrix-seshat: https://github.com/element-hq/element-desktop/blob/develop/src/seshat.ts#L112

If my understanding is correct, there are two problems:

  • No tokenizer for Asian languages
  • Correct language is not passed to Matrix-Seshat

I use multiple languages (English and Japanese) in many rooms, so the language may differ from message to message and from room to room. To support this case, one of the following options seems reasonable:

  • For each event/message, we detect the language and use the corresponding tokenizer.
  • Use an NN-based multilingual tokenizer
  • Use an advanced full-text search library with CJK support in JS, e.g., FlexSearch
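
As a rough illustration of the first option, per-message detection could start from the Unicode scripts present in the text. This is a minimal sketch only: `detectScripts`, `chooseTokenizer`, and the tokenizer names are hypothetical and are not part of Element or Seshat. A Japanese message written only in kanji would be routed to "chinese" here, which is one reason script detection alone is not sufficient.

```javascript
// Hypothetical helper: report which writing systems appear in a message.
// Uses Unicode property escapes, which require the `u` flag.
const SCRIPT_PATTERNS = {
    han: /\p{Script=Han}/u,
    kana: /[\p{Script=Hiragana}\p{Script=Katakana}]/u,
    hangul: /\p{Script=Hangul}/u,
    latin: /\p{Script=Latin}/u,
};

function detectScripts(text) {
    return Object.keys(SCRIPT_PATTERNS).filter((name) => SCRIPT_PATTERNS[name].test(text));
}

// Hypothetical routing: pick a tokenizer name from the detected scripts.
// Kana is checked first because Japanese text usually mixes kana and Han.
function chooseTokenizer(text) {
    const scripts = detectScripts(text);
    if (scripts.includes("kana")) return "japanese";
    if (scripts.includes("hangul")) return "korean";
    if (scripts.includes("han")) return "chinese";
    return "whitespace";
}

console.log(chooseTokenizer("この部屋は全文検索用のテスト"));
console.log(detectScripts("test用の部屋だよ"));
```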

I am a computational physicist & scientific programmer, but I am not familiar with this kind of business. Any help or suggestions would be greatly appreciated.

Operating system

Mac

Application version

Element Desktop 1.11.101

How did you install the app?

From the official website

Homeserver

matrix.org

Will you send logs?

No

shinaoka avatar Jun 02 '25 20:06 shinaoka

I have played around with flexsearch. The results look OK.

const { Index } = require("flexsearch");

const index = new Index({
  tokenize: "full",          // Index every substring of each term, so CJK text without spaces can match
  encode: false,             // No normalization (safe for mixed languages)
});

// some test data
const data = [
    'cute cats',
    'cats abcd efgh ijkl mnop qrst uvwx cute',
    '日本語は難しいよね', // Japanese (Japanese is difficult)
    'test用の部屋だよ', // Japanese + English (This is a test room)
    '안녕하세요 반갑습니다', // Korean (Hello, nice to meet you)
    '한국어 공부하기', // Korean (Studying Korean)
    '你好,世界', // Chinese (Hello, world)
    '学习中文', // Chinese (Learning Chinese)
];

// add data to the index
data.forEach((item, id) => {
    index.add(id, item);
});

// perform query
const searchTerms = ["部屋", "日本語", "cats", "難", "안녕", "你好", "学习"];

searchTerms.forEach(term => {
    console.log(`\nSearching for: "${term}"`);
    const result = index.search(term);
    console.log("Results:");
    result.forEach(i => {
        const text = data[i];
        const position = text.indexOf(term);
        console.log(`[${i}] "${text}" (position: ${position})`);
    });
});

Output:

Searching for: "部屋"
Results:
[3] "test用の部屋だよ" (position: 6)

Searching for: "日本語"
Results:
[2] "日本語は難しいよね" (position: 0)

Searching for: "cats"
Results:
[1] "cats abcd efgh ijkl mnop qrst uvwx cute" (position: 0)
[0] "cute cats" (position: 5)

Searching for: "難"
Results:
[2] "日本語は難しいよね" (position: 4)

Searching for: "안녕"
Results:
[4] "안녕하세요 반갑습니다" (position: 0)

Searching for: "你好"
Results:
[6] "你好,世界" (position: 0)

Searching for: "学习"
Results:
[7] "学习中文" (position: 0)

shinaoka avatar Jun 02 '25 21:06 shinaoka

I have now found a related Seshat issue, which has been open for six years:

https://github.com/matrix-org/seshat/issues/7

Can we use a third-party full-text search JS library and store the index securely?

shinaoka avatar Jun 02 '25 22:06 shinaoka

It is a shame that even a simple text search does not work for Asian languages, which makes Element almost useless. I would be happy to contribute to the development or fix it. Please, anyone, review my report.

shinaoka avatar Jun 04 '25 03:06 shinaoka

https://github.com/matrix-org/seshat/blob/main/CHANGELOG.md#400---2024-06-07 explicitly removed support for Japanese, so the fix would need to be on that side of things.

t3chguy avatar Jun 04 '25 07:06 t3chguy

@t3chguy

Thank you for the information. It seems that the removal of Japanese support was due to technical reasons. However, I’m not sure if simply reintroducing support would be sufficient for the broader Matrix community.

I’ve read the discussion on the other side (though I’m not sure why that issue page is currently inactive).

As far as I can see, we have two possible directions:

  1. Introduce language-specific tokenizers for Asian languages, or
  2. Switch to a search engine based on N-grams.

The second option has the advantage of not requiring language-specific tokenizers. (I’m not a specialist in search engine design, but I assume this may increase memory usage?)
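
To make the N-gram option concrete, here is a minimal sketch of a character-bigram inverted index. It is not how Seshat works; `ngrams` and `NgramIndex` are hypothetical names used only for illustration. Queries shorter than one N-gram fall back to a full scan, and every hit is confirmed with a substring check. The index keeps one posting per distinct bigram per message, so it would generally be expected to grow larger than a word-level index.

```javascript
// Split text into overlapping character n-grams, one whitespace-separated chunk at a time.
function ngrams(text, n = 2) {
    const grams = [];
    for (const chunk of text.toLowerCase().split(/\s+/u)) {
        const chars = Array.from(chunk); // iterate code points, not UTF-16 units
        for (let i = 0; i + n <= chars.length; i++) {
            grams.push(chars.slice(i, i + n).join(""));
        }
    }
    return grams;
}

class NgramIndex {
    constructor(n = 2) {
        this.n = n;
        this.postings = new Map(); // n-gram -> Set of document ids
        this.docs = new Map();     // document id -> original text
    }

    add(id, text) {
        this.docs.set(id, text);
        for (const gram of ngrams(text, this.n)) {
            if (!this.postings.has(gram)) this.postings.set(gram, new Set());
            this.postings.get(gram).add(id);
        }
    }

    search(query) {
        const grams = ngrams(query, this.n);
        // Queries shorter than n produce no n-grams: scan every document instead.
        let candidates = grams.length === 0 ? [...this.docs.keys()] : null;
        for (const gram of grams) {
            const ids = this.postings.get(gram) || new Set();
            candidates = candidates === null ? [...ids] : candidates.filter((id) => ids.has(id));
        }
        // Confirm each candidate with a substring check to drop false positives.
        const needle = query.toLowerCase();
        return candidates.filter((id) => this.docs.get(id).toLowerCase().includes(needle));
    }
}

const ngramIndex = new NgramIndex(2);
["cute cats", "日本語は難しいよね", "test用の部屋だよ", "学习中文"].forEach((text, id) => ngramIndex.add(id, text));
console.log(ngramIndex.search("部屋"));
```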

We could implement these improvements either by extending Matrix-Seshat, or by switching to a JavaScript-based search engine altogether.

Would it make sense to open a new issue on the Matrix-Seshat side? Or do you think it’s better to continue the discussion here, given that this ties into the broader design of Element?

shinaoka avatar Jun 04 '25 07:06 shinaoka

> We could implement these improvements either by extending Matrix-Seshat, or by switching to a JavaScript-based search engine altogether.

One of the requirements is that the search index is encrypted at rest, which may make this approach a little harder.
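
For context on what "encrypted at rest" could involve with a JavaScript index, the sketch below encrypts a serialized index as a single blob using Node's built-in `crypto` module (scrypt key derivation plus AES-256-GCM). This is an assumption-laden illustration, not Seshat's actual scheme: `encryptIndex`, `decryptIndex`, and the salt/IV/tag layout are hypothetical. It also shows part of the difficulty, since the whole index has to be decrypted into memory before it can be searched.

```javascript
const crypto = require("crypto");

// Encrypt a serialized index. Output layout (hypothetical): salt(16) | iv(12) | tag(16) | ciphertext.
function encryptIndex(serialized, passphrase) {
    const salt = crypto.randomBytes(16);
    const key = crypto.scryptSync(passphrase, salt, 32); // derive a 256-bit key
    const iv = crypto.randomBytes(12);
    const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
    const ciphertext = Buffer.concat([cipher.update(serialized, "utf8"), cipher.final()]);
    return Buffer.concat([salt, iv, cipher.getAuthTag(), ciphertext]);
}

// Decrypt a blob produced by encryptIndex. Throws if the passphrase is wrong
// or the data was modified, because the GCM authentication tag will not verify.
function decryptIndex(blob, passphrase) {
    const salt = blob.subarray(0, 16);
    const iv = blob.subarray(16, 28);
    const tag = blob.subarray(28, 44);
    const ciphertext = blob.subarray(44);
    const key = crypto.scryptSync(passphrase, salt, 32);
    const decipher = crypto.createDecipheriv("aes-256-gcm", key, iv);
    decipher.setAuthTag(tag);
    return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

const serializedIndex = JSON.stringify({ "部屋": [3] });
const encrypted = encryptIndex(serializedIndex, "example passphrase");
console.log(decryptIndex(encrypted, "example passphrase") === serializedIndex);
```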

We're not looking at switching away from Seshat at this moment in time, but improvements to it are always welcome. In theory the Element X mobile apps (via the Rust SDK) may also begin using Seshat.

t3chguy avatar Jun 04 '25 08:06 t3chguy

> the Element X mobile apps (via the Rust SDK) may also begin using Seshat

That makes sense. In that case, do you think search based on morphological analysis would be preferable, or would a simple N-gram-based approach be sufficient?

shinaoka avatar Jun 04 '25 08:06 shinaoka

That'd be a question for the Seshat project or element-meta

t3chguy avatar Jun 04 '25 08:06 t3chguy

Thank you for the suggestion. Let me open an issue in element-meta.

shinaoka avatar Jun 04 '25 09:06 shinaoka

we need a PR to fix this

gmanskibiditoilet avatar Jun 05 '25 02:06 gmanskibiditoilet

@t3chguy

My collaborators and I are considering creating a patch or pull request (PR) for Seshat. Before we proceed, we’d like to know who is responsible for reviewing PRs, because we want to avoid spending time on a contribution that may not be reviewed. The discussion on element-meta has not been active.

shinaoka avatar Jun 16 '25 04:06 shinaoka

@shinaoka I suggest asking the Matrix room mentioned in its README.

t3chguy avatar Jun 16 '25 07:06 t3chguy