chroma [Bug]: Querying with certain foreign language data is failing to return correctly with sqlite3 fts5 tokenizer='trigram'

What happened?

For certain Asian languages, e.g. Korean (and potentially Chinese and Japanese), the sqlite3 tokenizers don't work well without ICU support.

Testing on the sqlite file with the fts5 tokenizer set to trigram, it looks like trigrams correctly match the data whenever at least 3 characters are searched for:

SELECT * FROM embedding_fulltext_search WHERE string_value LIKE '%순매수%';

returns: 주요 신흥 4개국 증시 외국인투자자 순매수액

but

SELECT * FROM embedding_fulltext_search WHERE string_value LIKE '%순매%';

doesn't return anything.
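The behaviour above can be reproduced outside Chroma with a minimal Python sketch. It assumes a Python build whose SQLite includes FTS5 and the trigram tokenizer (SQLite 3.34 or later); the table name `demo` is hypothetical. The result of the 2-character query depends on the SQLite library version, so the script prints what it finds instead of assuming an outcome.

```python
import sqlite3

DOC = "주요 신흥 4개국 증시 외국인투자자 순매수액"


def trigram_like_demo():
    """Run a 3-character and a 2-character LIKE query against a trigram table."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table name; fails if FTS5 or trigram is unavailable.
        conn.execute(
            "CREATE VIRTUAL TABLE demo USING fts5(string_value, tokenize = 'trigram')"
        )
    except sqlite3.OperationalError as exc:
        conn.close()
        return {"error": str(exc)}
    conn.execute("INSERT INTO demo(string_value) VALUES (?)", (DOC,))
    results = {}
    for pattern in ("%순매수%", "%순매%"):
        rows = conn.execute(
            "SELECT string_value FROM demo WHERE string_value LIKE ?", (pattern,)
        ).fetchall()
        results[pattern] = [row[0] for row in rows]
    conn.close()
    return results


print("SQLite library version:", sqlite3.sqlite_version)
for pattern, rows in trigram_like_demo().items():
    print(pattern, "->", rows)
```

Comparing the printed version with the two results makes it easier to tell whether a given environment is affected.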

Having tried the fts5 porter and unicode61 tokenizers (the tables below use ascii and unicode61), I found that they only match terms delimited by white space:

CREATE VIRTUAL TABLE tascii USING fts5(x, tokenize = 'ascii');
CREATE VIRTUAL TABLE tuni USING fts5(x, tokenize = 'unicode61');

select * from tuni('신흥');
select * from tascii('신흥');

returns 주요 신흥 4개국 증시 외국인투자자 순매수액

correctly but

select * from tuni('순매');
select * from tascii('순매');

fails to return the same data.

Fortunately, the WHERE x LIKE query that fails with the trigram tokenizer returns the correct row with these tokenizers:

select * from tuni WHERE x LIKE '%순매%';
select * from tascii WHERE x LIKE '%순매%';

returns

주요 신흥 4개국 증시 외국인투자자 순매수액
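The unicode61 results above can be checked with a short Python sketch. It assumes a Python build whose SQLite includes FTS5, and it uses the explicit `MATCH` operator in place of the `tuni('...')` shorthand. The dictionary keys are illustrative names, not anything from Chroma.

```python
import sqlite3

DOC = "주요 신흥 4개국 증시 외국인투자자 순매수액"


def unicode61_demo():
    """Compare whole-token MATCH, partial-token MATCH and LIKE on unicode61."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE tuni USING fts5(x, tokenize = 'unicode61')")
    except sqlite3.OperationalError as exc:
        conn.close()
        return {"error": str(exc)}
    conn.execute("INSERT INTO tuni(x) VALUES (?)", (DOC,))

    def run(sql, arg):
        return [row[0] for row in conn.execute(sql, (arg,))]

    results = {
        # '신흥' is a whole whitespace-delimited token, so MATCH finds it.
        "match_whole_token": run("SELECT x FROM tuni WHERE tuni MATCH ?", "신흥"),
        # '순매' is only part of the token '순매수액', so MATCH finds nothing.
        "match_partial_token": run("SELECT x FROM tuni WHERE tuni MATCH ?", "순매"),
        # LIKE does not use the token index here and matches the substring.
        "like_substring": run("SELECT x FROM tuni WHERE x LIKE ?", "%순매%"),
    }
    conn.close()
    return results


for name, rows in unicode61_demo().items():
    print(name, "->", rows)
```

The LIKE query works here because it does not go through the token index at all, which is also why it may be slower on large tables.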

A potential fix for queries failing to return the correct data for the above languages might be switching fts5 tokenizers and possibly changing the way they are queried. One problem is that I don't know how this would affect performance or other languages.

The best-case scenario is somehow integrating the ICU tokenizer into chromadb's sqlite usage, but that requires a larger effort.

Versions

chroma 0.4.8, python 3.10.11

Relevant log output

No response

Sep 01 '23 06:09 wojangAI

Thanks, @wojang-ziumks. I think we need to think this through, but generally, adding the unicode61 tokenizer shouldn't have much impact on support for other languages, maybe just a little additional CPU overhead.

Sample for the change:

CREATE VIRTUAL TABLE embedding_fulltext USING fts5(id, string_value, tokenize = 'unicode61');

I can do the change and run some local tests to verify the above.

Sep 01 '23 07:09 tazarov

@wojang-ziumks, I had a discussion with @HammadB about this. While the suggested solution above works, the issue is that if we add it outright, it will break existing deployments. So now we're considering how we can let users pick their own sqlite tokenizer, and possible migration paths.
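To make the migration concern concrete, here is a rough Python sketch of what rebuilding an FTS5 table with a different tokenizer could look like. It is an illustration only, not how Chroma performs migrations: the helper `rebuild_fts` and the set `BUILTIN_TOKENIZERS` are hypothetical, and the table is assumed to have the `id` and `string_value` columns from the sample above.

```python
import sqlite3

# Tokenizers built into FTS5; trigram needs SQLite 3.34 or later.
BUILTIN_TOKENIZERS = {"unicode61", "ascii", "porter", "trigram"}


def rebuild_fts(conn, table, tokenizer):
    """Recreate an FTS5 table (columns: id, string_value) with another tokenizer.

    Hypothetical helper for illustration. It is not atomic, so back up the
    database file before trying anything like this on real data.
    """
    if tokenizer not in BUILTIN_TOKENIZERS:
        raise ValueError(f"unsupported tokenizer: {tokenizer!r}")
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    # Read every row into memory before dropping the old index.
    rows = conn.execute(f"SELECT id, string_value FROM {table}").fetchall()
    conn.execute(f"DROP TABLE {table}")
    conn.execute(
        f"CREATE VIRTUAL TABLE {table} "
        f"USING fts5(id, string_value, tokenize = '{tokenizer}')"
    )
    conn.executemany(
        f"INSERT INTO {table}(id, string_value) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

Calling `rebuild_fts(conn, "embedding_fulltext", "unicode61")` on a copy of a database would re-tokenize every stored document. Search results can change after the rebuild, which is exactly why applying it silently to existing deployments is risky.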

Sep 06 '23 10:09 tazarov

@jeffchuber This is the same as supporting custom indices

Sep 13 '23 22:09 HammadB

Ideally this will be part of #1125

Sep 18 '23 20:09 tazarov

It’s been a year, and the issue still exists.

Aug 30 '24 06:08 h3clikejava

@h3clikejava, @wojang-ziumks, I know it's been a while, but I think I have a solution to your issue.

I've recently added support for changing the FTS tokenizer in Chroma to the chroma-ops package:

chops fts rebuild --tokenizer unicode61 /path/to/persist_dir

You only have to do the above once; any subsequent upgrade of your Chroma (assuming the same persistent dir) will carry over the same tokenizer.
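One way to confirm which tokenizer an FTS table currently uses is to read its stored `CREATE VIRTUAL TABLE` statement from `sqlite_master`. The Python sketch below shows the idea on an in-memory database; the helper name `fts_table_definition` is hypothetical, and for a real check you would open your own database file and pass the actual table name.

```python
import sqlite3


def fts_table_definition(conn, table):
    """Return the stored CREATE statement for `table`, or None if it is missing."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row[0] if row else None


# Demo on an in-memory database (requires an SQLite build with FTS5).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE embedding_fulltext "
    "USING fts5(id, string_value, tokenize = 'unicode61')"
)
print(fts_table_definition(conn, "embedding_fulltext"))
```

The printed statement includes the `tokenize = ...` option, so a changed tokenizer is visible directly in the output.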

[!NOTE] Another important observation I've come across is that the trigram tokenizer seems to work OK with newer versions of sqlite3, e.g. 3.43.x or later.
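To see which SQLite library a Python environment links against, a small check like the following can help. The `(3, 43, 0)` threshold is taken from the observation above and is an assumption, not a documented cutoff; the function name `trigram_likely_ok` is hypothetical.

```python
import sqlite3


def trigram_likely_ok(version_info=sqlite3.sqlite_version_info, threshold=(3, 43, 0)):
    """Return True if the SQLite library version is at or above the threshold."""
    return tuple(version_info) >= tuple(threshold)


print("SQLite library version:", sqlite3.sqlite_version)
print("At or above assumed threshold:", trigram_likely_ok())
```

Note that `sqlite3.sqlite_version` reports the SQLite library version, which is independent of the Python version.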

Jan 15 '25 18:01 tazarov

Wow, that's amazing to hear! I know I've been busy with other things lately and haven't had the chance to really dig into chroma, but I REALLY appreciate your work on this, @tazarov.

And just in time to actually incorporate it into a project I'm working on right now too!

I'll try it out to see how it goes!

Jan 16 '25 02:01 wojangAI
