
[Bug]: Querying with certain foreign language data is failing to return correctly with sqlite3 fts5 tokenizer='trigram'

Open wojangAI opened this issue 2 years ago • 5 comments

What happened?

For certain Asian languages, e.g. Korean (and potentially Chinese and Japanese), the sqlite3 fts5 tokenizers don't work well without ICU support.

Testing directly on the sqlite file with the fts5 tokenizer set to trigram, the trigram index does find the data as long as the search term is at least 3 characters long:

SELECT * FROM embedding_fulltext_search WHERE string_value LIKE '%순매수%';

returns: 주요 신흥 4개국 증시 외국인투자자 순매수

but

SELECT * FROM embedding_fulltext_search WHERE string_value LIKE '%순매%';

doesn't return anything.
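
To illustrate why a 2-character term has nothing to match against, here is a minimal sketch of trigram splitting. It is a simplified illustration only, not SQLite's actual tokenizer code, and `trigrams` is a hypothetical helper name:

```python
def trigrams(text):
    """Return every run of three consecutive characters in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


# A 3-character term yields exactly one trigram that can be looked up.
three_chars = trigrams("순매수")

# A 2-character term yields no trigrams, so there is nothing to look up.
two_chars = trigrams("순매")

print(three_chars)
print(two_chars)
```

Under this simplified model, any term shorter than 3 characters produces an empty list, which is consistent with the empty result above.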

I also tried the fts5 porter, ascii and unicode61 tokenizers. They only match whole terms delimited by white space:

CREATE VIRTUAL TABLE tascii USING fts5(x, tokenize = 'ascii');

CREATE VIRTUAL TABLE tuni USING fts5(x, tokenize = 'unicode61');

SELECT * FROM tuni('신흥');

SELECT * FROM tascii('신흥');

returns 주요 신흥 4개국 증시 외국인투자자 순매수액

correctly but

SELECT * FROM tuni('순매');

SELECT * FROM tascii('순매');

fails to return the same data.

Fortunately, the same WHERE x LIKE query that fails on the trigram table returns the correct row on these tables:

SELECT * FROM tuni WHERE x LIKE '%순매%';

SELECT * FROM tascii WHERE x LIKE '%순매%';

returns

주요 신흥 4개국 증시 외국인투자자 순매수액
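
For anyone who wants to reproduce the unicode61 behaviour outside of chroma, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the bundled SQLite was compiled with FTS5; the in-memory database and the helper names `match` and `like` are only for illustration:

```python
import sqlite3

text = "주요 신흥 4개국 증시 외국인투자자 순매수액"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE tuni USING fts5(x, tokenize = 'unicode61')")
conn.execute("INSERT INTO tuni (x) VALUES (?)", (text,))


def match(term):
    """Full-text query: only matches whole tokens."""
    rows = conn.execute("SELECT x FROM tuni WHERE tuni MATCH ?", (term,))
    return [row[0] for row in rows]


def like(term):
    """Plain substring query: scans every row."""
    rows = conn.execute("SELECT x FROM tuni WHERE x LIKE ?", (f"%{term}%",))
    return [row[0] for row in rows]


whole_word = match("신흥")    # '신흥' is a white-space delimited token
partial_match = match("순매")  # '순매' is only part of the token '순매수액'
partial_like = like("순매")    # LIKE does not depend on token boundaries

print(whole_word, partial_match, partial_like)
```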

A potential fix for queries failing to return the correct data in these languages might be to switch fts5 tokenizers, and possibly to change the way they are queried. One problem is that I don't know how this would impact performance or other languages.

The best-case scenario is to somehow integrate the ICU tokenizer into chromadb's sqlite usage, but that would require a larger effort.
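
As a rough sketch of what a different query strategy could look like (one possible approach, not how chromadb works today): use the trigram index for terms of 3 or more characters, and fall back to a plain row scan for shorter terms. This assumes an SQLite build with FTS5 and the trigram tokenizer (3.34 or later). `search_substring` and the `demo` table are hypothetical names, and the performance impact of the scan on large tables is untested:

```python
import sqlite3


def search_substring(conn, table, column, needle):
    """Hypothetical helper: trigram MATCH for 3+ characters, row scan otherwise."""
    if len(needle) >= 3:
        # Quote the term as an FTS5 string; embedded double quotes are doubled.
        phrase = '"' + needle.replace('"', '""') + '"'
        sql = f"SELECT {column} FROM {table} WHERE {table} MATCH ?"
        params = (phrase,)
    else:
        # instr() is an ordinary SQL function, so SQLite evaluates it for
        # every row instead of handing the term to the trigram index.
        sql = f"SELECT {column} FROM {table} WHERE instr({column}, ?) > 0"
        params = (needle,)
    return [row[0] for row in conn.execute(sql, params)]


text = "주요 신흥 4개국 증시 외국인투자자 순매수액"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE demo USING fts5(x, tokenize = 'trigram')")
conn.execute("INSERT INTO demo (x) VALUES (?)", (text,))

long_term = search_substring(conn, "demo", "x", "순매수")
short_term = search_substring(conn, "demo", "x", "순매")
print(long_term, short_term)
```

The table and column names are interpolated into the SQL text, so in this sketch they must come from trusted code, never from user input.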

Versions

chroma 0.4.8, python 3.10.11

Relevant log output

No response

wojangAI, Sep 01 '23 06:09