The charset definition seems to depend on the ordering of codes

Open bcat-eu opened this issue 4 years ago • 0 comments

Describe the bug
Listing character codes in a charset definition does not always work, and changing the order of the codes changes their behavior.

To Reproduce
Here we create the table and add the umlauts as described in https://github.com/manticoresoftware/manticoresearch/issues/584#issuecomment-867758309

I skipped the lemmatizer due to #588, and index_exact_words because it would be ignored without morphology, but the same behavior can be reproduced with the German stemmer / lemmatizer and index_exact_words.

The umlaut codes are ordered by value (ascending) as in the above ticket.

mysql> create table test(title text) min_infix_len='3' expand_keywords='1' charset_table='non_cjk, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF';
Query OK, 0 rows affected (0,00 sec)
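
To double-check which code point belongs to which character, the explicit entries of the charset_table can be generated from the characters themselves. This is a minimal sketch: the helper names (code_point, charset_entries, GERMAN_PAIRS) are hypothetical and not part of Manticore; only the U+XXXX and U+XXXX->U+YYYY entry notation comes from the definition above.

```python
import unicodedata

# Hypothetical helpers for illustration only; they are not part of Manticore.

def code_point(ch: str) -> str:
    """Format one character as U+XXXX (at least four hex digits)."""
    # Normalize to NFC so a decomposed "o + combining diaeresis" becomes one character.
    ch = unicodedata.normalize("NFC", ch)
    return f"U+{ord(ch):04X}"

def charset_entries(pairs):
    """For each (lowercase, uppercase) pair emit 'lower' and 'upper->lower'."""
    entries = []
    for lower, upper in pairs:
        entries.append(code_point(lower))
        entries.append(f"{code_point(upper)}->{code_point(lower)}")
    return entries

# German umlauts and sharp s, listed in ascending code point order.
GERMAN_PAIRS = [("ä", "Ä"), ("ö", "Ö"), ("ü", "Ü"), ("ß", "ẞ")]

charset_table = ", ".join(["non_cjk"] + charset_entries(GERMAN_PAIRS))
print(charset_table)
```

The printed string can be compared against the charset_table value passed to create table, which confirms that U+00F6 is the lowercase "ö" and U+00D6 the uppercase "Ö".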

Run CALL KEYWORDS with all the umlauts:

mysql> CALL KEYWORDS('üäöß*', 'test', 1 as stats, 'hits' as sort_mode);  
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | üäoß*     | üäoß*      | 0    | 0    |
+------+-----------+------------+------+------+
1 row in set (0,00 sec)

It returns no hits, but note the tokenized form: it keeps every umlaut from the original query except one, the "ö", which is cast to its plain alternative "o".

The uppercase version works, though, and is correctly cast to lowercase:

mysql> CALL KEYWORDS('ÜÄÖ*', 'test', 1 as stats, 'hits' as sort_mode);  
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | üäö*      | üäö*       | 0    | 0    |
+------+-----------+------------+------+------+
1 row in set (0,00 sec)

So in the above charset declaration the U+00F6 part doesn't seem to work, but U+00D6->U+00F6 does.

I can also add some content with that umlaut:

mysql> insert into test values(1,'Wer möchte nach Österreich fahren');  
Query OK, 1 row affected (0,00 sec)

The CALL KEYWORDS query can find the word that contains lowercase "ö", but it is cast to "o":

mysql> CALL KEYWORDS('möch*', 'test', 1 as stats, 'hits' as sort_mode);  
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | moch*     | mochte     | 1    | 1    |
+------+-----------+------------+------+------+
1 row in set (0,00 sec)

Using the lowercase "ö" to find the word that starts with the uppercase "Ö" doesn't work:

mysql> CALL KEYWORDS('öst*', 'test', 1 as stats, 'hits' as sort_mode);  
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | ost*      | ost*       | 0    | 0    |
+------+-----------+------------+------+------+
1 row in set (0,00 sec)

Using the uppercase version works, and it is cast to lowercase:

mysql> CALL KEYWORDS('Öst*', 'test', 1 as stats, 'hits' as sort_mode);  
+------+-----------+-------------+------+------+
| qpos | tokenized | normalized  | docs | hits |
+------+-----------+-------------+------+------+
| 1    | öst*      | österreich  | 1    | 1    |
+------+-----------+-------------+------+------+
1 row in set (0,00 sec)

If we take the original definition (from the above ticket), 'non_cjk, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF', and move the U+00F6, U+00D6->U+00F6 part to the end, so that the definition becomes 'non_cjk, U+00E4, U+00C4->U+00E4, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF, U+00F6, U+00D6->U+00F6', then it works:

mysql> DROP TABLE test;  
Query OK, 0 rows affected (0,01 sec)  
  
mysql> create table test(title text) min_infix_len='3' expand_keywords='1' charset_table='non_cjk, U+00E4, U+00C4->U+00E4, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF, U+00F6, U+00D6->U+00F6';
Query OK, 0 rows affected (0,00 sec)  
  
mysql> CALL KEYWORDS('üäöß*', 'test', 1 as stats, 'hits' as sort_mode);  
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | üäöß*     | üäöß*      | 0    | 0    |
+------+-----------+------------+------+------+
1 row in set (0,00 sec)  
  
mysql> insert into test values(1,'Wer möchte nach Österreich fahren');  
Query OK, 1 row affected (0,00 sec)  
  
mysql> CALL KEYWORDS('möch*', 'test', 1 as stats, 'hits' as sort_mode);  
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | möch*     | möchte     | 1    | 1    |
+------+-----------+------------+------+------+
1 row in set (0,00 sec)

Am I missing something here? The fix seems a bit magical to me :)

Expected behavior
I expect the order in which I list character codes to have no effect.
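
To make that expectation concrete, here is a small sketch that treats the explicit entries of a charset_table as a plain mapping and checks that the two definitions used in this report describe the same mapping. The function explicit_mappings is a hypothetical, simplified model: it is not Manticore's parser, it skips aliases such as non_cjk, and it does not handle ranges.

```python
def explicit_mappings(charset_table: str) -> dict:
    """Return {source code point: target code point} for explicit entries.

    Simplified model for illustration: aliases (e.g. non_cjk) are skipped
    and range syntax is not handled.
    """
    mapping = {}
    for entry in charset_table.split(","):
        entry = entry.strip()
        if not entry.startswith("U+"):
            continue  # skip aliases such as non_cjk
        source, _, target = entry.partition("->")
        # A bare code point (no "->") maps to itself.
        mapping[source] = target or source
    return mapping

original = ("non_cjk, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, "
            "U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF")
reordered = ("non_cjk, U+00E4, U+00C4->U+00E4, U+00FC, U+00DC->U+00FC, "
             "U+00DF, U+1E9E->U+00DF, U+00F6, U+00D6->U+00F6")

# Both orderings list the same explicit entries, so the mappings are equal.
assert explicit_mappings(original) == explicit_mappings(reordered)
print(explicit_mappings(original)["U+00F6"])
```

Under this model both definitions map U+00F6 to itself, so I would expect "ö" to survive tokenization in both cases.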

Describe the environment:

searchd -v  
Manticore 3.6.0 96d61d8bf@210504 release

lsb_release -a    
No LSB modules are available.    
Distributor ID:	Ubuntu    
Description:	Ubuntu 20.04.2 LTS    
Release:	20.04    
Codename:	focal

Jul 01 '21 15:07 bcat-eu