doris-thirdparty

[feature](analysis) add new chinese tokenizer IK

Open Ryan19929 opened this issue 11 months ago • 0 comments

Support the IK tokenizer for the inverted index: migrate analysis-ik from Java to C++ and implement basic tokenization functionality. The major differences from the original Java code are as follows:

  1. Encoding Format Difference: IK-C++ uses /jieba/Unicode.hpp to process characters.
  2. Memory Management Optimization: A custom allocator is added to avoid the performance overhead of frequent memory allocation in STL containers.
  3. Remote Dictionary Support: IK-C++ does not currently support remote dictionaries.
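
To illustrate the custom-allocator idea in item 2, here is a minimal sketch of a bump-pointer arena plugged into an STL container. It is not the allocator from this change: `Arena`, `ArenaAllocator`, and `sumWithArena` are hypothetical names, and the sketch assumes element types whose alignment does not exceed the default alignment of `new`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Hypothetical fixed-capacity arena: allocation just bumps an offset,
// and all memory is released at once when the arena is reset or destroyed.
class Arena {
public:
    explicit Arena(std::size_t capacity)
        : buffer_(new unsigned char[capacity]), capacity_(capacity), offset_(0) {}

    // alignment must be a power of two (true for alignof(T)).
    void* allocate(std::size_t bytes, std::size_t alignment) {
        std::size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned + bytes > capacity_) throw std::bad_alloc();
        offset_ = aligned + bytes;
        return buffer_.get() + aligned;
    }

    void reset() { offset_ = 0; }
    std::size_t used() const { return offset_; }

private:
    std::unique_ptr<unsigned char[]> buffer_;
    std::size_t capacity_;
    std::size_t offset_;
};

// Minimal C++11-style allocator that forwards to an Arena.
template <typename T>
class ArenaAllocator {
public:
    using value_type = T;

    explicit ArenaAllocator(Arena& arena) noexcept : arena_(&arena) {}

    template <typename U>
    ArenaAllocator(const ArenaAllocator<U>& other) noexcept : arena_(other.arena()) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena_->allocate(n * sizeof(T), alignof(T)));
    }

    // Individual frees are no-ops; the arena reclaims everything in bulk.
    void deallocate(T*, std::size_t) noexcept {}

    Arena* arena() const noexcept { return arena_; }

private:
    Arena* arena_;
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return a.arena() == b.arena();
}

template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return !(a == b);
}

// Usage: a vector whose storage comes from the arena instead of the heap.
inline int sumWithArena(Arena& arena, int n) {
    ArenaAllocator<int> alloc(arena);
    std::vector<int, ArenaAllocator<int>> values(alloc);
    values.reserve(static_cast<std::size_t>(n));
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        values.push_back(i);
        sum += values.back();
    }
    return sum;
}
```

The trade-off in a design like this is that memory is never returned piecemeal, so it suits short-lived, per-document work such as tokenizing one input and then calling `reset()`.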

Major changes to the original code:

  1. testChinese.cpp: Add a test that measures Chinese tokenization speed, using the dataset at /src/test/data/contribs-lib/analysis/chinese/speed-test-text.txt (红楼梦, Dream of the Red Chamber).
  2. LanguageBasedAnalyzer.h/cpp: Add the IK tokenizer configuration, initialization entry point, and dictionary loading logic. Add a temporary entry for the IK tokenization mode in AnalyzerMode.
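
As a rough picture of what a tokenization speed test like the one in item 1 measures, the sketch below times repeated passes of a tokenizer callback over a text. It is not the code in testChinese.cpp: `measure` and `SpeedResult` are hypothetical names, and `countCodePoints` is only a stand-in that counts UTF-8 code points, not the IK tokenizer.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Stand-in for a real tokenizer (NOT IK): counts UTF-8 code points by
// counting every byte that is not a continuation byte (10xxxxxx).
inline std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}

struct SpeedResult {
    std::size_t tokens;  // total tokens produced over all rounds
    double seconds;      // wall-clock time for all rounds
};

// Runs `tokenize(text)` `rounds` times and reports the total token count
// and the elapsed time measured with a monotonic clock.
template <typename Tokenize>
SpeedResult measure(const std::string& text, int rounds, Tokenize tokenize) {
    auto start = std::chrono::steady_clock::now();
    std::size_t tokens = 0;
    for (int i = 0; i < rounds; ++i) {
        tokens += tokenize(text);
    }
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return SpeedResult{tokens, elapsed.count()};
}
```

In a real test the text would be read from the dataset file and the callback would drive the analyzer under test; dividing `tokens` by `seconds` then gives a throughput figure to compare between runs.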

Ryan19929 · Jan 02 '25 07:01