doris-thirdparty
[feature](analysis) add new Chinese tokenizer IK
Support the IK tokenizer for the inverted index: migrate analysis-ik from Java to C++ and implement basic tokenization functionality. The major differences from the original Java code are as follows:
- Encoding Format Difference: Use /jieba/Unicode.hpp to process characters in IK-C++.
- Memory Management Optimization: Add a custom allocator to avoid performance overhead caused by frequent memory allocation in STL containers.
- Remote Dictionary Support: IK-C++ does not currently support remote dictionaries.
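To illustrate the memory management point above, here is a minimal sketch of one common way to cut per-allocation overhead in STL containers: a bump-pointer arena wrapped in an STL-compatible allocator. The names `Arena` and `ArenaAllocator` are hypothetical and are not taken from the IK-C++ code; the allocator in this change may be designed differently.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Hypothetical fixed-capacity arena: serves requests by bumping an offset
// into one pre-allocated buffer, so each request avoids a call to malloc.
class Arena {
public:
    explicit Arena(std::size_t capacity)
        : buffer_(new unsigned char[capacity]), capacity_(capacity), offset_(0) {}

    // `alignment` must be a power of two no larger than the default
    // alignment of operator new (an assumption of this sketch).
    void* allocate(std::size_t bytes, std::size_t alignment) {
        std::size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned > capacity_ || bytes > capacity_ - aligned) {
            throw std::bad_alloc();
        }
        offset_ = aligned + bytes;
        return buffer_.get() + aligned;
    }

    // Releases everything at once; individual frees are not supported.
    void reset() { offset_ = 0; }
    std::size_t used() const { return offset_; }

private:
    std::unique_ptr<unsigned char[]> buffer_;
    std::size_t capacity_;
    std::size_t offset_;
};

// Minimal allocator adapter so STL containers can draw memory from an Arena.
template <typename T>
class ArenaAllocator {
public:
    using value_type = T;

    explicit ArenaAllocator(Arena& arena) noexcept : arena_(&arena) {}

    template <typename U>
    ArenaAllocator(const ArenaAllocator<U>& other) noexcept
        : arena_(other.arena()) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena_->allocate(n * sizeof(T), alignof(T)));
    }

    // No-op: memory is reclaimed only when the arena is reset or destroyed.
    void deallocate(T*, std::size_t) noexcept {}

    Arena* arena() const noexcept { return arena_; }

private:
    Arena* arena_;
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return a.arena() == b.arena();
}

template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return !(a == b);
}
```

A container opts in by naming the allocator type, for example `std::vector<int, ArenaAllocator<int>> v(alloc);`. The trade-off is that memory freed by the container is not reused until the arena is reset, which suits short-lived, per-document tokenization state.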
Major changes to the original code:
- testChinese.cpp: Add a test that measures Chinese tokenization speed, using the dataset at /src/test/data/contribs-lib/analysis/chinese/speed-test-text.txt (红楼梦, Dream of the Red Chamber).
- LanguageBasedAnalyzer.h/cpp:
  - Add IK tokenizer configuration, initialization entry, and dictionary loading logic.
  - Add the IK tokenization mode entry (temporary mode entry) in AnalyzerMode.
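
As a rough illustration of what a tokenization speed test measures, the sketch below times repeated tokenization of a UTF-8 string with `std::chrono`. It is not the code in testChinese.cpp: `tokenize_by_code_point` is a hypothetical stand-in that emits one token per UTF-8 code point, and `measure_speed` and `SpeedResult` are names introduced here for illustration only.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a real tokenizer: splits UTF-8 text into one
// token per code point, using the lead byte to find each sequence length.
std::vector<std::string> tokenize_by_code_point(const std::string& text) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;  // ASCII, or a stray continuation byte
        if (lead >= 0xF0) {
            len = 4;
        } else if (lead >= 0xE0) {
            len = 3;  // most CJK characters fall in this range
        } else if (lead >= 0xC0) {
            len = 2;
        }
        if (len > text.size() - i) {
            len = text.size() - i;  // truncated sequence at end of input
        }
        tokens.push_back(text.substr(i, len));
        i += len;
    }
    return tokens;
}

struct SpeedResult {
    std::size_t tokens;  // total tokens produced over all rounds
    double seconds;      // wall-clock time spent tokenizing
};

// Runs `tokenize` over `text` for `rounds` iterations and reports totals.
template <typename Tokenizer>
SpeedResult measure_speed(const std::string& text, Tokenizer tokenize, int rounds) {
    auto start = std::chrono::steady_clock::now();
    std::size_t total = 0;
    for (int r = 0; r < rounds; ++r) {
        total += tokenize(text).size();
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return SpeedResult{total, elapsed.count()};
}
```

Dividing `tokens` by `seconds` gives a tokens-per-second figure. In a real test the input would be read from the dataset file and the stand-in replaced with the tokenizer under test.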