Analyze: No word segmentation/incorrect tokenization for Chinese
Describe the bug
The analyze module does not perform correct segmentation for Chinese texts. As Chinese does not use whitespace to separate words, CATMA treats only punctuation symbols as word breaks. A query like wild="人人" therefore only returns matches that are surrounded by whitespace or punctuation, which is not normally the case in Chinese texts. Queries like wild="%人人%" then return whole subclauses or phrases sandwiched between two punctuation marks, which is also not ideal: the match is not a "word containing the query" but rather a "sentence containing the query".
This issue also extends to the other analysis tools, such as KWIC, where the left/right contexts consist of whole sentences instead of a couple of words before/after the match.
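To illustrate, here is a minimal Python sketch that models the punctuation-only tokenization described above (CATMA itself is Java; the sentence and the punctuation set are just examples):

```python
import re

# Model of the current behavior: only whitespace and punctuation count
# as word breaks, so each whole clause ends up as a single "token".
text = "人人生而自由，在尊严和权利上一律平等。"
tokens = [t for t in re.split(r"[，。！？、；：\s]+", text) if t]
print(tokens)   # ['人人生而自由', '在尊严和权利上一律平等']

print("人人" in tokens)                    # False: an exact query finds nothing
print([t for t in tokens if "人人" in t])  # ['人人生而自由']: a wildcard matches the whole clause
```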
To Reproduce
Steps to reproduce the behavior:
- Import Chinese text
- Go to Analyze
- Run Queries
- Run KWIC
Expected behavior
CATMA should not take whitespace as the word boundary delimiter for CJK scripts the way it does for Latin script. It could either use a proper Chinese segmentation tool, or a single-character approach where each Chinese character is treated as its own word.
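A minimal sketch of the single-character approach, again in Python for brevity (assumption: only the basic CJK Unified Ideographs block U+4E00–U+9FFF is handled here; the extension blocks would need additional ranges):

```python
import re

# Single-character approach: every CJK ideograph becomes its own token,
# while other non-space runs stay intact, so Latin words and numbers
# are still tokenized as whole units.
def tokenize(text: str) -> list[str]:
    return re.findall(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+", text)

print(tokenize("人人生而自由，在尊严和权利上一律平等。"))
# ['人', '人', '生', '而', '自', '由', '，', '在', '尊', '严', ...]
# A query like wild="人人" would then match two adjacent tokens, i.e. a phrase.
```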
Information about your environment
- OS: Windows 10 Enterprise 22H2
- Browser: Firefox 102.11.0esr (64-Bit)
Additional context
A mature segmentation tool for Chinese is, for example, Jieba.
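For example (a quick sketch using the Python version of Jieba; `jieba.lcut` is its standard segmentation call, and the exact output depends on the dictionary in use):

```python
import jieba  # pip install jieba

# Dictionary-based word segmentation: "人人" comes out as one word.
print(jieba.lcut("人人生而自由，在尊严和权利上一律平等。"))
# e.g. ['人人', '生而自由', '，', '在', '尊严', '和', '权利', '上', '一律平等', '。']
```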
@pcdi Just a quick note to say thank you for your recent submissions here! We're currently quite distracted by the release of CATMA 7 and associated tasks, but hopefully we can take a look at these issues in the not-too-distant future.