Python: CJK support for text splitter
Motivation and Context
The text-splitting algorithm has hardcoded punctuation values based on ASCII punctuation (`,` `.` `;` etc.). Chinese and Japanese use full-width Unicode punctuation (https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) as well as ideographic punctuation. If you give the text splitter Japanese text, it will incorrectly split in the middle of sentences. The current algorithm defaults to an approximation token counter, which is reasonably accurate for Indo-European/Latin-alphabet languages but far off for CJK. The new test uses the BPE token length for cl100k_base (GPT-3/4).
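The mis-split behavior can be illustrated with a minimal sketch (the separator set and variable names here are hypothetical, not the splitter's actual internals): an ASCII-only separator set never fires on Japanese text, so no sentence boundaries are found.

```python
import re

# Hypothetical ASCII-only separator set, mirroring the hardcoded punctuation.
ASCII_SEPARATORS = r"[.,;]"

japanese = "田中の猫はかわいいですね。犬も好きです。"

# No ASCII punctuation appears in the text, so re.split returns it unsplit;
# a chunker would then be forced to cut mid-sentence at the length limit.
parts = re.split(ASCII_SEPARATORS, japanese)
print(parts)  # -> the whole string as one element
```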
This is the Python version of this PR
Description
This change adds some basic full-width and ideographic punctuation characters to the built-in list. It also adds a test that checks that a sentence is split correctly. The first part of the test string (田中の猫はかわいいですね.) is 16 tokens in cl100k_base.
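The effect of the extended punctuation list can be sketched as follows (a simplified illustration, not the text_chunker implementation): once ideographic punctuation such as 。 (U+3002) and 、 (U+3001) is in the separator set, Japanese sentences split at their natural boundaries.

```python
import re

# Hypothetical separator set extended with ideographic and full-width
# punctuation, in the spirit of this change.
CJK_SEPARATORS = r"[.,;。、，．！？]"

japanese = "田中の猫はかわいいですね。犬も好きです。"

# The splitter now finds the sentence-ending 。 characters.
parts = [p for p in re.split(CJK_SEPARATORS, japanese) if p]
print(parts)  # -> ['田中の猫はかわいいですね', '犬も好きです']
```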
Contribution Checklist
- The code builds clean without any errors or warnings
- The PR follows the SK Contribution Guidelines and the pre-submission formatting script raises no violations
- All unit tests pass, and I have added new tests where possible
- I didn't break anything!
Python 3.8 Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| semantic_kernel/text/text_chunker.py | 113 | 4 | 96% | 208, 231, 248, 308 |
| TOTAL | 5365 | 987 | 82% | |
Python 3.8 Unit Test Overview
| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 1208 | 11 :zzz: | 0 :x: | 0 :fire: | 15.992s :stopwatch: |
Adding related issue
Hi @marlenezw, thanks for working on this. Could you please pull the latest main and re-build the poetry lock file from the Python directory? (`poetry lock --no-update`)
Hi @marlenezw, since we haven't heard from you in several weeks, we are closing this PR. When ready, please re-open it against the latest code from main. Thank you.