Python: CJK support for text splitter
Motivation and Context
The text-splitting algorithm has hardcoded punctuation values based on ASCII punctuation (`,` `.` `;` etc.). Chinese and Japanese use full-width Unicode punctuation (https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) as well as ideographic punctuation. If you give the text splitter Japanese text, it will incorrectly split in the middle of sentences. The current algorithm defaults to an approximation token counter, which is reasonably accurate for Indo-European/Latin-alphabet languages but far off for CJK. The new test uses the BPE token length for cl100k_base (GPT-3/4).
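The mis-split behavior can be illustrated with a minimal sketch (the separator set and variable names here are hypothetical, not the splitter's actual internals): an ASCII-only separator set never fires on Japanese text, so no sentence boundaries are found.

```python
import re

# Hypothetical ASCII-only separator set, mirroring the hardcoded punctuation.
ASCII_SEPARATORS = r"[.,;]"

japanese = "田中の猫はかわいいですね。犬も好きです。"

# No ASCII punctuation appears in the text, so re.split returns it unsplit;
# a chunker would then be forced to cut mid-sentence at the length limit.
parts = re.split(ASCII_SEPARATORS, japanese)
print(parts)  # -> the whole string as one element
```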
This is the Python version of this PR
Description
This change adds some basic full-width and ideographic punctuation characters to the built-in list. It also adds a test that checks that a sentence is split correctly. The first part of the test string (田中の猫はかわいいですね.) is 16 tokens in cl100k_base.
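The effect of the extended punctuation list can be sketched as follows (a simplified illustration, not the text_chunker implementation): once ideographic punctuation such as 。 (U+3002) and 、 (U+3001) is in the separator set, Japanese sentences split at their natural boundaries.

```python
import re

# Hypothetical separator set extended with ideographic and full-width
# punctuation, in the spirit of this change.
CJK_SEPARATORS = r"[.,;。、，．！？]"

japanese = "田中の猫はかわいいですね。犬も好きです。"

# The splitter now finds the sentence-ending 。 characters.
parts = [p for p in re.split(CJK_SEPARATORS, japanese) if p]
print(parts)  # -> ['田中の猫はかわいいですね', '犬も好きです']
```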
Contribution Checklist
- The code builds clean without any errors or warnings
- The PR follows the SK Contribution Guidelines and the pre-submission formatting script raises no violations
- All unit tests pass, and I have added new tests where possible
- I didn't break anything!
Python 3.8 Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| semantic_kernel/text/text_chunker.py | 113 | 4 | 96% | 208, 231, 248, 308 |
| TOTAL | 5365 | 987 | 82% | |
Python 3.8 Unit Test Overview
| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 1208 | 11 :zzz: | 0 :x: | 0 :fire: | 15.992s :stopwatch: |
Adding related issue
Hi @marlenezw, thanks for working on this. Could you please pull the latest main and re-build the poetry lock file from the Python directory? (`poetry lock --no-update`)
Hi @marlenezw, since we haven't heard from you in several weeks, we are closing this PR. When ready, please re-open it against the latest code from main. Thank you.