semantic-kernel

Python: CJK support for text splitter

Open marlenezw opened this issue 1 year ago • 3 comments

Motivation and Context

The text splitting algorithm has hardcoded punctuation values based on ASCII punctuation (,.; etc.). Chinese and Japanese use either full-width Unicode punctuation (https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block)) or ideographic punctuation. If you give the text splitter Japanese text, it will incorrectly split in the middle of sentences. The current algorithm also defaults to an approximate token counter, which is partly accurate for Indo-European/Latin-alphabet languages but far off for CJK. This test uses the BPE token length for cl100k_base (GPT3/4).
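The splitting problem can be reproduced with a minimal sketch. The separator sets, the `split_sentences` helper, and the sample text below are hypothetical and written only for illustration; they are not the library's actual lists or API.

```python
import re
from typing import List

# Hypothetical separator sets for illustration only (not the library's lists).
ASCII_SENTENCE_ENDINGS = ".!?"
CJK_SENTENCE_ENDINGS = "。！？．"  # ideographic full stop plus full-width forms


def split_sentences(text: str, endings: str) -> List[str]:
    """Split `text` after any character in `endings`, keeping the delimiter."""
    pattern = f"(?<=[{re.escape(endings)}])"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


# Made-up sample: two Japanese sentences ending in the ideographic full stop.
text = "田中の猫はかわいいですね。日本語の句読点は全角です。"

ascii_only = split_sentences(text, ASCII_SENTENCE_ENDINGS)
with_cjk = split_sentences(text, ASCII_SENTENCE_ENDINGS + CJK_SENTENCE_ENDINGS)

print(len(ascii_only))  # 1: no ASCII delimiter found, so nothing is split
print(with_cjk)         # two sentences, each ending in "。"
```

With only ASCII delimiters the whole text stays in one piece, so a splitter that must respect a size limit has no sentence boundary to fall back on and ends up cutting mid-sentence.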

This is the Python version of this PR

Description

This change adds some basic full-width characters and ideographic punctuation to the built-in list. It also adds a test which checks that it splits a sentence correctly. The first part of the test string (田中の猫はかわいいですね.) is 16 tokens for cl100k_base.
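To illustrate why a character-based estimate is a poor fit for CJK, here is a rough sketch. It assumes the approximation is "about four characters per token"; that ratio and the `approx_token_count` name are assumptions for illustration, and the 16-token figure is the one quoted above rather than something recomputed here.

```python
def approx_token_count(text: str) -> int:
    """Character-based estimate; the 4-characters-per-token ratio is an assumption."""
    return len(text) // 4


# First part of the test string, as quoted in the description above.
sample = "田中の猫はかわいいですね."

# Token count reported in this PR for cl100k_base (not recomputed in this sketch).
REPORTED_CL100K_TOKENS = 16

estimate = approx_token_count(sample)
print(len(sample))                        # 13 characters
print(estimate)                           # 3 under the assumed heuristic
print(REPORTED_CL100K_TOKENS - estimate)  # 13 tokens of undercount
```

A splitter that trusts the estimate would pack several times more Japanese text into a chunk than the token budget allows, which is why the test measures length with a real BPE tokenizer instead.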

Contribution Checklist

- The code builds clean without any errors or warnings
- The PR follows the SK Contribution Guidelines and the pre-submission formatting script raises no violations
- All unit tests pass, and I have added new tests where possible
- I didn't break anything!

marlenezw, Mar 18 '24 14:03

Py3.8 Test Coverage

Python 3.8 Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| semantic_kernel/text | | | | |
| text_chunker.py | 113 | 4 | 96% | 208, 231, 248, 308 |
| TOTAL | 5365 | 987 | 82% | |

Python 3.8 Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 1208 | 11 :zzz: | 0 :x: | 0 :fire: | 15.992s :stopwatch: |

markwallace-microsoft, Mar 18 '24 14:03

Adding related issue

marlenezw, Mar 18 '24 14:03

Hi @marlenezw, thanks for working on this. Could you please pull latest main, and re-build the poetry lock file from the Python directory? (poetry lock --no-update)

moonbox3, May 07 '24 16:05

Hi @marlenezw, since we haven't heard from you in several weeks, we will close the PR. When ready, please re-open the PR against latest code from main. Thank you.

moonbox3, May 31 '24 17:05