[Bug] GetTokens returns incorrect results for Chinese Markdown files

Open: hty579 opened this issue 9 months ago • 1 comment

Context / Scenario

When the input is a Chinese Markdown file, the GetTokens function returns incorrect results.

CL100KTokenizer cL100KTokenizer = new CL100KTokenizer();
var result = cL100KTokenizer.GetTokens("交通运输部关于发布《公路桥涵设计通用规范》的公告\r\n现发布《公路桥涵设计通用规范》(JTG D60-2015),作为公路工程行业标准,自 2015 年 12 月 1 日起施行,原《公路桥涵设计通用规范》(JTG D60-2004)同时废止。");

What happened?

GetTokens should return the correct result, but it does not. This affects MarkDownChunker's data splitting.

Importance

A fix would make my life easier.

Platform, Language, Versions

C# Microsoft.KernelMemory.Core V0.98.250324.1

Relevant log output


hty579, Apr 02 '25 15:04