[quality] Long words in zh-hans model (20198 suggested changes)

Open peterburk opened this issue 11 months ago • 0 comments

Thank you for making budoux! I've been using it actively for Chinese (Traditional), Chinese (Simplified), Japanese, and Thai. It's very fast, and I really appreciate your work on it!

Input: UNv1.0.en-zh.zh

Process: cat "/Users/peter/Downloads/budoux-main/UNv1.0.en-zh.zh" | python3 budoux/main.py -m 'budoux/models/zh-hans.json' > "/Users/peter/Downloads/budoux-main/UNv1.0.en-zh.zhSpaced.txt"

Expected output (sample):

基 皮亚 克 土著马 赛 群体 争取 生存 计划 释放 利比亚国民 阿卜杜勒 巴塞特 波斯尼亚 - 克罗地亚 - 塞尔维亚 ( 阿波 斯托 洛斯安 德 列 亚斯 角)

Expected output (full): UNv1.0.en-zh.zhSpacedWordsOver5CharactersSpaced.txt

The expected output was generated with a development copy of https://pingtype.github.io/

Actual output (sample):

基皮亚克土著马赛群体争取生存计划 释放利比亚国民阿卜杜勒巴塞特 波斯尼亚-克罗地亚-塞尔维亚 (阿波斯托洛斯安德列亚斯角)

Actual output (full): UNv1.0.en-zh.zhSpacedWordsOver5Characters.txt
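
For anyone who wants to reproduce the "over 5 characters" filter on their own output, here is a minimal sketch. It assumes the segmented text uses whitespace between chunks, as in the samples above; `long_chunks` is a hypothetical helper name, not part of budoux, and this is not necessarily how the attached files were produced.

```python
def long_chunks(segmented: str, max_len: int = 5) -> list[str]:
    """Return the whitespace-separated chunks longer than max_len characters."""
    # str.split() with no arguments splits on any run of whitespace,
    # so spaces and newlines in the segmenter output are both handled.
    return [chunk for chunk in segmented.split() if len(chunk) > max_len]


if __name__ == "__main__":
    # Illustrative input built from the sample above, not real budoux output.
    sample = "基皮亚克土著马赛群体 争取 生存计划"
    for chunk in long_chunks(sample):
        print(len(chunk), chunk)
```

Counting with `len()` treats punctuation such as `-` and `(` as characters too, so a stricter filter might want to count only CJK characters.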

Please message me if you have any more questions, and I'd be happy to advise. I also have more data for long words (over 5 characters) in Japanese, Chinese (Traditional), and Thai - please comment here or email me when you're working on this issue, and I can collaborate with you more :)

peterburk, Jan 10 '25 09:01