Add Japanese tokenizer (fugashi) and minimal unit test
What does this PR do?
This PR adds minimal support for Japanese prompt tokenization.
What's included:
- A Japanese tokenizer utility (tokenize_jp) using fugashi + unidic-lite
- A Unicode-based language detector (is_japanese_text) to support lang="auto"
- Minimal unit tests for tokenizer correctness
- setup.py updated with extras_require["ja"] to optionally install Japanese dependencies
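For reviewers who have not opened the diff, here is a rough sketch of what the two utilities could look like. It is not the code in this PR: the function names come from the list above, but the bodies, the helper _get_tagger, and the choice of Unicode ranges are assumptions made for illustration. The fugashi calls used (Tagger() and the .surface attribute on each returned node) are part of fugashi's public API.

```python
from functools import lru_cache
from typing import List


def is_japanese_text(text: str) -> bool:
    """Return True if the text contains hiragana, katakana, or CJK ideographs.

    Assumption: a plain range check. Kanji-only input is indistinguishable
    from Chinese with this approach, so it is a heuristic, not a classifier.
    """
    for ch in text:
        if "\u3040" <= ch <= "\u30ff":  # hiragana and katakana blocks
            return True
        if "\u4e00" <= ch <= "\u9fff":  # CJK unified ideographs (kanji)
            return True
    return False


@lru_cache(maxsize=1)
def _get_tagger():
    """Hypothetical helper: build the fugashi tagger once and reuse it."""
    # Imported lazily so the module still loads when the optional
    # Japanese dependencies (fugashi, unidic-lite) are not installed.
    from fugashi import Tagger

    return Tagger()


def tokenize_jp(text: str) -> List[str]:
    """Split Japanese text into surface-form tokens using fugashi."""
    return [word.surface for word in _get_tagger()(text)]


if __name__ == "__main__":
    sample = "これはテストです"
    print(is_japanese_text(sample))
    try:
        print(tokenize_jp(sample))
    except ImportError:
        print("fugashi is not installed; install the optional ja extra")
```

Keeping the fugashi import inside the helper is one way to stay backward-compatible: users who never request Japanese tokenization do not need the extra dependencies.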
This is the first step toward enabling Japanese prompt compression. The PR is self-contained and safe to merge on its own.
Future work (e.g., integration into compress_prompt) will follow as separate PRs.
Fixes: N/A
Before submitting
- [x] This PR is a new feature.
- [x] Changes are backward-compatible.
- [x] Tests for new functionality are included.
- [x] No documentation changes are needed at this stage.
Who can review?
@iofu728 @SiyunZhao — this is a minimal PR for Japanese support.
Would love your input before we follow up with lang="ja" integration.
Hi @iofu728 @SiyunZhao — This is a minimal PR to support Japanese prompt tokenization. All tests passed, CLA is signed, and the PR is self-contained.
Would love your feedback or approval when convenient. Thanks!