Add Japanese tokenizer (fugashi) and minimal unit test

Open • supernaiter opened this pull request 5 months ago • 2 comments

What does this PR do?

This PR adds minimal support for Japanese prompt tokenization.

What's included:

A Japanese tokenizer utility (tokenize_jp) using fugashi + unidic-lite
A Unicode-based language detector (is_japanese_text) to support lang="auto"
Minimal unit tests for tokenizer correctness
setup.py updated with extras_require["ja"] to optionally install Japanese dependencies
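
As a rough illustration of the first two items, a detector and tokenizer along these lines would fit the description. This is a sketch, not the code in the PR. The function names come from the list above, but the kana-only heuristic, the code point ranges, the cached `_get_tagger` helper, and the error message are all assumptions. The only fugashi calls used are `Tagger()` and the `surface` attribute of the returned nodes.

```python
from functools import lru_cache
from typing import List


def is_japanese_text(text: str) -> bool:
    """Return True if the text contains at least one hiragana or katakana letter.

    Assumption: kana presence is used as the signal because kanji alone
    cannot distinguish Japanese from Chinese text.
    """
    for ch in text:
        code = ord(ch)
        # Hiragana letters U+3041..U+3096, katakana letters U+30A1..U+30FA.
        if 0x3041 <= code <= 0x3096 or 0x30A1 <= code <= 0x30FA:
            return True
    return False


@lru_cache(maxsize=1)
def _get_tagger():
    """Build the fugashi tagger once (hypothetical helper, not from the PR)."""
    try:
        # Imported lazily so the module still loads without the "ja" extra.
        from fugashi import Tagger
    except ImportError as exc:
        raise ImportError(
            "Japanese tokenization needs fugashi and unidic-lite installed"
        ) from exc
    return Tagger()


def tokenize_jp(text: str) -> List[str]:
    """Split Japanese text into surface-form tokens using fugashi."""
    return [word.surface for word in _get_tagger()(text)]
```

Importing fugashi inside the helper keeps the dependency optional: callers who never pass Japanese text do not need it installed, and `is_japanese_text` depends only on the standard library.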

This is the first step toward enabling Japanese prompt compression. The change is designed to be self-contained and safe to merge.
Future work (e.g., integration into compress_prompt) will follow in separate PRs.
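
Part of what keeps the change self-contained is that the Japanese dependencies sit behind the optional `ja` extra. A minimal sketch of such an entry follows. The PyPI package names are real, but the variable name and the absence of version pins are assumptions, since the PR's actual setup.py is not shown here.

```python
# Hypothetical fragment of setup.py; the real file may pin versions.
EXTRAS_REQUIRE = {
    # Pulled in only when the "ja" extra is requested at install time.
    "ja": ["fugashi", "unidic-lite"],
}

# In setup.py this dict would be passed as: setup(..., extras_require=EXTRAS_REQUIRE)
```

From a local checkout, the extra would then be installed with `pip install ".[ja]"`, while a plain install leaves both packages out.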

Fixes: N/A

Before submitting

[x] This PR is a new feature.
[x] Changes are backward-compatible.
[x] Tests for new functionality are included.
[x] No documentation changes are needed at this stage.

Who can review?

@iofu728 @SiyunZhao — this is a minimal PR for Japanese support. Would love your input before we follow up with lang="ja" integration.

Jul 12 '25 22:07 supernaiter

@microsoft-github-policy-service agree

Jul 12 '25 22:07 supernaiter

Hi @iofu728 @SiyunZhao — This is a minimal PR to support Japanese prompt tokenization. All tests passed, CLA is signed, and the PR is self-contained.

Would love your feedback or approval when convenient. Thanks!

Jul 12 '25 22:07 supernaiter