
Add Japanese tokenizer (fugashi) and minimal unit test

Open supernaiter opened this issue 5 months ago • 2 comments

What does this PR do?

This PR adds minimal support for Japanese prompt tokenization.

What's included:

  • A Japanese tokenizer utility (tokenize_jp) using fugashi + unidic-lite
  • A Unicode-based language detector (is_japanese_text) to support lang="auto"
  • Minimal unit tests for tokenizer correctness
  • setup.py updated with extras_require["ja"] to optionally install Japanese dependencies
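A minimal sketch of what the two helpers listed above might look like. The function names come from this PR, but the bodies, the kana-only heuristic, and the lazy import are assumptions for illustration, not the PR's actual code.

```python
from typing import List

# Unicode blocks for the two kana scripts. Kanji are deliberately excluded
# here (an assumption of this sketch) because they overlap with Chinese text.
_KANA_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
)


def is_japanese_text(text: str) -> bool:
    """Return True if the text contains at least one kana character."""
    for ch in text:
        code = ord(ch)
        if any(lo <= code <= hi for lo, hi in _KANA_RANGES):
            return True
    return False


def tokenize_jp(text: str) -> List[str]:
    """Split Japanese text into surface forms using fugashi."""
    # Imported lazily so this module still loads when the optional
    # Japanese dependencies (fugashi, unidic-lite) are not installed.
    from fugashi import Tagger

    tagger = Tagger()  # picks up unidic-lite when it is installed
    return [word.surface for word in tagger(text)]


if __name__ == "__main__":
    sample = "これは日本語の文章です。"
    print(is_japanese_text(sample))
    try:
        print(tokenize_jp(sample))
    except ImportError:
        print("fugashi is not installed; skipping tokenization")
```

A kana-only check will miss Japanese strings written entirely in kanji, so a real detector may need a ratio-based rule or additional ranges. If the `ja` extra is published as described, something like `pip install "llmlingua[ja]"` would presumably pull in the optional dependencies.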

This is the first step toward enabling Japanese prompt compression. The change is self-contained and safe to merge.
Future work (e.g., integration into compress_prompt) will follow as separate PRs.


Fixes: N/A

Before submitting

  • [x] This PR is a new feature.
  • [x] Changes are backward-compatible.
  • [x] Tests for new functionality are included.
  • [x] No documentation changes are needed at this stage.

Who can review?

@iofu728 @SiyunZhao — this is a minimal PR for Japanese support. Would love your input before we follow up with lang="ja" integration.

supernaiter, Jul 12 '25 22:07

@microsoft-github-policy-service agree

supernaiter, Jul 12 '25 22:07

Hi @iofu728 @SiyunZhao — This is a minimal PR to support Japanese prompt tokenization. All tests passed, CLA is signed, and the PR is self-contained.

Would love your feedback or approval when convenient. Thanks!

supernaiter, Jul 12 '25 22:07