
Add `add_language_codes` helper to `NllbTokenizer` for new language codes

Open hunterschep opened this issue 2 weeks ago • 1 comment

Add a small helper to make it safer and easier to register new NLLB language codes when fine-tuning on languages that are not part of FAIRSEQ_LANGUAGE_CODES. This makes it easier for low-resource language researchers to add new language code(s), without having to worry about breaking the list or the <mask> token.

Concretely, it:

  • Adds NllbTokenizer.add_language_codes(...), a convenience wrapper around add_special_tokens
  • Adds an integration-style test
  • Updates the NllbTokenizer docstring so that the new helper appears in the rendered docs.
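In rough outline, the proposed helper is a thin wrapper that appends new codes instead of replacing the existing ones. The sketch below is hypothetical (it uses a minimal stub in place of the real `NllbTokenizer` so it is self-contained); names and exact behavior follow this PR's description, not any merged implementation:

```python
# Hypothetical sketch of the proposed add_language_codes helper.
# StubNllbTokenizer stands in for NllbTokenizer so the example runs
# without downloading a model; the real helper would live on the
# tokenizer class itself.

class StubNllbTokenizer:
    """Minimal stand-in tracking special tokens and a token->id table."""

    def __init__(self, language_codes):
        self.additional_special_tokens = list(language_codes)
        self._vocab = {code: i for i, code in enumerate(language_codes)}

    def add_special_tokens(self, special_tokens_dict,
                           replace_additional_special_tokens=True):
        new_tokens = special_tokens_dict.get("additional_special_tokens", [])
        if replace_additional_special_tokens:
            # Replacing would clobber the FAIRSEQ language codes.
            self.additional_special_tokens = []
        added = 0
        for tok in new_tokens:
            if tok not in self.additional_special_tokens:  # deduplicate
                self.additional_special_tokens.append(tok)
                if tok not in self._vocab:
                    self._vocab[tok] = len(self._vocab)
                added += 1
        return added

    def convert_tokens_to_ids(self, token):
        return self._vocab[token]

    def add_language_codes(self, codes):
        """Proposed convenience wrapper: append new codes without
        clobbering the existing language codes or the <mask> token."""
        self.add_special_tokens(
            {"additional_special_tokens": list(codes)},
            replace_additional_special_tokens=False,
        )
        return {code: self.convert_tokens_to_ids(code) for code in codes}
```

Usage would look like `tokenizer.add_language_codes(["ami_Latn"])`, returning a mapping from each new code to its token ID; calling it again with the same codes is a no-op thanks to the deduplication.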

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [x] Did you write any new necessary tests?

Tests run locally:

  • pytest tests/models/nllb/test_tokenization_nllb.py -sv

Who can review?

Anyone in the community is welcome to review. Tagging tokenizer maintainers who might be interested:

  • @ArthurZucker
  • @itazap

hunterschep avatar Dec 10 '25 02:12 hunterschep

Hey :) Thanks for the PR. It's great to make adding new language codes easier. That said, I think we can leave this feature to be covered by the existing add_special_tokens API:

tokenizer.add_special_tokens(
    {"extra_special_tokens": ["ami_Latn"]},  # "additional_special_tokens" if pre-v5
    replace_extra_special_tokens=False,  # append instead of replacing existing codes
)
# Get token IDs if needed
token_ids = [tokenizer.convert_tokens_to_ids(code) for code in ["ami_Latn"]]

The existing API already handles deduplication (in tokenization_utils_base.py), so calling it multiple times with the same codes is safe! The main thing to remember is setting replace_extra_special_tokens=False to append rather than replace the existing language codes.

Thanks for identifying this use case. If you'd like, this can be added to the NLLB documentation to make it more discoverable for others.

itazap avatar Dec 10 '25 10:12 itazap

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, apertus, arcee, aria

github-actions[bot] avatar Dec 10 '25 18:12 github-actions[bot]

Closing in accordance with the clarifying comment from the maintainers; I'll open a clean PR to update the docs instead.

hunterschep avatar Dec 10 '25 18:12 hunterschep