Add `add_language_codes` helper to `NllbTokenizer` for new language codes
Add a small helper that makes it safer and easier to register new NLLB language codes when fine-tuning on languages that are not part of `FAIRSEQ_LANGUAGE_CODES`. Low-resource language researchers can then add one or more language codes without worrying about breaking the existing list of codes or the `<mask>` token.
Concretely, it:
- Adds `NllbTokenizer.add_language_codes(...)`, a convenience wrapper around `add_special_tokens`
- Adds an integration-style test
- Updates the `NllbTokenizer` docstring so that the new helper appears in the rendered docs.
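For illustration only, a wrapper of this kind could look roughly like the sketch below. It is not the diff from this PR: the free-function form, the return value, and `StubTokenizer` are hypothetical, and the `extra_special_tokens` key with `replace_extra_special_tokens=False` assumes the v5-style `add_special_tokens` API (older versions use `additional_special_tokens`). The stub exists only so the sketch runs without downloading a model.

```python
from typing import Dict, Iterable, List


def add_language_codes(tokenizer, codes: Iterable[str]) -> Dict[str, int]:
    """Hypothetical helper: register new language codes and return their ids."""
    # Drop duplicates in the input while preserving the caller's order.
    unique_codes = list(dict.fromkeys(codes))
    # Assumption: v5-style key and flag; the flag asks for append, not replace.
    tokenizer.add_special_tokens(
        {"extra_special_tokens": unique_codes},
        replace_extra_special_tokens=False,
    )
    return {code: tokenizer.convert_tokens_to_ids(code) for code in unique_codes}


class StubTokenizer:
    """Toy stand-in for a tokenizer, used only to make this sketch runnable."""

    def __init__(self, vocab: List[str]) -> None:
        self.token_to_id = {token: index for index, token in enumerate(vocab)}

    def add_special_tokens(self, special_tokens_dict, replace_extra_special_tokens=True) -> int:
        # The stub only ever appends unseen tokens, so existing ids never move.
        added = 0
        for token in special_tokens_dict["extra_special_tokens"]:
            if token not in self.token_to_id:
                self.token_to_id[token] = len(self.token_to_id)
                added += 1
        return added

    def convert_tokens_to_ids(self, token: str) -> int:
        return self.token_to_id[token]


stub = StubTokenizer(["<s>", "</s>", "eng_Latn", "<mask>"])
new_ids = add_language_codes(stub, ["ami_Latn", "ami_Latn"])
print(new_ids)  # {'ami_Latn': 4}
```

With a real tokenizer, the call would be `add_language_codes(tokenizer, ["ami_Latn"])`; whether the ids of existing tokens such as `<mask>` stay put depends on the library, which the stub does not model.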
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Tests run locally:
pytest tests/models/nllb/test_tokenization_nllb.py -sv
Who can review?
Anyone in the community is welcome to review. Tagging tokenizer maintainers who might be interested:
- @ArthurZucker
- @itazap
Hey :) Thanks for the PR. It's great to see support for new language codes. I think we can leave this feature to be covered by the existing `add_special_tokens` API:
tokenizer.add_special_tokens(
    {"extra_special_tokens": ["ami_Latn"]},  # or "additional_special_tokens" if pre v5
    replace_extra_special_tokens=False,  # append to the existing language codes instead of replacing them
)

# Get token IDs if needed
token_ids = [tokenizer.convert_tokens_to_ids(code) for code in ["ami_Latn"]]
The existing API already handles deduplication (in tokenization_utils_base.py), so calling it multiple times with the same codes is safe! The main thing to remember is setting replace_extra_special_tokens=False to append rather than replace the existing language codes.
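To make the append-versus-replace distinction concrete, here is a toy model of how the list of extra special tokens might evolve in each mode. It is a plain-Python sketch, not the code in tokenization_utils_base.py; the function name `update_extra_special_tokens` is hypothetical, and it models only the list of codes, not the vocabulary.

```python
from typing import List


def update_extra_special_tokens(existing: List[str], new: List[str], replace: bool) -> List[str]:
    """Toy model of the two modes; not the library's implementation."""
    # Replace mode starts from an empty list; append mode keeps what is there.
    result = [] if replace else list(existing)
    for token in new:
        # Skip tokens that are already present, so repeated calls are harmless.
        if token not in result:
            result.append(token)
    return result


existing = ["eng_Latn", "fra_Latn"]
appended = update_extra_special_tokens(existing, ["ami_Latn"], replace=False)
replaced = update_extra_special_tokens(existing, ["ami_Latn"], replace=True)
again = update_extra_special_tokens(appended, ["ami_Latn"], replace=False)

print(appended)           # ['eng_Latn', 'fra_Latn', 'ami_Latn']
print(replaced)           # ['ami_Latn']
print(again == appended)  # True
```

In this model, forgetting the append flag would leave only the new code in the list, which is the failure mode the comment above warns about.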
Thanks for identifying this use case. If you'd like, this can be added to the NLLB documentation to make it more discoverable for others.
[For maintainers] Suggested jobs to run (before merge)
run-slow: afmoe, apertus, arcee, aria
Closing in accordance with the clarifying comment from collaborators. I will open a clean PR to update the docs.