fairseq2
Introduce mBart
What does this PR do? Please describe: Implements the mBart model and its text tokenizer. We are able to successfully load the base model.
Testing the text tokenizer:
VocabularyInfo(size=65539, unk_idx=3, bos_idx=0, eos_idx=2, pad_idx=1)
0 <s>
1 <pad>
2 </s>
3 <unk>
4 .
5 ,
65530 pleasant
65531 ▁glycogen
65532 criminalization
65533 ▁varietal
65534 ▁duplicating
65535 ▁protester
65536 [en]
65537 [es]
65538 <mask>
sample_tokens=tensor([65536, 0, 655, 9692, 2049, 19, 22, 146, 31, 29678,
13, 1845, 17277, 4120, 5, 56, 22, 15, 5277, 4,
2], dtype=torch.int32)
decoded_str='Some theories suggest that it may have descendants in Manchuria, but it is unlikely.'
prefix_indices: tensor([65536, 0])
suffix_indices: tensor([2])
encoded_tokens=tensor([65536, 0, 655, 9692, 2049, 19, 22, 146, 31, 29678,
13, 1845, 17277, 4120, 5, 56, 22, 15, 5277, 4,
2])
round_trip_str='Some theories suggest that it may have descendants in Manchuria, but it is unlikely.'
encoded_tokens matches sample_tokens, and round_trip_str matches decoded_str, so the tokenizer round-trips this sentence correctly.
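For reference, the comparison can be reproduced by value with a small script. This is a minimal sketch in plain Python, not the tokenizer's implementation: `wrap_with_affixes` and `strip_affixes` are hypothetical helpers that only illustrate how prefix_indices and suffix_indices frame the sentence tokens, using the ids printed in the output above.

```python
# Token ids copied from the output above.
PREFIX = [65536, 0]  # [en], <s>
SUFFIX = [2]         # </s>

SAMPLE_TOKENS = [65536, 0, 655, 9692, 2049, 19, 22, 146, 31, 29678,
                 13, 1845, 17277, 4120, 5, 56, 22, 15, 5277, 4, 2]


def wrap_with_affixes(body, prefix=PREFIX, suffix=SUFFIX):
    """Hypothetical helper: frame sentence token ids with prefix and suffix ids."""
    return list(prefix) + list(body) + list(suffix)


def strip_affixes(tokens, prefix=PREFIX, suffix=SUFFIX):
    """Inverse of wrap_with_affixes; raises if the affixes are missing."""
    n_pre, n_suf = len(prefix), len(suffix)
    end = len(tokens) - n_suf
    if tokens[:n_pre] != list(prefix) or tokens[end:] != list(suffix):
        raise ValueError("tokens are not framed by the expected prefix/suffix")
    return tokens[n_pre:end]


body = strip_affixes(SAMPLE_TOKENS)
encoded_tokens = wrap_with_affixes(body)

# Compare values only: sample_tokens is printed with dtype=torch.int32,
# while encoded_tokens is printed without an explicit dtype.
assert encoded_tokens == SAMPLE_TOKENS
```

Comparing plain lists sidesteps the dtype difference between the two printed tensors; with real tensors, casting both to the same dtype before comparing would serve the same purpose.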
TODO: Check parity for forward pass through the same checkpoint with fairseq1.
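A parity check of this kind could look roughly like the sketch below. It is only an illustration under stated assumptions: `check_parity` is a hypothetical helper, the tolerances are arbitrary, and two `torch.nn.Linear` modules stand in for the fairseq1 and fairseq2 models, which would need to be loaded from the same checkpoint in a real test.

```python
import torch


def check_parity(reference, candidate, batch, rtol=1e-4, atol=1e-5):
    """Hypothetical helper: run both modules on one input and compare outputs."""
    # Disable dropout and similar training-only behavior.
    reference.eval()
    candidate.eval()
    with torch.no_grad():
        expected = reference(batch)
        actual = candidate(batch)
    # Raises AssertionError with a mismatch report if the outputs diverge.
    torch.testing.assert_close(actual, expected, rtol=rtol, atol=atol)


# Stand-in modules sharing the same weights. A real check would load the
# same checkpoint into the fairseq1 and fairseq2 mBart models instead.
torch.manual_seed(0)
ref_model = torch.nn.Linear(8, 4)
new_model = torch.nn.Linear(8, 4)
new_model.load_state_dict(ref_model.state_dict())

check_parity(ref_model, new_model, torch.randn(2, 8))
```

Running both models in eval mode under `torch.no_grad()` keeps the comparison deterministic, and an explicit tolerance makes small floating-point differences between implementations acceptable.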
Fixes #{issue number}
Does your PR introduce any breaking changes? If yes, please list them: List of all backwards-incompatible changes.
Check list:
- [ ] Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
- [ ] Did you read the contributor guideline?
- [ ] Did you make sure that your PR does only one thing instead of bundling different changes together?
- [ ] Did you make sure to update the documentation with your changes? (if necessary)
- [ ] Did you write any new necessary tests?
- [ ] Did you verify new and existing tests pass locally with your changes?
- [ ] Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)
@cbalioglu I have yet to verify parity with the fairseq mBart model by running forward passes. The asset uses an internal checkpoint; what would be the best way to open-source it?
You can use one of mBART's public checkpoints (e.g. mbart.CC25) to verify parity and include it as an asset card in your PR.