fairseq2 Introduce mBart

What does this PR do? Please describe: Implements the mBart model and its text tokenizer. The base model loads successfully.

Testing the text tokenizer:

VocabularyInfo(size=65539, unk_idx=3, bos_idx=0, eos_idx=2, pad_idx=1)
0 <s>
1 <pad>
2 </s>
3 <unk>
4 .
5 ,
65530 pleasant
65531 ▁glycogen
65532 criminalization
65533 ▁varietal
65534 ▁duplicating
65535 ▁protester
65536 [en]
65537 [es]
65538 <mask>
sample_tokens=tensor([65536,     0,   655,  9692,  2049,    19,    22,   146,    31, 29678,
           13,  1845, 17277,  4120,     5,    56,    22,    15,  5277,     4,
            2], dtype=torch.int32)
decoded_str='Some theories suggest that it may have descendants in Manchuria, but it is unlikely.'
prefix_indices:  tensor([65536,     0])
suffix_indices:  tensor([2])
encoded_tokens=tensor([65536,     0,   655,  9692,  2049,    19,    22,   146,    31, 29678,
           13,  1845, 17277,  4120,     5,    56,    22,    15,  5277,     4,
            2])
round_trip_str='Some theories suggest that it may have descendants in Manchuria, but it is unlikely.'

The values in encoded_tokens match sample_tokens, and round_trip_str matches decoded_str, so the tokenizer round-trips this sample without loss.
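The comparison above can be expressed as a small assertion. This is only a sketch: it uses plain Python lists copied from the printed output rather than the tokenizer itself, and the helper name `check_round_trip` is hypothetical, not part of fairseq2.

```python
# Token ids copied from the output above; plain lists stand in for the tensors.
SAMPLE_TOKENS = [65536, 0, 655, 9692, 2049, 19, 22, 146, 31, 29678,
                 13, 1845, 17277, 4120, 5, 56, 22, 15, 5277, 4, 2]
ENCODED_TOKENS = list(SAMPLE_TOKENS)  # the printed encoder output has the same ids
PREFIX = [65536, 0]  # [en] language token followed by <s>
SUFFIX = [2]         # </s>


def check_round_trip(sample, encoded, prefix, suffix):
    """Hypothetical helper: True if `encoded` reproduces `sample` and is
    wrapped in the expected prefix and suffix indices."""
    same_ids = list(sample) == list(encoded)
    has_prefix = list(encoded[:len(prefix)]) == list(prefix)
    has_suffix = list(encoded[len(encoded) - len(suffix):]) == list(suffix)
    return same_ids and has_prefix and has_suffix


ROUND_TRIP_OK = check_round_trip(SAMPLE_TOKENS, ENCODED_TOKENS, PREFIX, SUFFIX)
```

Comparing by value is deliberate: the printed sample_tokens is int32 while encoded_tokens is shown without a dtype, which suggests the default integer type, so a check on tensors would need to cast to a common dtype first.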

TODO: Check parity for forward pass through the same checkpoint with fairseq1.
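A parity check of this kind is usually structured as: load the same checkpoint in both implementations, feed both the same batch, and compare the outputs within a tolerance. The sketch below shows only the comparison step, using plain Python floats. The function names, the tolerances, and the numbers are all assumptions for illustration; real values would come from the two models' forward passes.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def outputs_match(a, b, rel_tol=1e-5, abs_tol=1e-6):
    """Hypothetical parity test: every pair of values agrees within tolerance."""
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


# Made-up numbers standing in for flattened outputs of the two implementations.
FAIRSEQ1_OUT = [0.1250, -1.5000, 2.7500]
FAIRSEQ2_OUT = [0.1250001, -1.5000002, 2.7500001]

PARITY_OK = outputs_match(FAIRSEQ1_OUT, FAIRSEQ2_OUT)
```

Reporting `max_abs_diff` alongside the pass/fail result helps when picking tolerances. With tensors, `torch.testing.assert_close` serves the same purpose.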

Fixes #{issue number}

Does your PR introduce any breaking changes? If yes, please list them: List of all backwards-incompatible changes.

Check list:

[ ] Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
[ ] Did you read the contributor guideline?
[ ] Did you make sure that your PR does only one thing instead of bundling different changes together?
[ ] Did you make sure to update the documentation with your changes? (if necessary)
[ ] Did you write any new necessary tests?
[ ] Did you verify new and existing tests pass locally with your changes?
[ ] Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)

Sep 09 '23 22:09 kauterry

@cbalioglu I'm yet to verify parity with the fairseq mBart model by running forward passes. The asset has an internal checkpoint, wondering what the best way to open-source that would be.

Sep 11 '23 17:09 kauterry

> @cbalioglu I'm yet to verify parity with the fairseq mBart model by running forward passes. The asset has an internal checkpoint, wondering what the best way to open-source that would be.

You can use one of mBART's public checkpoints (e.g. mbart.CC25) to verify parity and include it as an asset card in your PR.

Sep 11 '23 17:09 cbalioglu
