llm-foundry
multilingual ability
How good is the multilingual ability of MPT?
We haven't performed any multi-lingual evaluation yet. Are there any multi-lingual benchmarks you'd want us to evaluate on?
As for qualitative evaluation, we have had users comment that the chat version of the model is particularly good at translation for a 7B model, at least for Romance languages.
Can you confirm that the only multilingual datasets are the same as LLaMA's, i.e. the 20-language Wikipedia data? Thanks.
Hi @vince62s, the training data mix was curated by our MosaicML NLP team. You can see the details in our blog here: https://www.mosaicml.com/blog/mpt-7b in the 'Data' and 'Appendix' sections. We strongly filtered for English before training so any multilingual support is either luck or leakage.
If you have more specific questions, please let us know, otherwise I'll close this issue tomorrow.
Well, the appendix is not so clear. It says mC4 was filtered for English, but it does not say so for RedPajama, which includes Wikipedia. The goal of RedPajama is to replicate LLaMA, whose Wikipedia data covers 20 languages. A clarification would be great.
Thanks for the close read, @vince62s! We forgot to mention in the blog post that we only used the English subset of Wikipedia. Our current going hypothesis is that there is multilingual data in the Markdown subset of The Stack (which isn't language-filtered), and that language filtering is imperfect—documents that are deemed to "contain English" by whatever filtering tool can still also contain other languages.
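To illustrate the second point, here is a minimal sketch of how a document-level "contains English" check can let other languages through. This is not the filter used for MPT-7B; the stopword list, the `english_score` and `contains_english` helpers, and the 0.15 threshold are all assumptions made up for the example.

```python
# Toy English filter. NOT the filter used for MPT-7B: the stopword list,
# helper names, and threshold are illustrative assumptions.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"}


def english_score(text: str) -> float:
    """Fraction of whitespace-separated tokens that are common English words."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in ENGLISH_STOPWORDS for t in tokens) / len(tokens)


def contains_english(text: str, threshold: float = 0.15) -> bool:
    """Keep a document if enough of its tokens look English."""
    return english_score(text) >= threshold


english_doc = "The model is trained on the English subset of the data."
french_doc = "Le modèle est entraîné sur un sous-ensemble de données."
mixed_doc = english_doc + " " + french_doc  # e.g. a bilingual README

for name, doc in [("english", english_doc), ("french", french_doc), ("mixed", mixed_doc)]:
    print(name, round(english_score(doc), 2), contains_english(doc))
```

The purely French document is dropped, but the mixed document is kept along with its French sentence, because the decision is made once per document rather than per sentence. Real language-ID tools are far more accurate than this heuristic, yet any per-document threshold has the same blind spot.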
OK, then it will not be as good as LLaMA for multilingual ability (conversation in another language) or translation. That's too bad, since it removes a non-negligible part of what an LLM can do. The rest is really good.
@vince62s I recommend testing it. We've been very surprised by its multilingual abilities.
I will, but even when finetuning on Wikipedia + cc-net in 3 languages (EN/DE/FR), the loss remains a little high. I will test further with translation finetuning.
Hi @vince62s, did you get all your questions answered regarding multilingual capabilities? I'm going through issue cleanup and will close this tomorrow if all is well.