multilingual ability

Open yangjianxin1 opened this issue 1 year ago • 9 comments

How is the multilingual ability of MPT?

yangjianxin1 avatar May 08 '23 09:05 yangjianxin1

We haven't performed any multi-lingual evaluation yet. Are there any multi-lingual benchmarks you'd want us to evaluate on?

bmosaicml avatar May 08 '23 15:05 bmosaicml

As for qualitative eval, users have commented that the chat version of the model is particularly good at translation for a 7B model, at least for Romance languages.

samhavens avatar May 11 '23 00:05 samhavens

Can you confirm that the only multilingual data is the same as LLaMA's, i.e. the 20-language Wikipedia data? Thanks.

vince62s avatar May 11 '23 15:05 vince62s

Hi @vince62s, the training data mix was curated by our MosaicML NLP team. You can see the details in the 'Data' and 'Appendix' sections of our blog post: https://www.mosaicml.com/blog/mpt-7b. We strongly filtered for English before training, so any multilingual support is either luck or leakage.

If you have more specific questions, please let us know, otherwise I'll close this issue tomorrow.

abhi-mosaic avatar May 11 '23 20:05 abhi-mosaic

Well, the appendix is not so clear. It says that mC4 was filtered for English, but it does not say so for RedPajama, which includes Wikipedia. The goal of RedPajama is to replicate LLaMA, whose Wikipedia portion covers 20 languages. A clarification would be great.

vince62s avatar May 12 '23 04:05 vince62s

Thanks for the close read, @vince62s! We forgot to mention in the blog post that we only used the English subset of Wikipedia. Our current going hypothesis is that there is multilingual data in the Markdown subset of The Stack (which isn't language-filtered), and that language filtering is imperfect—documents that are deemed to "contain English" by whatever filtering tool can still also contain other languages.
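
To illustrate that last point, here is a minimal sketch of how a threshold-based English filter can keep a document that also contains another language. This is not the filter used for MPT, which is not described in this thread; the stopword list, the `english_score` and `keep_as_english` helpers, and the 0.15 threshold are all hypothetical and chosen only for the demonstration.

```python
import re

# Hypothetical, tiny stopword list used only for illustration.
# A real pipeline would more likely use a trained language-ID model.
ENGLISH_STOPWORDS = {"the", "and", "is", "of", "to", "in", "that", "it", "for", "with"}


def english_score(text: str) -> float:
    """Return the fraction of word tokens that are common English stopwords."""
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscores).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in ENGLISH_STOPWORDS for token in tokens)
    return hits / len(tokens)


def keep_as_english(text: str, threshold: float = 0.15) -> bool:
    """Keep a document if it looks 'English enough' (assumed threshold)."""
    return english_score(text) >= threshold


english_doc = "The model is trained on text that is mostly in English."
french_doc = "Le modèle est entraîné sur des textes en français."
mixed_doc = english_doc + " " + french_doc

print(keep_as_english(english_doc))  # True
print(keep_as_english(french_doc))   # False
print(keep_as_english(mixed_doc))    # True: the French sentence is kept too
```

The mixed document passes because the score is computed over the whole document, so enough English tokens can carry non-English text past the filter.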

growlix avatar May 15 '23 18:05 growlix

OK, then it will not be as good as LLaMA for multilingual ability (conversation in another language) or translation. That's too bad, since it removes a non-negligible part of the LLM's features. The rest is really good.

vince62s avatar May 15 '23 19:05 vince62s

@vince62s I recommend testing it. We've been very surprised by its multilingual abilities.

samhavens avatar May 15 '23 19:05 samhavens

I will, but even when finetuning on Wikipedia + CC-Net in 3 languages (EN/DE/FR), the loss remains a little high. I will test further with translation finetuning.

vince62s avatar May 15 '23 20:05 vince62s

Hi @vince62s, did you get all your questions answered regarding multilingual capabilities? I'm going through issue cleanup and will close this tomorrow if all is well.

abhi-mosaic avatar May 17 '23 22:05 abhi-mosaic