dolly Added datasets in 5 languages

Open Lednik7 opened this issue 2 years ago • 3 comments

Hi, I used googletrans==3.1.0.0 to translate the dataset into five languages: Russian, Kazakh, Spanish, Italian, and French.

translated-datasets/databricks-dolly-15k-{language}.jsonl

Translation of databricks-dolly-15k.jsonl into the given language.
Part of the data (up to 600 records) was lost during translation.
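A minimal sketch of how such a translation pass might look, and why some records can go missing. It is not the script used for these files: the `translate` callable is injected, and dropping a record when any of its fields fails to translate is an assumed policy. The field names `instruction`, `context`, `response`, and `category` are those of databricks-dolly-15k.jsonl.

```python
import json
from typing import Callable, Iterable, Iterator

# Text fields of a databricks-dolly-15k record; "category" is left untranslated.
TEXT_FIELDS = ("instruction", "context", "response")


def translate_records(
    lines: Iterable[str],
    translate: Callable[[str], str],
) -> Iterator[dict]:
    """Yield translated records, skipping any record whose translation fails."""
    for line in lines:
        record = json.loads(line)
        try:
            for field in TEXT_FIELDS:
                text = record.get(field, "")
                if text:  # leave empty fields (e.g. no context) untouched
                    record[field] = translate(text)
        except Exception:
            # Assumed policy: drop the record rather than keep a partial translation.
            continue
        yield record
```

With googletrans 3.x, `translate` could be something like `lambda text: Translator().translate(text, dest="ru").text`. Each yielded record can then be written back with `json.dumps(record, ensure_ascii=False)`, one per line.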

translated-datasets/databricks-dolly-15k-parallel-corpus-6.csv

Includes a uid for each record that has translations into several languages.
Contains the original dataset.
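One way such a parallel corpus could be assembled is sketched below. It assumes each language is available as a mapping from uid to text and keeps only the uids present in every language; both that rule and the column names (`uid` plus one column per language code) are illustrative assumptions, not the actual layout of the CSV.

```python
import csv
import io
from typing import Dict, List


def build_parallel_rows(by_lang: Dict[str, Dict[int, str]]) -> List[dict]:
    """Keep only uids present in every language and put their texts side by side."""
    if not by_lang:
        return []
    shared = set.intersection(*(set(texts) for texts in by_lang.values()))
    return [
        {"uid": uid, **{lang: by_lang[lang][uid] for lang in by_lang}}
        for uid in sorted(shared)
    ]


def rows_to_csv(rows: List[dict]) -> str:
    """Serialize the rows to CSV text with a header row."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Records dropped during translation simply never appear in a language's mapping, so they fall out of the intersection.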

Apr 20 '23 18:04 Lednik7

Oh that's neat. I'm not sure whether the other project folks want this hosted here vs just letting people host their own derived datasets. It's a great point though, you can probably get fairly far in making a model for a different language by machine-translating the fine-tuning dataset. It still doesn't mean the base model's dataset is in that language, and that's where it 'learned' most of the language from. But I'm personally curious how well the results work for these languages!

Apr 20 '23 22:04 srowen

https://www.kaggle.com/datasets/mygaps/databricks-dolly-15k-parallel-corpus-6

I have published the dataset on Kaggle.

Apr 20 '23 22:04 Lednik7

You can put them on HF too! I'm seeing things like https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja as well

Apr 21 '23 16:04 srowen

Rather than checking translations of the dataset into this repo, I think it's preferable to upload them to Hugging Face. We can add a note in the README to suggest checking Hugging Face for various translations.

Apr 21 '23 23:04 matthayes
