
Added datasets in 5 languages

Open Lednik7 opened this issue 2 years ago • 3 comments

Hi, I used googletrans==3.1.0.0 to translate the dataset into five languages: Russian, Kazakh, Spanish, Italian, and French.

translated-datasets/databricks-dolly-15k-{language}.jsonl

  • translation of databricks-dolly-15k.jsonl into the given language
  • up to 600 records were lost during translation
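
For anyone who wants to reproduce a file like this, here is a minimal sketch of a translation loop in which a failed request drops the record, which is one way records can go missing. It is not the script used for these files. The `translate` callable is a placeholder for whatever translator you plug in (with googletrans it could be something like `lambda t: Translator().translate(t, dest="ru").text`), and `translate_records` is a hypothetical helper name.

```python
import json

# Text fields of a databricks-dolly-15k record; "category" is left untranslated.
TEXT_FIELDS = ("instruction", "context", "response")


def translate_records(lines, translate):
    """Translate JSONL lines with `translate`, a str -> str callable (assumed).

    Returns (kept_records, dropped_count). A record is dropped as a whole
    if translating any of its text fields raises an exception.
    """
    kept, dropped = [], 0
    for line in lines:
        record = json.loads(line)
        try:
            for field in TEXT_FIELDS:
                text = record.get(field, "")
                if text:  # empty fields (often "context") are left as-is
                    record[field] = translate(text)
        except Exception:
            dropped += 1
            continue
        kept.append(record)
    return kept, dropped
```

Comparing `dropped` against the number of input lines gives the loss per language.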

translated-datasets/databricks-dolly-15k-parallel-corpus-6.csv

  • keyed by uid, for records that have translations in several languages
  • also contains the original dataset
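
A rough sketch of how a uid-keyed parallel file could be assembled from per-language translations. The column layout here (`uid`, `en`, then one column per language code) is an assumption for illustration and may not match the actual CSV; `build_parallel_rows` and `rows_to_csv` are hypothetical helper names.

```python
import csv
import io


def build_parallel_rows(original, translations):
    """Join texts by uid.

    original:     dict uid -> English text
    translations: dict language code -> (dict uid -> translated text)
    A uid missing from a language gets an empty string in that column.
    """
    langs = sorted(translations)
    rows = []
    for uid, text in original.items():
        row = {"uid": uid, "en": text}
        for lang in langs:
            row[lang] = translations[lang].get(uid, "")
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Leaving missing translations as empty cells keeps every uid from the original in the file, so gaps from lost records stay visible.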

Lednik7 avatar Apr 20 '23 18:04 Lednik7

Oh that's neat. I'm not sure whether the other project folks want this hosted here vs just letting people host their own derived datasets. It's a great point though, you can probably get fairly far in making a model for a different language by machine-translating the fine-tuning dataset. It still doesn't mean the base model's dataset is in that language, and that's where it 'learned' most of the language from. But I'm personally curious how well the results work for these languages!

srowen avatar Apr 20 '23 22:04 srowen

https://www.kaggle.com/datasets/mygaps/databricks-dolly-15k-parallel-corpus-6

I have published the dataset on Kaggle.

Lednik7 avatar Apr 20 '23 22:04 Lednik7

You can put them on HF too! I'm seeing things like https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja as well

srowen avatar Apr 21 '23 16:04 srowen

Rather than checking translations of the dataset into this repo, I think it's preferable to upload them to Hugging Face. We can add a note in the README suggesting that people check Hugging Face for the various translations.

matthayes avatar Apr 21 '23 23:04 matthayes