dolly
Added datasets in 5 languages
Hi, I used googletrans==3.1.0.0 to translate the dataset into five languages: Russian, Kazakh, Spanish, Italian, and French.
translated-datasets/databricks-dolly-15k-{language}.jsonl
- translation of databricks-dolly-15k.jsonl into the given language
- some records (up to 600) were lost during translation
translated-datasets/databricks-dolly-15k-parallel-corpus-6.csv
- includes a uid for each record that has translations in several languages
- contains the original dataset
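For anyone who wants to reproduce something similar, here is a minimal sketch of how such a translation pass might look. It is not the script used for this PR. It assumes the usual databricks-dolly-15k fields (`instruction`, `context`, `response`, `category`) and the synchronous `Translator.translate(text, dest=...)` call from googletrans 3.x. The `uid` taken from the record's position, the helper names, and the choice to drop a record when any field fails to translate are all hypothetical; the last one is only one possible way records could get lost.

```python
import json
from typing import Callable, Iterable, Iterator

# Text fields of a databricks-dolly-15k record; "category" is left untranslated.
TEXT_FIELDS = ("instruction", "context", "response")


def translate_records(
    records: Iterable[dict],
    translate: Callable[[str], str],
) -> Iterator[dict]:
    """Yield translated copies of records, skipping any whose translation fails."""
    for uid, record in enumerate(records):  # uid = position in the source (assumption)
        try:
            translated = {
                # Empty or missing fields are kept as "" and never sent for translation.
                field: translate(record[field]) if record.get(field) else ""
                for field in TEXT_FIELDS
            }
        except Exception:
            # Assumed policy: one failed field drops the whole record.
            continue
        yield {"uid": uid, "category": record.get("category", ""), **translated}


def googletrans_translator(dest: str) -> Callable[[str], str]:
    """Build a translate function backed by googletrans 3.x (needs network access)."""
    from googletrans import Translator  # imported lazily so the rest runs offline

    translator = Translator()
    return lambda text: translator.translate(text, dest=dest).text


def write_jsonl(records: Iterable[dict], path: str) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

With googletrans installed, something like `write_jsonl(translate_records(rows, googletrans_translator("ru")), "databricks-dolly-15k-ru.jsonl")` would produce one file per language. Because the `uid` is the same across languages, the per-language files can later be joined on it to build a parallel corpus.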
Oh that's neat. I'm not sure whether the other project folks want this hosted here vs just letting people host their own derived datasets. It's a great point though, you can probably get fairly far in making a model for a different language by machine-translating the fine-tuning dataset. It still doesn't mean the base model's dataset is in that language, and that's where it 'learned' most of the language from. But I'm personally curious how well the results work for these languages!
I have published the dataset on Kaggle:
https://www.kaggle.com/datasets/mygaps/databricks-dolly-15k-parallel-corpus-6
You can put them on HF too! I'm seeing things like https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja as well
Rather than checking translations of the dataset into this repo, I think it's preferable to upload them to Hugging Face. We can add a note to the README suggesting that people check Hugging Face for translations.