Open-Assistant
Multilingual dialogue data
Once we're ready for a multilingual effort, we should include dialogue data in the instruction dataset.
Here are some sources:
- XPersona - chit chat data
- Multi2WoZ (task oriented dialogue) - https://github.com/umanlp/Multi2WOZ
- GlobalWoZ
The first step is to get an SFT model that is sufficiently multilingual (I know Eleuther's Polyglot is working on a cleaner multilingual corpus and is crawling multilingual data).
Update: I just glanced over the Thai dataset for GlobalWoZ. It is of fairly poor quality, with a lot of grammatical errors. We should get people with various language backgrounds to verify the quality of these datasets.
watching this.
@sbmaruf if you would like to get started with this, go for it! I'm doing the WikiLingua dataset.
Closing old data issue.