Multilingual dialogue data

Open pruksmhc opened this issue 2 years ago • 3 comments

Once we're ready for a multilingual effort, we should include dialogue data in the instruction dataset.

Here are some sources:

  • XPersona - chit chat data
  • Multi2WoZ (task oriented dialogue) - https://github.com/umanlp/Multi2WOZ
  • GlobalWoZ

The first step is to get an SFT model that is sufficiently multilingual (I know Eleuther's Polyglot is working on a cleaner multilingual corpus and is crawling multilingual data).

pruksmhc avatar Jan 22 '23 23:01 pruksmhc

Update: Just glanced over the Thai dataset for GlobalWoZ - it is fairly poor quality, with a lot of grammatical errors. We should get people with various language backgrounds to verify the quality of these datasets.

pruksmhc avatar Jan 22 '23 23:01 pruksmhc

watching this.

sbmaruf avatar Jan 24 '23 01:01 sbmaruf

@sbmaruf if you would like to get started with this, go for it! I'm doing the WikiLingua dataset.

pruksmhc avatar Jan 24 '23 03:01 pruksmhc

Closing old data issue.

andreaskoepf avatar Jun 14 '23 09:06 andreaskoepf