Open-Assistant Multilingual dialogue data

Multilingual dialogue data

Open pruksmhc opened this issue 2 years ago • 3 comments

Once we're ready for a multilingual effort, we should include dialogue data in the instruction dataset.

Here are some sources:

XPersona - chit chat data
Multi2WoZ (task oriented dialogue) - https://github.com/umanlp/Multi2WOZ
GlobalWoZ

The first step is to get an SFT model that is sufficiently multilingual (I know Eleuther's Polyglot is working on a cleaner multilingual corpus and is crawling multilingual data).
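
For reference, here is a rough sketch of how a multi-turn dialogue could be flattened into prompt/response records for an instruction dataset. The field names (`speaker`, `text`, `lang`, `prompt`, `response`) and the helper `dialogue_to_pairs` are made up for illustration; they are not the actual schema of XPersona, Multi2WoZ, GlobalWoZ, or Open-Assistant, so each source would need its own loader.

```python
from typing import Dict, List


def dialogue_to_pairs(turns: List[Dict[str, str]], lang: str) -> List[Dict[str, str]]:
    """Turn a user/system dialogue into (prompt, response) records.

    Hypothetical schema: each turn is {"speaker": "user"|"system", "text": ...}.
    Every system turn becomes a response, with all earlier turns as the prompt.
    """
    pairs: List[Dict[str, str]] = []
    history: List[str] = []
    for turn in turns:
        if turn["speaker"] == "system" and history:
            pairs.append(
                {
                    "lang": lang,
                    "prompt": "\n".join(history),
                    "response": turn["text"],
                }
            )
        # Keep the running context so later responses see the full dialogue.
        history.append(f"{turn['speaker']}: {turn['text']}")
    return pairs


# Toy example (invented, not taken from any of the datasets above).
example = [
    {"speaker": "user", "text": "I need a cheap hotel."},
    {"speaker": "system", "text": "Which area?"},
    {"speaker": "user", "text": "The centre."},
    {"speaker": "system", "text": "I found two options."},
]
pairs = dialogue_to_pairs(example, lang="en")
```

Keeping a `lang` tag on every record would make it easier to filter out a language later if native speakers flag its data as low quality.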

Jan 22 '23 23:01 pruksmhc

Update: Just glanced over the Thai dataset for GlobalWoZ. It is of fairly poor quality, with a lot of grammatical errors. We should get people with various language backgrounds to verify the quality of these datasets.

Jan 22 '23 23:01 pruksmhc

Watching this.

Jan 24 '23 01:01 sbmaruf

@sbmaruf if you would like to get started with this, go for it! I'm working on the WikiLingua dataset.

Jan 24 '23 03:01 pruksmhc

Closing old data issue.

Jun 14 '23 09:06 andreaskoepf
