FastChat icon indicating copy to clipboard operation
FastChat copied to clipboard

Language distribution of ShareGPT 70K conversation dataset for FastChat T5

Open Mihir2 opened this issue 2 years ago • 2 comments

What are all the languages present in the ShareGPT 70,000 conversation dataset which was used to fine-tune FastChat-T5?

The ReadMe file points to data_cleaning.md which was used to get data from ShareGPT. Within data_cleaning.md seems like sharegpt_clean_lang.json contains the list of languages in consideration and some languages are skipped.

Mihir2 avatar Jun 05 '23 08:06 Mihir2

how can i finetune with bounds of datasets?

kkkparty avatar Jul 20 '23 03:07 kkkparty

What are all the languages present in the ShareGPT 70,000 conversation dataset which was used to fine-tune FastChat-T5?

The ReadMe file points to data_cleaning.md which was used to get data from ShareGPT. Within data_cleaning.md seems like sharegpt_clean_lang.json contains the list of languages in consideration and some languages are skipped.

Hi I have the same question about the language distribution, do you have any idea?

Z1zs avatar Apr 08 '24 12:04 Z1zs