Use Zhihu as training data for Chinese question answering problems

Open wangrui6 opened this issue 2 years ago • 6 comments

Apologies in advance if this duplicates other issues about multilingual training data. Zhihu is a Chinese counterpart to Quora. Its content is naturally organized in a Q&A format, and its Reddit-like voting system makes it easier to distinguish good answers from mediocre ones. Do we have a plan to build a multilingual training dataset, for example with Chinese prompts and responses? Also, what is the policy on scraping data from websites for training purposes? Last question: do we want to use machine translation to address multilingual coverage, or would we also like to create non-English prompts and responses?

wangrui6 avatar Feb 11 '23 07:02 wangrui6

For those who want to work on expanding the scope of OpenAssistant into Chinese, please DM me and let's chat about how to roll out the first version of the feature.

wangrui6 avatar Feb 12 '23 19:02 wangrui6

@wangrui6 I am interested in helping.

GeeYangML avatar Feb 13 '23 12:02 GeeYangML

Hi, I have assigned this issue to both of you. I wanted to see whether you've made progress or need anything. @wangrui6 @MLMonkATGY

huu4ontocord avatar Feb 24 '23 06:02 huu4ontocord

Hi,

We got in touch with each other and researched Zhihu's open APIs so that we can onboard it to the Open Assistant data collection portal: https://open-assistant.io/dashboard

We are following https://projects.laion.ai/Open-Assistant/docs/data/datasets to first create datasets on HF. Before that, I am checking whether their license is permissive.

wangrui6 avatar Feb 24 '23 19:02 wangrui6

@ontocord A proof-of-concept dataset in parquet format has been uploaded to https://huggingface.co/datasets/wangrui6/Zhihu-KOL

Can you double check the format?
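
A format check of this kind could be sketched roughly as follows. This is only a minimal illustration, not the project's actual validation: the column names `INSTRUCTION`, `RESPONSE`, `SOURCE`, and `METADATA` are assumptions (the thread does not state the expected schema), and the helper names `validate_row` and `validate_rows` are hypothetical.

```python
import json

# Assumed column layout; the thread itself does not state the schema.
REQUIRED_COLUMNS = ("INSTRUCTION", "RESPONSE", "SOURCE")
OPTIONAL_JSON_COLUMN = "METADATA"


def validate_row(row):
    """Return a list of problems found in one row (empty list means OK)."""
    problems = []
    # Every required column must hold a non-empty string.
    for col in REQUIRED_COLUMNS:
        value = row.get(col)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{col} is missing or empty")
    # The optional metadata column, if present, must parse as JSON.
    metadata = row.get(OPTIONAL_JSON_COLUMN)
    if metadata is not None:
        try:
            json.loads(metadata)
        except (TypeError, ValueError):
            problems.append(f"{OPTIONAL_JSON_COLUMN} is not valid JSON")
    return problems


def validate_rows(rows):
    """Map row index -> problems, keeping only rows that have problems."""
    report = {}
    for index, row in enumerate(rows):
        problems = validate_row(row)
        if problems:
            report[index] = problems
    return report
```

The rows could come from any reader that yields one dictionary per record, for example `pandas.read_parquet(path).to_dict("records")` on a locally downloaded file. An empty report means every row passed the assumed checks.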

wangrui6 avatar Feb 25 '23 00:02 wangrui6

@ontocord Do we need to append a citation to the source website at the end of each answer?

wangrui6 avatar Feb 25 '23 05:02 wangrui6