Open-Assistant
Use Zhihu as training data for Chinese question answering problems
Apologies if this duplicates other issues about multilingual training data. Zhihu is the Chinese equivalent of Quora. Its content is naturally organized in a Q&A format, and its Reddit-like voting system makes it easier to distinguish good answers from mediocre ones. Do we have a plan to build a multilingual training dataset, for example with Chinese prompts and responses? Also, what is the policy on scraping data from websites for training purposes? Last question: do we want to use machine translation to address the multilingual problem, or would we like to create non-English prompts and responses as well?
For those who want to work on expanding the scope of OpenAssistant into Chinese, please DM me and let's chat about how to roll out the first version of the feature.
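To make the voting idea above concrete, here is a minimal sketch of how upvotes could be used to keep only the strongest answer per question. The record layout (`question`, `answer`, `upvotes`), the sample data, and the `min_upvotes` threshold are all assumptions for illustration; they are not Zhihu's actual schema or an agreed Open-Assistant pipeline.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative assumptions,
# not the real Zhihu schema.
answers = [
    {"question": "Q1", "answer": "A", "upvotes": 120},
    {"question": "Q1", "answer": "B", "upvotes": 3},
    {"question": "Q2", "answer": "C", "upvotes": 45},
    {"question": "Q2", "answer": "D", "upvotes": 45},
]


def top_answers(records, min_upvotes=10, k=1):
    """Return up to k highest-voted answers per question.

    Answers below min_upvotes are dropped before ranking.
    """
    grouped = defaultdict(list)
    for rec in records:
        if rec["upvotes"] >= min_upvotes:
            grouped[rec["question"]].append(rec)
    # sorted() is stable, so answers with equal votes keep their input order.
    return {
        question: sorted(recs, key=lambda r: r["upvotes"], reverse=True)[:k]
        for question, recs in grouped.items()
    }


best = top_answers(answers)
```

Questions whose answers all fall below the threshold simply drop out of the result, which may or may not be the desired behavior for a training set.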
@wangrui6 I am interested in helping.
Hi, I have assigned this to both of you. I wanted to see whether you've made progress and whether you need anything. @wangrui6 @MLMonkATGY
Hi,
We got in touch with each other and researched Zhihu's open APIs, with the goal of onboarding it to the Open Assistant data collection portal: https://open-assistant.io/dashboard
We are following https://projects.laion.ai/Open-Assistant/docs/data/datasets to first create datasets on HF. Before that, I am checking whether their license is permissive enough.
@ontocord A proof-of-concept dataset in parquet format has been uploaded to https://huggingface.co/datasets/wangrui6/Zhihu-KOL
Can you double check the format?
@ontocord Do we need to append a citation at the end of each answer taken from the website?