Open-Assistant
Make Open Assistant Content Publicly Accessible
Open Assistant is a valuable resource for curating RLHF data for a language model, but its potential is underutilized because its content is hard to access. As a highly curated source of knowledge, it could also serve people seeking information in a specific area. Without an easy way to access that information, it cannot fulfil that role.
I propose that the content on Open Assistant be made publicly available, making it a source of knowledge like Wikipedia or StackExchange. This would allow a wider audience to benefit from the curated information and expertise available on the website. The project could have started from the Codidact or Discourse codebases, which would have provided a more user-friendly platform for accessing and using the curated content. I hope it moves in that direction.
The collected data will be released under CC BY 4.0 after cleaning. Regarding an interface to browse the dataset or more freedom during task selection: please feel free to write a complete proposal, or develop in that direction and send in PRs.
I think that it will be super valuable for even non-cleaned data to be released, so that other projects can use different cleaning techniques to create their own versions of Open Assistant with different personalities, skills, specialisations, use cases, etc.
I think that it will be super valuable for even non-cleaned data to be released,
I think "cleaning" in this case means removal of personal data and other data that could get people in trouble if released without at least making a sincere effort to remove it. Hopefully this will be a very small amount. As far as I understood, all data that can lawfully be released will be released, both very low and high quality, so anyone can do their own filtering.
Yes, we don't want to release PII or similar, so we will do some cleaning before any releases. We have other issues for tracking the data release, so I'm closing this. Feel free to comment again if you have any follow-up questions.