Open-Assistant
The Removal of WikiHow
Please read this first: https://github.com/LAION-AI/Open-Assistant/pull/3034 WikiHow has been removed from the OA dataset. I kindly ask that this be reconsidered, as it sets a dangerous precedent for all datasets out there. I strongly believe that the principles and ethics of this decision should be openly discussed, and that a decision of this magnitude should not be made hastily within a single day. It is important that all opinions and arguments are heard before coming to a conclusion.
Here is what I wrote in the PR: Regarding the removal of the WikiHow dataset, I share your disappointment and believe that the decision made by the WikiHow team was unjustified considering the nature of both projects. The fact that the wikiHow content is licensed under a Creative Commons Unported License indicates that reuse and distribution are permitted, provided that attribution is given (if they want credit, let's just give it to them). Therefore, I don't see any clear reason why the wikiHow dataset couldn't be incorporated into the Open Assistant project.
It's important to note that open-source projects rely heavily on the contributions made by many individuals and other projects. It wouldn't make sense to exclude valuable, relevant sources simply because someone claims ownership over them. This is not how open source works, and I imagine that many contributors to the wikiHow platform, who spend countless hours contributing their knowledge under a Creative Commons license, would agree! Openness, sharing, and collaboration between projects should be encouraged to ensure the continued advancement of machine learning technology.
Finally, I would like to add that many closed-source LLMs like ChatGPT also rely on open-source projects to train their models, and as far as I know they do not offer an opt-out option. In contrast, Open Assistant is an open-source project that gives anyone the opportunity to contribute to and improve the dataset. By removing WikiHow articles, we are not only limiting the knowledge pool but also unfairly handicapping the project's potential. What is the next exclusion, the Wikimedia Foundation?
I think WikiHow will try to either
- Add AI assistance or
- Say "AI is evil and we want to make sure we are not part of plagiarism (and that includes third-party access)"
I agree with your take and would also keep WikiHow as part of the OA dataset. Scraping copyrighted data and training models on it (and this data is under CC) is legal in Europe and the USA. If they are not happy with the current form, they should propose a target form or options other than retiring the data.
Do we know the reason given for their request? Seeing as OA is open source and the WikiHow data is (according to this thread, I haven't checked) released under Creative Commons, this, well, makes no sense. (OK, yes, it makes sense if they are planning to release their own LLM chatbot and make money off it.) Like others said, this sets a dangerous precedent.