A gamedev question-answer dataset proposal

Open kaydotdev opened this issue 2 years ago • 6 comments

Hey, everyone! 👋

Back in the day, I collected a question-answer dataset from a Unity3D Gaming Forum to build a chatbot for assisting in game development. Although the project was abandoned due to circumstances, I believe the dataset will be helpful for Open Assistant. It contains both unprocessed forum questions in the raw.json file and cleaned and labeled question-answer pairs (12K+ samples) in the intents.json file. It's publicly available on the Kaggle platform under the CC0: Public Domain license: https://www.kaggle.com/datasets/antonkozyriev/unity3d-faq?select=intents.json

Also, the dataset does not contain personal information about forum users. The pipelines preprocessed all IDs and user mentions.

I hope you will find my dataset useful and let me know whether I should upload it on HuggingFace!

kaydotdev avatar Feb 04 '23 17:02 kaydotdev

@antonAce that is really cool, and it is great that the data is public domain. I wonder if it was also public domain on the game dev forums it was scraped from?

AbdBarho avatar Feb 05 '23 08:02 AbdBarho

@AbdBarho yep, all information scraped from the website is accessible to a member of the general public, both raw and labeled data are in the public domain.

kaydotdev avatar Feb 05 '23 21:02 kaydotdev

thank you!

huu4ontocord avatar Feb 07 '23 04:02 huu4ontocord

yep, all information scraped from the website is accessible to a member of the general public, both raw and labeled data are in the public domain.

For the record, "public domain" doesn't refer to data being publicly available, but rather being free of copyright. The data here is most likely copyrighted, as the Berne Convention protects everything with some modicum of creativity. (Learn more here.)

That being said, scraping copyrighted data is par for the course for large language models, including the pretrained models this project will be using, so it's Probably Fine™. If not, there's always StackExchange, where answers are explicitly licensed for reuse under CC BY-SA (same as Wikipedia, though the weak-copyleft aspect can be problematic).

hecko-yes avatar Feb 07 '23 13:02 hecko-yes

@Sobsz thank you for the suggestion! I was not able to find any information about forum data licensing on the website, but I've contacted the Unity Support team about this issue and I'm waiting for a response.

kaydotdev avatar Feb 09 '23 15:02 kaydotdev

I might be wrong here (I'm not a lawyer) but I'll weigh in:

Usually with web forums, the individual posts are the copyright of the users who posted them. As part of sign-up they give the forum owners a license to use their contributions, which may or may not be transferable. This is the great web2.0 data heist; the platforms that collected the data can sell it (or themselves) as they have permission to use it, while anyone else would have to contact each individual user and ask for explicit permission.

In reality, very few of those users know or care about that. They expected their posts to be publicly used, it's arguably fair use, and if someone does complain you can filter their posts out. Anonymize the poster IDs and put the anonymized ones in the metadata, so there's no PII. Using a subset rather than all the data might also help with fair use (using 10 posts someone put on a public forum is more likely to be fair use than using 1000 things they posted).
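As a rough illustration of the steps in the paragraph above (filter out posters who complain, cap how many posts are kept per poster, and store only an anonymized poster ID in the metadata), here is a minimal Python sketch. The record fields `user_id` and `text`, the function names, and the default cap of 10 are all assumptions made for the example; they are not taken from the actual raw.json or intents.json files.

```python
import hashlib
import hmac
from collections import defaultdict


def anonymize_id(user_id: str, secret: bytes) -> str:
    """Return a stable pseudonym for a poster ID using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def prepare_posts(posts, secret, opted_out=frozenset(), max_per_user=10):
    """Drop opted-out posters, cap posts per poster, and pseudonymize poster IDs.

    `posts` is assumed to be an iterable of dicts with hypothetical
    "user_id" and "text" keys.
    """
    kept = []
    counts = defaultdict(int)
    for post in posts:
        user_id = post["user_id"]
        if user_id in opted_out:
            continue  # this poster asked for their posts to be removed
        if counts[user_id] >= max_per_user:
            continue  # keep only a small subset per poster
        counts[user_id] += 1
        kept.append({
            "text": post["text"],
            # only the pseudonym is stored, never the original ID
            "metadata": {"poster": anonymize_id(user_id, secret)},
        })
    return kept
```

A keyed hash keeps the pseudonym stable across runs (so one poster's contributions can still be grouped or removed later) while making it hard to recover the original ID without the secret. Something like `prepare_posts(posts, secret=b"keep-this-private", opted_out={"some_user"})` would be a typical call.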

The forum owners may have additional rights over the entire collection a) in areas of the world where "database rights" are strong and b) where there's been creative input from a curator of said database. Posts in a manually created "best of" forum may have database/collection rights; community-moderated ones won't.

If you can get permission from the forum owner, then it can be released under whatever license they say. If not, and they don't assert any rights over the collection, I'd say go for it and see if anyone complains.

bitplane avatar Feb 13 '23 01:02 bitplane