Open-Assistant
dataset: NSFW CSAM dataset from Reddit
NSFW - CSAM from Reddit
Note (TODO): this is only the pipeline; the dataset still needs to be scaled up with data from file.pushshift.io.
The scripts and notebooks in the directory are used to create the NSFW and CSAM dataset from Reddit that can be used to train the safety model.
Currently, the data is pulled from Reddit's API using PRAW. This has a lot of limitations, and I believe we can get more data by using file.pushshift.io, as mentioned in #53. If this dataset (and data format) is accepted, I will scale up the data ingestion, which should hopefully give us more data.
Data stored in: jjmachan/NSFW-questions
fixes: #1932
:x: pre-commit failed.
Please run pre-commit run --all-files locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md
Here's the link to the dataset aligned with the pro-social dataset used for safety bot training. https://huggingface.co/datasets/shahules786/prosocial-nsfw
Can we change the directory name to reddit_nsfw or similar please? There's no CSAM allowed on Reddit.