dataset: NSFW CSAM dataset from Reddit

Open jjmachan opened this issue 2 years ago • 5 comments

NSFW - CSAM from Reddit

Note (TODO): this is only the pipeline; the dataset will need to be scaled up by pulling data from file.pushshift.io.

The scripts and notebooks in this directory create the NSFW and CSAM dataset from Reddit, which can be used to train the safety model.

Currently, the data is pulled from Reddit's API using PRAW. This has a lot of limitations, and I believe we can get more data by using file.pushshift.io, as mentioned in #53. If this dataset (and its data format) is accepted, I will scale up the data ingestion, which should hopefully yield more data.

Data stored in: jjmachan/NSFW-questions

fixes: #1932

jjmachan avatar Mar 04 '23 23:03 jjmachan

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

github-actions[bot] avatar Mar 04 '23 23:03 github-actions[bot]

Here's the link to the dataset, aligned with the pro-social dataset used for safety bot training: https://huggingface.co/datasets/shahules786/prosocial-nsfw

shahules786 avatar Mar 05 '23 06:03 shahules786

Can we change the directory name to reddit_nsfw or similar please? There's no CSAM allowed on Reddit.

bitplane avatar Mar 10 '23 15:03 bitplane