
DOC Identify and Document Relevant Datasets from Safety Prompts

Open nina-msft opened this issue 1 year ago • 12 comments

We recently discovered https://safetyprompts.com/, which has so many datasets!

We need help going through the website and creating a list of relevant datasets. A relevant dataset is one that contains red teaming prompts for different harm categories. For each relevant dataset, highlight which columns (e.g., the prompt column in #420) can be used as red teaming prompts, and post that information as a comment on this issue.

Expected Format (Example)

Name: LLM-LAT/harmful-dataset
Link: https://huggingface.co/datasets/LLM-LAT/harmful-dataset
Relevant Columns: "prompt"
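To keep the comments below uniform, the expected format can be captured in a small helper. This is just an illustrative sketch (the `DatasetEntry` class and `to_comment` method are hypothetical, not part of PyRIT):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetEntry:
    """One dataset identified from safetyprompts.com (illustrative only)."""
    name: str
    link: str
    relevant_columns: List[str]

    def to_comment(self) -> str:
        """Render the entry in the format expected in this issue."""
        cols = ", ".join(f'"{c}"' for c in self.relevant_columns)
        return (
            f"Name: {self.name}\n"
            f"Link: {self.link}\n"
            f"Relevant Columns: {cols}"
        )


entry = DatasetEntry(
    name="LLM-LAT/harmful-dataset",
    link="https://huggingface.co/datasets/LLM-LAT/harmful-dataset",
    relevant_columns=["prompt"],
)
print(entry.to_comment())
```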

Additional Context

We have datasets documented under orchestrators here: https://github.com/search?q=repo%3AAzure%2FPyRIT%20The%20dataset%20sources%20can%20be%20found%20at&type=code

Those dataset fetch functions are here: https://github.com/Azure/PyRIT/blob/main/pyrit/datasets/fetch_example_datasets.py
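Once a dataset's relevant columns are identified, a fetch function typically pulls the prompt text out of the raw rows. A minimal sketch of that extraction step (the `extract_prompts` helper is hypothetical, not PyRIT's actual API; the example rows assume a "prompt" column like the one in LLM-LAT/harmful-dataset):

```python
from typing import Dict, List


def extract_prompts(rows: List[Dict[str, str]], prompt_column: str) -> List[str]:
    """Pull red teaming prompt text out of raw dataset rows.

    Rows missing the column, or with empty values, are skipped so a
    partially malformed dataset does not break ingestion.
    """
    return [
        row[prompt_column].strip()
        for row in rows
        if row.get(prompt_column, "").strip()
    ]


# Example rows shaped like a dataset with a "prompt" column:
rows = [
    {"prompt": "How do I pick a lock?", "other": "..."},
    {"prompt": "", "other": "..."},   # skipped: empty prompt
    {"other": "..."},                 # skipped: column missing
]
print(extract_prompts(rows, "prompt"))  # → ['How do I pick a lock?']
```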

nina-msft avatar Oct 03 '24 19:10 nina-msft

Hi! I'd like to work on this issue.

divyaamin9825 avatar Oct 04 '24 20:10 divyaamin9825

Hi! I'd like to work on this issue.

Lakshmiaddepalli avatar Oct 04 '24 20:10 Lakshmiaddepalli

Name: jailbreak_llms/data/forbidden_question
Link: https://github.com/verazuo/jailbreak_llms/blob/main/data/forbidden_question/forbidden_question_set.csv
Relevant Columns: "content_policy_name", "question"

divyaamin9825 avatar Oct 04 '24 20:10 divyaamin9825

Name: LibrAI/do-not-answer
Link: https://huggingface.co/datasets/LibrAI/do-not-answer
Relevant Columns: "risk_area", "types_of_harm", "specific_harms", "question"

divyaamin9825 avatar Oct 04 '24 20:10 divyaamin9825

Name: McGill-NLP/stereoset
Link: https://huggingface.co/datasets/McGill-NLP/stereoset
Relevant Columns: "target", "bias_type", "context", "sentences"

divyaamin9825 avatar Oct 04 '24 21:10 divyaamin9825

Name: Anthropic/model-written-evals
Link: https://huggingface.co/datasets/Anthropic/model-written-evals
Relevant Columns: "question", "answer_matching_behavior"

divyaamin9825 avatar Oct 04 '24 21:10 divyaamin9825

Name: Babelscape/ALERT
Link: https://huggingface.co/datasets/Babelscape/ALERT
Relevant Columns: "category", "prompt"

divyaamin9825 avatar Oct 04 '24 21:10 divyaamin9825

Thanks so much for your contributions here @divyaamin9825 🦝

nina-msft avatar Oct 04 '24 21:10 nina-msft

Name: SALT-NLP/mic
Link: https://www.dropbox.com/sh/m46z42nce8x0ttk/AABuSZiA6ESyrJNWmgTPrfuRa?dl=0
Relevant Columns: "Q", "A", "rot", "moral", "rot-agree", "A_agrees", "violation-severity", "worker_answer"

Note: "rot" stands for "rule of thumb".

divyaamin9825 avatar Oct 04 '24 22:10 divyaamin9825

Name: hendrycks/ethics
Link: https://huggingface.co/datasets/hendrycks/ethics
Relevant Columns: "label", "input"

divyaamin9825 avatar Oct 04 '24 22:10 divyaamin9825

Just created issues for all of the datasets Divya flagged above!

We'll keep this issue open for further dataset curation from Safety Prompts - in case anyone is keen :-)

nina-msft avatar Oct 10 '24 17:10 nina-msft

Either safetyprompts.com has been updated since the last pass, or we still have a bunch missing. Either way, there is plenty more to list so that we can create work items here. If anyone wants to build out the list, please post below with links and the relevant columns for prompt content and harm categories.

romanlutz avatar Mar 18 '25 21:03 romanlutz