FEAT Add NVIDIA AI Content Safety Dataset
Is your feature request related to a problem? Please describe.
Add this dataset to PyRIT; it is not currently part of PyRIT. https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0 NVIDIA's content safety model evaluation was performed using the content moderation benchmarks from this test set.
Describe the solution you'd like
The Hugging Face dataset: https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0 The associated paper: https://openreview.net/pdf?id=0MvGCv35wi
Additional context
Similar to previous dataset contributions, this should live in pyrit.datasets as a "fetch" function. The harm_categories property should also be set on each prompt; the Hugging Face dataset lists each prompt's harm categories in its "violated_categories" column. There are examples of how PyRIT interacts with other datasets here: https://github.com/search?q=repo%3AAzure%2FPyRIT%20%23%20The%20dataset%20sources%20can%20be%20found%20at%3A&type=code
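For reference, a rough sketch of what the fetch function could look like, following the pattern of the existing fetch functions linked above. This is a minimal sketch, not a finished implementation: the function name, the "train" split, the "prompt" and "prompt_label" column names, and the "unsafe" label value are assumptions to verify against the dataset card; only the "violated_categories" column is mentioned above.

```python
from datasets import load_dataset

from pyrit.models import SeedPrompt, SeedPromptDataset


def fetch_aegis_ai_content_safety_dataset_2() -> SeedPromptDataset:
    """Fetch NVIDIA's Aegis AI Content Safety Dataset 2.0 from Hugging Face (sketch)."""
    data = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0", split="train")

    prompts = []
    for item in data:
        # Per the discussion below, only the unsafe prompts are relevant.
        # "prompt_label" == "unsafe" is an assumed column name and label value.
        if item.get("prompt_label") != "unsafe":
            continue
        # "violated_categories" is assumed to be a comma-separated string.
        raw = item.get("violated_categories") or ""
        harm_categories = [c.strip() for c in raw.split(",") if c.strip()]
        prompts.append(
            SeedPrompt(
                value=item["prompt"],
                data_type="text",
                name="Aegis AI Content Safety Dataset 2.0",
                dataset_name="nvidia/Aegis-AI-Content-Safety-Dataset-2.0",
                harm_categories=harm_categories,
                source="https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0",
            )
        )

    return SeedPromptDataset(prompts=prompts)
```

The harm_categories parsing assumes a comma-separated string; if the column turns out to already be a list, the split can be dropped.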
[[Content Warning: Prompts are aimed at provoking the model and may contain offensive content.]] Additional disclaimer: given the content of these prompts, you may want to check with your relevant legal department before trying them against LLMs.
Is it fair to say that only the unsafe ones are relevant?
Agreed, only the unsafe ones are relevant :)
@romanlutz if no one else is working on this, I'd like to take this up!