llm-safety-benchmark topic

Repositories tagged with the llm-safety-benchmark topic:

EasyJailbreak

440 stars · 39 forks

An easy-to-use Python framework to generate adversarial jailbreak prompts.

resta

25 stars · 1 fork

Restore safety in fine-tuned language models through task arithmetic.

ALERT

30 stars · 7 forks

Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"