safe-rlhf topic

Repositories under the safe-rlhf topic

safe-rlhf

1.3k Stars · 119 Forks

Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

beavertails

105 Stars · 3 Forks

BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).

adversarial-reinforcement-learning

91 Stars · 5 Forks

A reading list on the adversarial perspective and robustness in deep reinforcement learning.