safe-rlhf topic
safe-rlhf repositories
safe-rlhf
1.3k stars · 119 forks
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
beavertails
105 stars · 3 forks
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
adversarial-reinforcement-learning
91 stars · 5 forks
Reading list for adversarial perspective and robustness in deep reinforcement learning.