tofu
Which dataset should we use for evaluation?
Which dataset config was used for the leaderboard? Should I use forget10_perturbed, forget10, or retain90?
If I use the forget10 dataset, how should I set perturbed_answer_key and eval_task?