FastChat
Evaluation on untruthful, harmful, toxic or sensitive questions
Hi Vicuna team,
Thanks for the great work pushing LLM fine-tuning a step further. This is especially impressive as a student/researcher-led initiative.
I found that most of the evaluation targets helpfulness. Do you have plans to evaluate on untruthful, harmful, toxic, or sensitive questions? That is the main benefit of RLHF, so I'm curious whether simple supervised fine-tuning on a pre-aligned GPT could also inherit the human preferences learned through RLHF in the original model (GPT-3.5).
@infwinston @suquark
Same question! Looking forward to the reply
Hey @ethanyanjiali, great question! Do you have any pointers or related references on how people do this kind of evaluation? Is there a public dataset?
Also, do you mean designing some "tricky" questions, or just running general benchmarks on known public datasets? For example, there are some known tricks (e.g., "hypnosis") that work effectively at getting around safety checks.
@infwinston I used the https://huggingface.co/datasets/Anthropic/hh-rlhf dataset when I did RLHF in my project here: https://github.com/ethanyanjiali/minChatGPT. I guess you could take the harmlessness part of this dataset for evaluation. The accompanying paper also discusses many other evaluation datasets; see Section 6 (page 29) of https://arxiv.org/pdf/2204.05862.pdf, for example the PALMS sensitive-questions dataset.
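If it helps, here is a minimal sketch of how the harmlessness prompts could be pulled out for evaluation, assuming the Hugging Face `datasets` library; the `generate` function below is a hypothetical placeholder for whatever inference entry point (e.g., a Vicuna endpoint served by FastChat) is being evaluated:

```python
from datasets import load_dataset


def extract_prompt(transcript: str) -> str:
    """Take the first human turn from an hh-rlhf dialogue transcript.

    Transcripts look like "\n\nHuman: ...\n\nAssistant: ..."
    """
    first_turn = transcript.split("\n\nAssistant:")[0]
    return first_turn.replace("\n\nHuman:", "").strip()


def generate(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real call to the model under test
    # (e.g., a Vicuna endpoint served by FastChat).
    return "<model response goes here>"


def main(num_samples: int = 100) -> None:
    # The harmlessness comparisons live under the "harmless-base" directory.
    ds = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base", split="test")
    for example in ds.select(range(num_samples)):
        prompt = extract_prompt(example["chosen"])
        response = generate(prompt)
        print({"prompt": prompt, "response": response})


if __name__ == "__main__":
    main()
```

From there, the prompt/response pairs could be reviewed manually or scored with a safety/preference classifier.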
For now, we have migrated our evaluation to MT-bench. In the short term, we do not have the plan or capacity to investigate the model's performance on untruthful, harmful, toxic, or sensitive questions, so this thread has become stale. Closing.