Evaluation on untruthful, harmful, toxic or sensitive questions

Open ethanyanjiali opened this issue 1 year ago • 5 comments

Hi Vicuna team,

Thanks for the great work pushing LLM fine-tuning a step further. This is especially amazing as a student/research-led initiative.

I found that most of the evaluation targets helpfulness. Do you have a plan to evaluate on untruthful, harmful, toxic, or sensitive questions? Handling those is the main benefit of RLHF, so I'm curious whether simply doing supervised fine-tuning on outputs from a pre-aligned GPT model could also inherit the human preferences learned through RLHF in the original model (GPT-3.5).

ethanyanjiali avatar Apr 01 '23 00:04 ethanyanjiali

@infwinston @suquark

merrymercy avatar Apr 05 '23 07:04 merrymercy

Same question! Looking forward to the reply

S1s-Z avatar Apr 06 '23 03:04 S1s-Z

Hey @ethanyanjiali, great question! Do you have any pointers or related references on how people do such evaluations? Is there a public dataset?

infwinston avatar Apr 06 '23 03:04 infwinston

Also, do you mean designing some 'tricky' questions, or just running general benchmarks on known public datasets? For example, there are some known tricks (e.g., "hypnosis") that work effectively at getting around safety checks.

suquark avatar Apr 06 '23 03:04 suquark

@infwinston I used the https://huggingface.co/datasets/Anthropic/hh-rlhf dataset when I did RLHF in my project, minChatGPT (https://github.com/ethanyanjiali/minChatGPT). I guess you could take the harmfulness part of this dataset for evaluation. The accompanying paper also discusses many other evaluation datasets; see page 29, Section 6 of https://arxiv.org/pdf/2204.05862.pdf (for example, the PALMS dataset for sensitive questions).
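
A rough sketch of how prompts could be pulled from that dataset for such an evaluation, not a tested pipeline. It assumes the Hugging Face `datasets` library, the `harmless-base` subdirectory listed on the dataset card, and the card's transcript format with `\n\nHuman:` / `\n\nAssistant:` turn markers. The helper names `split_final_response` and `load_harmless_prompts` are made up for illustration, and how the model's answers get scored is left open.

```python
from typing import Dict, List

# Turn marker used in hh-rlhf transcripts (per the dataset card).
ASSISTANT = "\n\nAssistant:"


def split_final_response(transcript: str) -> Dict[str, str]:
    """Split a transcript into the prompt (up to and including the last
    assistant marker) and the final assistant reply."""
    idx = transcript.rfind(ASSISTANT)
    if idx == -1:
        raise ValueError("no assistant turn found in transcript")
    cut = idx + len(ASSISTANT)
    return {"prompt": transcript[:cut], "response": transcript[cut:].strip()}


def load_harmless_prompts(split: str = "test", limit: int = 100) -> List[str]:
    """Hypothetical helper: fetch prompts from the harmlessness subset.

    Requires network access and `pip install datasets`.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    ds = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base", split=split)
    rows = ds.select(range(min(limit, len(ds))))
    # Each row has a "chosen" and a "rejected" transcript sharing one prompt.
    return [split_final_response(row["chosen"])["prompt"] for row in rows]


# Offline example on a made-up transcript in the same format.
example = "\n\nHuman: How do I pick a lock?\n\nAssistant: I can't help with that."
parts = split_final_response(example)
```

The extracted prompts could then be fed to the model under test, with its replies compared against the `chosen` and `rejected` responses by a human or an automatic judge.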

ethanyanjiali avatar Apr 07 '23 18:04 ethanyanjiali

For now, we have migrated our evaluation to MT-bench. In the short term, it seems we do not have the plan or capacity to investigate the model's performance on untruthful, harmful, toxic, or sensitive questions, so this thread has become stale. Closing.

zhisbug avatar Jul 05 '23 19:07 zhisbug