Open-Assistant
Consider using OpenAI Evals
As the only open-source component even tangentially related to GPT-4, OpenAI Evals can help assess an LLM's performance on a growing set of tasks. It could be used to measure how close OpenAssistant models have come to GPT-3.5 so far, to help test different fine-tuned models, or perhaps even as part of the RL pipeline.
https://github.com/openai/evals
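For reference, a custom eval in the Evals framework is essentially a JSONL file of samples plus a registry entry. A minimal sketch of generating such a samples file (the schema below matches the framework's basic exact-match evals; the question/answer content is purely illustrative):

```python
import json

# One sample for a basic exact-match eval: "input" is a chat-format
# prompt, "ideal" is the expected completion.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single word."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "ideal": "Paris",
    },
]

# Write the samples to a JSONL file that a registry YAML entry can point at.
with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

The resulting eval would then be run against a model with the `oaieval` CLI that ships with the repository.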
Yes, I think such an assessment would make sense for our 30B LLaMA-based model. If someone is interested in doing this, please let us know.
Hi, I'm making an objective benchmark suite for open-source LLMs, which currently includes MMLU and BBH which were used in the GPT-4 paper. I'm really excited for OA and would like to evaluate your 30B LLaMA model, could you publish the weights (delta/lora/original versions) on the HuggingFace hub?
https://github.com/declare-lab/flan-eval
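MMLU and BBH are multiple-choice benchmarks, and a common way to score open models on them is to take the argmax over per-choice likelihoods. A minimal sketch of that scoring loop; `score_fn` is a hypothetical stand-in for the evaluated model's per-choice log-probability interface:

```python
# MMLU-style multiple-choice scoring: ask the model for a log-probability
# of each answer choice and take the argmax as the prediction.

def mmlu_accuracy(dataset, score_fn):
    """dataset: iterable of (question, choices, correct_index) tuples.
    score_fn(question, choices) -> list of per-choice log-probs."""
    correct = total = 0
    for question, choices, answer_idx in dataset:
        logprobs = score_fn(question, choices)
        pred = max(range(len(choices)), key=lambda i: logprobs[i])
        correct += int(pred == answer_idx)
        total += 1
    return correct / total

# Toy demo with a stand-in "model" that simply prefers the longest choice.
toy_data = [
    ("Q1", ["a", "bbbb", "cc"], 1),
    ("Q2", ["dddd", "e", "ff"], 0),
]
longest = lambda q, cs: [len(c) for c in cs]
print(mmlu_accuracy(toy_data, longest))  # 1.0 on this toy set
```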
OA model weights will start being released on April 15th. I think we plan to release the LLaMA 30B deltas on that date.
Great, looking forward to it :)
Shouldn't their evals be used for training, not for evaluating?
I'm interested in this issue and have started working on it.
I have evaluated the OpenAssistant RLHF model and built a simple UI to view the scores as well as the raw outputs, since the scores on their own can be misleading about actual quality. The current version is here: https://tju01.github.io/oasst-openai-evals/. Click on a task name to see the evaluation details for that specific task. I still have multiple ideas for improvements, but first I have some questions related to the poor scores that the OpenAssistant model obtains.
- OpenAI Evals heavily uses a system message for the model. While OpenAI GPT models can handle this just fine, I'm not really sure how to translate this to OpenAssistant models. I'm currently using the `<|system|>` token since it seems like `oasst-rlhf-2-llama-30b-7k-steps` was trained with it (at least according to the `added_tokens.json` file), but I have doubts about whether that's how the `<|system|>` token is used in OpenAssistant models. Would it possibly be better to use `<|prefix_begin|>` and `<|prefix_end|>`? Are those still used in the current models?
- I have currently evaluated the `oasst-rlhf-2-llama-30b-7k-steps` model. Given that it is an RLHF model, I believe something like sampling `n` outputs and choosing the one with the best score according to the reward model might improve results. But I would need access to the corresponding reward model for that. Is it available somewhere?
- I'm not sure how good the RLHF OpenAssistant model actually is. Maybe the SFT models are actually better right now? Which SFT model is `oasst-rlhf-2-llama-30b-7k-steps` derived from, and did the RLHF step actually improve evaluation results? I know that, generally speaking, it is important.
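The sampling idea above (best-of-n / rejection sampling against the reward model) can be sketched as follows; `generate` and `reward` are hypothetical stand-ins for the policy model's sampler and the reward model's scoring call:

```python
import itertools

def best_of_n(prompt, generate, reward, n=16):
    """Sample n candidate completions and keep the one the reward
    model scores highest (best-of-n / rejection sampling)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

# Toy demo: the "generator" cycles through canned answers and the
# "reward model" simply prefers longer completions.
answers = itertools.cycle(["ok", "a longer answer", "mid size"])
gen = lambda prompt: next(answers)
rew = lambda prompt, completion: len(completion)
print(best_of_n("hello", gen, rew, n=3))  # "a longer answer"
```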
I've had my questions answered on the discord server. I have done the basic evaluation of multiple models, but there is lots of room for improvement. I'm going to continue at https://github.com/tju01/oasst-automatic-model-eval where I'm also going to add support for other evaluation benchmarks, see #1908.