Open-Assistant
Model evaluation
We need a better evaluation pipeline to better quantify model performance and compare models with each other. Some ideas include:
- Evaluating on datasets for which we already have ChatGPT reference answers, e.g. HC3 (see the loading sketch after this list).
- Using a fine-tuned reward model.
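For the HC3 idea, a minimal loading sketch; the Hugging Face dataset id, config, and field names below are assumptions and may need adjusting:

```python
from datasets import load_dataset

# Assumed dataset id/config; HC3 stores each question together with human
# and ChatGPT answers, which gives a ready-made reference for comparison.
hc3 = load_dataset("Hello-SimpleAI/HC3", "all", split="train")

sample = hc3[0]
question = sample["question"]
chatgpt_reference = sample["chatgpt_answers"][0]
# generate our model's answer for `question` here and compare it against
# `chatgpt_reference`, e.g. with ROUGE-L or a fine-tuned reward model
```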
If we want to test knowledge/retrieval etc., there are some good evaluations such as 'Wizard of Wikipedia' or 'Wizard of the Internet'. If we want to test conversational skills there are 'Blended Skill Talk' or 'ConvAI2'. There are also datasets like 'EmpatheticDialogues' to test empathy, for instance.
Adding an instruction-evaluation dataset to this thread: https://arxiv.org/pdf/2204.07705.pdf
I'm interested to help with this, but have my hands full for the next week or so.
Is this task being worked on right now?
Not as far as I know :)
Do you still need a contributor?
Yes, please. If anyone is interested I can assign you to it.
I'll start working on this tomorrow, but I don't want to stop or delay anyone who wants to contribute as well.
For using an RM to evaluate LLM results, I have pushed a framework to do this in a sister repo: https://github.com/Open-Assistant/oasst-model-eval/tree/main/model_eval/scoring. I can replicate the same here if that's something we consider useful. @sanagno
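For readers who want the gist of the RM-scoring idea without opening the sister repo, a rough sketch follows; the checkpoint id and the prompt/reply formatting are illustrative assumptions, not the actual oasst-model-eval code:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint id; the input formatting expected by the real
# reward model may differ from this simple text-pair encoding.
RM_NAME = "some-org/oasst-reward-model"
tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)
reward_model.eval()


def reward_score(prompt: str, reply: str) -> float:
    """Return the scalar reward the RM assigns to a (prompt, reply) pair."""
    inputs = tokenizer(prompt, reply, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()
```

A higher average reward over a fixed prompt set would then indicate a better model, which covers the second idea from the top of the thread.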
@sanagno @adendek hey, I'm interested in helping you guys with this. How can I help?
Can someone like @sanagno confirm whether the code provided by @shahules786 is sufficient? You, @shahules786, can create a PR which can be reviewed. If this happens, I will take care of a different task.
However, my initial plan was as follows:
- [x] Clone the repo.
- [ ] Try to build and run the model on my own hardware or docker.
- [ ] Finish reading project documentation, and contribution page.
- [ ] Find the location where the code should be located.
- [ ] Implement some abstractions and expandable pipelines to ensure we can easily extend the set of evaluation metrics and methods.
- [ ] Implement initial unit tests, since TDD rocks.
- [x] Review the paper provided by @pruksmhc to find an initial set of metrics to be used.
- [ ] Implement evaluation metrics and create a work-in-progress PR to get quick feedback from the community.
@xrsrke if you want to participate, you can start on your own; if you have any thoughts, you can share them in this thread. Any help and suggestions will be appreciated.
Let me know if this makes any sense.
Adam
Thanks @adendek. I will start reading the paper tomorrow, follow @adendek 's initial plan and share my progress here!
After checking out the Open-Assistant and oasst-model-eval repositories (the latter suggested by @shahules786), I found that we currently only have a sampling report and no evaluation on datasets.
So, my initial plan is to create evaluation scripts that can be extended to different datasets and metrics. @adendek, we should consider splitting the work to avoid duplication of effort.
My summary of the paper suggested by @pruksmhc:
The paper provides a dataset for evaluating how well instruction-following models generalize. They use ROUGE-L as the evaluation metric.
The dataset contains 1,616 NLP tasks with natural language instructions, covering 76 broad task types and spanning 55 different languages.
The evaluation pipeline in the paper:
- They compute ROUGE-L between the model output and the reference (a small sketch of this step follows below).
- They then conduct a human evaluation to assess the quality of the generated outputs.
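For reference, the ROUGE-L step could be reproduced with the `evaluate` library; the snippet below is only a sketch, and its aggregation may differ from the paper's exact setup:

```python
import evaluate

rouge = evaluate.load("rouge")


def rouge_l(predictions: list[str], references: list[str]) -> float:
    """Aggregate ROUGE-L between model outputs and reference outputs."""
    scores = rouge.compute(predictions=predictions, references=references)
    return scores["rougeL"]


# e.g. rouge_l(["The cat sat on the mat."], ["A cat was sitting on the mat."])
```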
@xrsrke great job! I updated the roadmap in my previous message. The point regarding the paper review can be considered done.
Now, the highest priority is to understand the Open-Assistant part. We need to find out where the evaluation code should be placed and how to obtain a model to validate against the tasks. We need a method or class that implements the following interface:
```python
import torch
from torch.utils.data import Dataset


def evaluation_task(model: torch.nn.Module, dataset: Dataset) -> float:
    """Evaluate the model on a given task represented as a PyTorch Dataset.

    Parameters
    ----------
    model : torch.nn.Module
        The model to be evaluated.
    dataset : Dataset
        Instance of a PyTorch dataset that contains the data for a given
        evaluation task.

    Returns
    -------
    float
        The task score.
    """
    # implementation goes here
    raise NotImplementedError
```
Do we use NumPy-style or Google-style docstrings?
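For illustration, a hypothetical concrete task following this interface might look like the sketch below; the metric and the assumed Hugging Face-style model outputs are placeholders, not a decision:

```python
import torch
from torch.utils.data import DataLoader, Dataset


def language_modelling_task(model: torch.nn.Module, dataset: Dataset) -> float:
    """Example task: mean language-modelling loss over a tokenized dataset.

    Assumes each dataset item is a dict with "input_ids" and "labels"
    tensors and that the model returns an object with a ``loss`` attribute
    (as Hugging Face causal LMs do); both are assumptions for the example.
    """
    model.eval()
    loader = DataLoader(dataset, batch_size=8)
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for batch in loader:
            output = model(input_ids=batch["input_ids"], labels=batch["labels"])
            total_loss += output.loss.item()
            n_batches += 1
    return total_loss / max(n_batches, 1)
```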
> evaluation scripts that can be extended to different datasets and metrics
I'd rather say we need a small, extendable framework rather than a script. I prefer the OOP approach with proper interfaces and unit tests. There is also the issue of task orchestration.
Do we need to create our own solution or leverage an existing one like Airflow, Kedro, or Luigi? And where should this evaluation pipeline run, a GitHub workflow?
What do you think about this?
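To make the "small, extendable framework" idea more concrete, here is a rough sketch of an interface-plus-registry design; all names are placeholders, each new benchmark would only need to subclass the base class, and the runner could be wired into whatever orchestration we pick:

```python
from abc import ABC, abstractmethod

import torch
from torch.utils.data import Dataset


class EvaluationTask(ABC):
    """One benchmark: subclasses supply the data and the metric."""

    name: str = "unnamed-task"

    @abstractmethod
    def load_data(self) -> Dataset:
        ...

    @abstractmethod
    def evaluate(self, model: torch.nn.Module, dataset: Dataset) -> float:
        ...


TASK_REGISTRY: list[type[EvaluationTask]] = []


def register_task(cls: type[EvaluationTask]) -> type[EvaluationTask]:
    """Decorator so new tasks are picked up by the runner automatically."""
    TASK_REGISTRY.append(cls)
    return cls


def run_all(model: torch.nn.Module) -> dict[str, float]:
    """Entry point that a CI job (e.g. a GitHub workflow) could call."""
    results: dict[str, float] = {}
    for task_cls in TASK_REGISTRY:
        task = task_cls()
        results[task.name] = task.evaluate(model, task.load_data())
    return results
```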
Hey, the code from @shahules786 seems to be taking care of the second part of the evaluation, which is great.
Regarding the first point, i.e. evaluating on some known datasets, your suggestions seem great.
I think the most suitable location for the code is under `model/model_eval`.
@adendek @sanagno I can raise the PR for using the RM to evaluate the RLHF fine-tuned model. There is some refactoring going on with reward-model training (#2071).
@shahules786 sure, do this.
And I will take care of the rest of the framework. You should expect a WIP (work-in-progress) PR very soon, around Tuesday or Wednesday.
@adendek Something just came up with my work, and I won't be able to join anymore. Sorry for the inconvenience.
@sanagno @adendek I'm waiting for the RM training team to release loadable model weights. This might take some days. I am done with the rest of the code in any case. As soon as the model is ready, I'll raise the PR.
Reward-model-based evaluation is now available, thanks @shahules786.
@adendek, are you working on other benchmarks, e.g. evaluating the instruction-following capabilities?
I'm going to work on this now over at https://github.com/tju01/oasst-automatic-model-eval. I started with the OpenAI Evals framework, which also had its own issue (#2348), but I'm going to extend that code to also cover other benchmarks now.
@tju01 has generated https://tju01.github.io/ilm-eval/