
Model evaluation

sanagno opened this issue 1 year ago • 20 comments

We need a better evaluation pipeline to quantify model performance and compare models with each other. Some ideas include:

  • Evaluating on datasets for which we already have the ChatGPT reference, e.g. HC3 (a rough sketch is below).
  • Using a fine-tuned reward model.
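
A rough sketch of the first idea, assuming HC3 is loaded from the Hugging Face hub and our model's answers are compared to the ChatGPT reference answers with sentence-embedding similarity (the dataset id, config, column names and embedding model below are assumptions, not a settled design):

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

def hc3_similarity(generate_fn, max_samples: int = 100) -> float:
    """Mean cosine similarity between our model's answers and HC3's ChatGPT answers."""
    ds = load_dataset("Hello-SimpleAI/HC3", "all", split="train")  # assumed dataset id/config
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    sims = []
    for row in ds.select(range(max_samples)):
        if not row["chatgpt_answers"]:         # some rows may lack a ChatGPT answer
            continue
        reference = row["chatgpt_answers"][0]      # HC3 stores a list of ChatGPT answers
        prediction = generate_fn(row["question"])  # our model's reply to the same prompt
        emb = embedder.encode([reference, prediction], convert_to_tensor=True)
        sims.append(util.cos_sim(emb[0], emb[1]).item())
    return sum(sims) / len(sims)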

sanagno avatar Feb 27 '23 20:02 sanagno

If we want to test knowledge/retrieval etc., there are some good evaluations on things like 'Wizard of Wikipedia' or 'Wizard of the Internet'. If we want to test conversational skills, there is 'Blended Skill Talk' or 'ConvAI2'. There are also things like 'EmpatheticDialogues' to test empathy, for instance.

sanagno avatar Mar 01 '23 20:03 sanagno

Adding an instruction-evaluation dataset to this thread: https://arxiv.org/pdf/2204.07705.pdf

pruksmhc avatar Mar 02 '23 00:03 pruksmhc

I'm interested to help with this, but have my hands full for the next week or so.

pruksmhc avatar Mar 02 '23 00:03 pruksmhc

Is this task being worked on right now?

totuta avatar Mar 14 '23 20:03 totuta

Not as far as I know :)

sanagno avatar Mar 14 '23 21:03 sanagno

Do you still need a contributor?

adendek avatar Mar 14 '23 21:03 adendek

Yes, please. If anyone is interested I can assign you to it.

sanagno avatar Mar 14 '23 21:03 sanagno

I'll start working on this tomorrow, but I don't want to stop or delay anyone who wants to contribute as well.

adendek avatar Mar 16 '23 11:03 adendek

For using an RM to evaluate LLM results, I have pushed a framework to do this in a sister repo: https://github.com/Open-Assistant/oasst-model-eval/tree/main/model_eval/scoring. I can replicate the same here if that's something we consider useful. @sanagno
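
The general idea, roughly (a minimal sketch, not the actual oasst-model-eval code; the checkpoint name is a placeholder and the exact prompt/reply formatting has to match how the reward model was trained):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint name; the real RM checkpoint comes from the RM training run.
RM_NAME = "path/to/reward-model"
tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME).eval()

def reward_score(prompt: str, reply: str) -> float:
    """Score one (prompt, reply) pair with the reward model; higher means preferred."""
    # How prompt and reply are concatenated/templated must match RM training.
    inputs = tokenizer(prompt, reply, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits.squeeze().item()

Comparing two candidate assistants then reduces to comparing their mean reward over a shared prompt set.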

shahules786 avatar Mar 17 '23 04:03 shahules786

@sanagno @adendek hey, I'm interested in helping you guys with this. How can I help?

xrsrke avatar Mar 17 '23 05:03 xrsrke

Can someone like @sanagno confirm if the code provided by @shahules786 is sufficient? @shahules786, you can create a PR which can be reviewed. If this happens, I will take care of a different task.


However, my initial plan was as follows:

  • [x] Clone the repo.
  • [ ] Try to build and run the model on my own hardware or in Docker.
  • [ ] Finish reading the project documentation and contribution page.
  • [ ] Find the location where the code should be located.
  • [ ] Implement some abstractions and expandable pipelines to ensure we can easily extend the set of evaluation metrics and methods.
  • [ ] Implement initial unit tests, since TDD rocks.
  • [x] Review the paper provided by @pruksmhc to find an initial set of metrics to be used.
  • [ ] Implement evaluation metrics and create a work-in-progress PR to get quick feedback from the community.

@xrsrke if you want to participate, you can start on your own; if you have any thoughts, you can share them in this thread. Any help and suggestions will be appreciated.


Let me know if this makes any sense.

Adam

adendek avatar Mar 17 '23 09:03 adendek

Thanks @adendek. I will start reading the paper tomorrow, follow @adendek's initial plan, and share my progress here!

xrsrke avatar Mar 17 '23 09:03 xrsrke

After checking out the Open-Assistant and oasst-model-eval repositories (the latter suggested by @shahules786), I found that we currently only have a sampling report and no evaluation on datasets.

So, my initial plan is to create evaluation scripts that can be extended to different datasets and metrics. @adendek, we should consider splitting the work to avoid duplication of effort.


My summary of the paper suggested by @pruksmhc: the paper provides a dataset for evaluating generalization in RLHF models, and uses ROUGE-L as the evaluation metric.

The dataset contains 1,616 NLP tasks with natural language instructions, covering 76 broad task types and spanning 55 different languages.

The evaluation pipeline in the paper:

  1. Compute the ROUGE-L between the model output and the reference (a minimal sketch is below).
  2. Conduct a human evaluation to assess the quality of the generated outputs.
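
A minimal ROUGE-L sketch using the rouge_score package (the helper name and example data are just illustrative):

from rouge_score import rouge_scorer

def rouge_l(reference: str, prediction: str) -> float:
    """ROUGE-L F1 between a reference answer and a model output."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, prediction)["rougeL"].fmeasure

# Example: average over a small list of (reference, prediction) pairs.
pairs = [("Paris is the capital of France.", "The capital of France is Paris.")]
print(sum(rouge_l(r, p) for r, p in pairs) / len(pairs))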

xrsrke avatar Mar 18 '23 04:03 xrsrke

@xrsrke great job! I updated the roadmap from my previous message. The point regarding the paper review can be considered done.

Now, the highest priority is to understand the Open-Assistant part. We need to find where the evaluation code should be placed, and how to get a model to evaluate against the tasks. We need a method or class that will implement the following interface:

import torch
from torch.utils.data import Dataset


def evaluation_task(model: torch.nn.Module, dataset: Dataset) -> float:
    """Evaluate the model on a given task represented as a PyTorch Dataset.

    Parameters
    ----------
    model : torch.nn.Module
        The model to be evaluated.
    dataset : Dataset
        Instance of a PyTorch dataset that contains the data for a given evaluation task.

    Returns
    -------
    float
        The score of the model on this task.
    """
    # Implementation goes here.
    raise NotImplementedError

Do we use NumPy-style or Google-style docstrings?


"evaluation scripts that can be extended to different datasets and metrics"

I'd rather say we need a small extendable framework rather than a script. I prefer the OOP approach with proper interfaces and unit tests. There is also the issue of task orchestration.

Do we need to create our own solution or leverage an existing one like Airflow, Kedro, or Luigi? Where should this evaluation pipeline be executed? A GitHub workflow?
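
To make the "small extendable framework" idea concrete, one possible shape (a rough sketch only; the class and method names are made up for illustration, not an agreed design):

from abc import ABC, abstractmethod

import torch
from torch.utils.data import Dataset


class EvaluationTask(ABC):
    """Base class for one evaluation task (dataset + metric)."""

    name: str = "base"

    @abstractmethod
    def load_dataset(self) -> Dataset:
        """Return the PyTorch dataset backing this task."""

    @abstractmethod
    def score(self, model: torch.nn.Module, dataset: Dataset) -> float:
        """Compute the task metric for the given model."""

    def run(self, model: torch.nn.Module) -> float:
        return self.score(model, self.load_dataset())

New tasks would then just subclass EvaluationTask, which keeps the set of metrics easy to extend and to unit-test in isolation.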


What do you think about this?

adendek avatar Mar 18 '23 09:03 adendek

Hey, the code from @shahules786 seems to take care of the second part of the evaluation, that is great.

Regarding the first point, i.e. evaluating on some known datasets, your suggestions seem great.

I think the most suitable location for the code is under /model/model_eval.

sanagno avatar Mar 18 '23 09:03 sanagno

@adendek @sanagno I can raise the PR for using the RM to evaluate the RLHF fine-tuned model. There is some refactoring going on with reward-model training in #2071.

shahules786 avatar Mar 18 '23 16:03 shahules786

@shahules786 sure, do this.

And I will take care of the rest of the framework. You should expect a WIP (work in progress) PR very soon, like Tuesday or Wednesday.

adendek avatar Mar 19 '23 17:03 adendek

@adendek Something just came up with my work, and I won't be able to join anymore. Sorry for the inconvenience.

xrsrke avatar Mar 19 '23 21:03 xrsrke

@sanagno @adendek I'm waiting for the RM training team to release loadable model weights. This might take some days. I am done with the rest of the code anyway; as soon as the model is ready I'll raise the PR.

shahules786 avatar Mar 20 '23 15:03 shahules786

Reward-model-based evaluation is now available, thanks @shahules786.

@adendek, are you working on other benchmarks, e.g. evaluating the instruction-following capabilities?

andreaskoepf avatar May 05 '23 11:05 andreaskoepf

I'm going to work on this now over at https://github.com/tju01/oasst-automatic-model-eval. I started with the OpenAI evals framework, which also had its own issue #2348, but I'm going to extend that code to also cover other benchmarks now.

tju01 avatar May 20 '23 17:05 tju01

@tju01 has generated https://tju01.github.io/ilm-eval/

andreaskoepf avatar Jun 14 '23 09:06 andreaskoepf