
feat: Add Weights & Biases logging and reporting

parambharat opened this pull request • 15 comments

This PR adds functionality to log evaluation results to Weights & Biases.

In addition to logging evaluation runs and tables, we also add functionality to auto-generate an easily shareable Weights & Biases report.

An example auto-generated report can be seen here.

To install this functionality, please run pip install "lm_eval[wandb]"
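
For readers who want a feel for what the integration does, here is a minimal sketch of logging harness results to W&B by hand. The exact simple_evaluate arguments depend on the harness version, and the project name and task list below are purely illustrative:

```python
import wandb
from lm_eval import evaluator

# Run a standard harness evaluation. simple_evaluate returns a dict whose
# "results" key maps each task name to its metric values.
results = evaluator.simple_evaluate(
    model="gpt2",                    # model type/name; varies by harness version
    tasks=["lambada", "hellaswag"],  # illustrative task list
)

# Flatten the per-task metrics into rows and log them as a W&B table.
run = wandb.init(project="lm_eval", job_type="evaluation")
rows = [
    [task, metric, value]
    for task, metrics in results["results"].items()
    for metric, value in metrics.items()
]
run.log({"evaluation/results": wandb.Table(columns=["task", "metric", "value"], data=rows)})
run.finish()
```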

parambharat avatar Nov 25 '22 06:11 parambharat

This looks like a great start, and is something that’s been on my mental to-do list for quite a while.

One thing we often do is evaluate several models on the same datasets and then compile a report comparing them. If you could add a function that takes a set of reports and produces a combined comparison report, that would greatly increase the value this library provides to its users.

Here’s an example of the kind of plot I have in mind.

[attached image: example model-comparison plot]
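
One possible shape for such a function, sketched against the public wandb.Api. The helper name and entity/project arguments are hypothetical, and it assumes each evaluation run stores its task metrics in the run summary:

```python
import pandas as pd
import wandb

def compile_comparison(entity: str, project: str) -> pd.DataFrame:
    """Fetch every run in a W&B project and pivot the numeric summary
    metrics into a model-by-metric table, ready to plot as grouped bars."""
    api = wandb.Api()
    rows = []
    for run in api.runs(f"{entity}/{project}"):
        for key, value in run.summary.items():
            if isinstance(value, (int, float)):
                rows.append({"model": run.name, "metric": key, "value": value})
    frame = pd.DataFrame(rows)
    return frame.pivot_table(index="model", columns="metric", values="value", aggfunc="last")

# e.g. compile_comparison("my-team", "lm_eval").plot.bar()
```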

StellaAthena avatar Nov 25 '22 19:11 StellaAthena

@StellaAthena: Thanks for the input. Currently, the integration logs a single run and creates a report for an evaluation run linked to a specific model and its relevant tasks. I love what you've suggested and was thinking along the same lines. However, to achieve that we will need access either to existing runs and reports (formatted in a specific way) or to logged evaluation tables that can be compiled into the right kind of graphs in a report. With these in place, comparing different models/reports becomes possible. I will work on this feature and raise a PR for it soon.

parambharat avatar Nov 27 '22 15:11 parambharat

@parambharat The Eval Harness has default local logging that should provide all the necessary information. Alternatively, for a purely web-based solution that may work better in a distributed setting, perhaps we can assume that your WandB logging code is used to log evaluation results to a WandB project, and that the goal is to compare (groups of) runs across that project? This sounds sufficiently similar to the original use case for WandB reports that it should be pretty easy to implement.

StellaAthena avatar Nov 28 '22 19:11 StellaAthena

@parambharat any updates on this?

StellaAthena avatar Dec 26 '22 23:12 StellaAthena

Hi @StellaAthena, yes, this is WIP with a few changes pending. I'm currently on vacation and will be back by the end of the week. I'll commit the changes as soon as I'm back. Sorry for not dropping the update here earlier.

parambharat avatar Dec 27 '22 02:12 parambharat

Wonderful! Can’t wait to see it.

Have a nice vacation.

StellaAthena avatar Dec 27 '22 02:12 StellaAthena

How does this sound? @parambharat

satpalsr avatar Dec 27 '22 07:12 satpalsr

@satpalsr: Thank you for the link to your colab and the corresponding workspace. It was really helpful in moving this PR forward in the direction @StellaAthena had initially wanted. @StellaAthena: Let me know if the changes I've pushed work. Due to current constraints in the wandb Reports API, I was only able to add the comparison plots to a run and link to that run from the report; the necessary plots can then be imported into the report easily. I have added a how-to note for this in the auto-generated report. Here's the sample report: https://wandb.ai/parambharat/lm_eval/reports/-2023-01-03-09-06-02-Model-comparison-report--VmlldzozMjU0Mzg5 and the corresponding run: https://wandb.ai/parambharat/lm_eval/runs/1awvscu4
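
For reference, attaching a comparison bar chart to a run (which can then be imported into a Report from the run's workspace) can be done with wandb.plot.bar. The model names and accuracy numbers below are purely illustrative:

```python
import wandb

# Log a per-model accuracy comparison as a bar chart attached to a run.
run = wandb.init(project="lm_eval", job_type="comparison")
table = wandb.Table(
    columns=["model", "accuracy"],
    data=[["gpt2", 0.326], ["gpt-neo-1.3B", 0.387]],  # illustrative numbers
)
run.log({"model_comparison": wandb.plot.bar(table, "model", "accuracy", title="Model comparison")})
run.finish()
```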

parambharat avatar Jan 03 '23 09:01 parambharat

Hi @StellaAthena: Do you have any feedback on this? Can you please take a look and see if this works?

parambharat avatar Jan 17 '23 08:01 parambharat

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ parambharat
✅ StellaAthena
❌ Bharat Ramanathan


Bharat Ramanathan does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Apr 23 '23 02:04 CLAassistant

Hi, is there any progress on this?

mkeoliya avatar Dec 10 '23 02:12 mkeoliya

We're moving towards supporting Zeno for visualizations, but feel free to reopen this if you think it's still worth doing. Also note that we've changed our default branch.

lintangsutawika avatar Dec 12 '23 14:12 lintangsutawika

@lintangsutawika I don't see any harm in supporting both, and lots of people (including at EleutherAI!) regularly use WandB for logging.

StellaAthena avatar Dec 12 '23 16:12 StellaAthena

@mkeoliya I am currently at NeurIPS, but looking at this is on my to-do list for next week. One thing to note is that we just pushed a major backend update; I assume some changes (hopefully minor) will be necessary to ensure compatibility with main. Would you be able to do a pass at addressing that?

StellaAthena avatar Dec 12 '23 16:12 StellaAthena

@mkeoliya I caught COVID at NeurIPS and am unlikely to get to this before EOY.

StellaAthena avatar Dec 20 '23 19:12 StellaAthena

Closing as superseded by https://github.com/EleutherAI/lm-evaluation-harness/pull/1339

StellaAthena avatar Jan 25 '24 20:01 StellaAthena