Yann Dubois comments

Results: 52 comments of Yann Dubois

Possibility of adding a version signature

Todo:
- [ ] add a signature to the CSV leaderboard so that people can report it in papers and make sure results are comparable.
- [ ] print the signature...

Possibility of adding a version signature

I'm closing as I don't think that I will be able to add this feature unfortunately.

Overly High Win Rate for Alpaca v2 on mistral 7b orpo

This is very surprising indeed. Just to understand, why are you not using the default alpaca_eval 2, i.e. `alpaca_eval evaluate_from_model --model_configs 'mistral-7b-orpo'`? Is the issue that you don't have access...

Overly High Win Rate for Alpaca v2 on mistral 7b orpo

My bad @qingquansong, use `alpaca_eval evaluate_from_model --model_configs 'mistral-7b-orpo' --annotators_config 'alpaca_eval_gpt4_turbo_fn'`, which doesn't require logprobs.

Overly High Win Rate for Alpaca v2 on mistral 7b orpo

@hungchiayu1 that's very surprising, what are the two deployment names and how do they differ?

Overly High Win Rate for Alpaca v2 on mistral 7b orpo

@qingquansong are you using the OpenAI API directly? My guess in all the above is that the issue comes from using the wrong models & API deployment. Please run it...

Reproducing the results on transfer tasks

That's strange... The model that I used to get Table 5 (i.e. 93.6 on CIFAR10) is `dissl_resnet50_d8192_e400_m6`. Can you check whether you can reproduce the results with that model?

[New Task] Add AlpacaEval LC

I saw the PR, it looks great and homogeneity definitely makes sense. Adding AlpacaEval might require a few changes for homogenization though. The pipeline for AlpacaEval at a high level...

[New Task] Add AlpacaEval LC

Great to know that there's a place for a corpus-level function. I can write a minimal `length_controlled_mean` when the time comes. Let me know if you have questions for...

[New Task] Add AlpacaEval LC

Hey @clefourrier! So the current JudgeOpenAI still seems pretty specialized to MT-bench. E.g. it makes a few assumptions that will not be true for AlpacaEval and more generally for other...