
Overly High Win Rate for Alpaca v2 on mistral 7b orpo

Open qingquansong opened this issue 1 year ago • 4 comments

Hey Team,

We're running some experiments with mistral 7b orpo and variants, but found that using GPT-4-1106-preview as the baseline plus OpenAI GPT-4 as the judge produces overly high results:

INFO:root:Not saving the result to the cached leaderboard because precomputed_leaderboard is not a path but <class 'NoneType'>.
                 length_controlled_winrate  win_rate  standard_error  n_total  avg_length
mistral-7b-orpo                      36.53     54.43            1.76      802        1610

Alpaca 1.0 results are normal (~87-88% vs. text-davinci-003), so the model config and outputs seem fine, I guess? Would you mind suggesting what mistakes I might be making when running AlpacaEval 2.0? (One more side note: the GLM regression gives a warning, "WARNING:root:Length controlled win rate is very different from the raw one: 36.5% vs 54.4%. This might be a sign of failure of the GLM.", so I'm not sure whether that is also a problem that needs fixing.)
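For context on that warning: the length-controlled win rate is computed by fitting a small GLM that separates a model-quality term from a length-bias term, then reporting the predicted win rate with the length term zeroed out. The following is only a toy sketch of that idea with synthetic, hypothetical numbers and a plain gradient-ascent fit, not the library's actual model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Synthetic toy data (hypothetical numbers): the model tends to win when its
# output is longer, which is exactly the bias the LC metric tries to remove.
wins      = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]          # 1 = model preferred
len_diffs = [0.9, 0.6, 0.5, 0.4, 0.3, -0.1, -0.2, -0.4, -0.6, -0.8]

# Fit P(win) = sigmoid(theta + gamma * len_diff) by gradient ascent on the
# log-likelihood: theta captures model quality, gamma captures length bias.
theta = gamma = 0.0
lr = 0.1
for _ in range(5000):
    g_theta = g_gamma = 0.0
    for y, d in zip(wins, len_diffs):
        p = sigmoid(theta + gamma * d)
        g_theta += y - p
        g_gamma += (y - p) * d
    theta += lr * g_theta
    gamma += lr * g_gamma

raw_win_rate = 100.0 * sum(wins) / len(wins)
lc_win_rate = 100.0 * sigmoid(theta)  # predicted win rate at zero length diff
print(raw_win_rate, round(lc_win_rate, 1))
```

A large gap between the two numbers (as in the warning above) suggests length explains much of the raw preference signal, or that the fit itself is unreliable.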

My command to run it is alpaca_eval evaluate_from_model --model_configs 'mistral-7b-orpo' --reference_model_configs 'gpt4_1106_preview' --annotators_config 'alpaca_eval_gpt4'

Thank you!

Best regards, QQ

qingquansong avatar May 12 '24 20:05 qingquansong

This is very surprising indeed. Just to understand, why are you not using the default alpaca_eval 2?

i.e. alpaca_eval evaluate_from_model --model_configs 'mistral-7b-orpo'

Is the issue that you don't have access to logprobs? if so can you run

alpaca_eval evaluate_from_model --model_configs 'mistral-7b-orpo' --annotators_config 'alpaca_eval_clf_gpt4_turbo'

YannDubs avatar May 12 '24 21:05 YannDubs

Hey @YannDubs, thank you so much for the prompt response! We're using an in-house OpenAI API wrapper to call GPT-4 as the judge (roughly the same as the original code), and it works well for Alpaca 1.0. For 2.0 we don't use turbo (I guess turbo is only for cost-saving purposes?) and directly use gpt-4; the annotations look like:

{ "dataset":"helpful_base", "instruction":"What are the names of some famous actors that started their careers on Broadway?", "output_1":"1. Hugh Jackman: He is known for his roles in films, but he got his start in theater. He won a Tony Award for his role in \"The Boy from Oz.\"\n\n2. Idina Menzel: She started her career on Broadway, starring in shows like \"Rent\" and \"Wicked.\" She later gained fame for her role as Elsa in Disney's \"Frozen.\"\n\n3. Lin-Manuel Miranda: He is best known for creating and starring in the Broadway musicals \"In The Heights\" and \"Hamilton.\"\n\n4. Kristin Chenoweth: She started her career on Broadway in shows like \"You're a Good Man, Charlie Brown\" and \"Wicked.\"\n\n5. Sarah Jessica Parker: Before she was known for \"Sex and the City,\" she was a Broadway actress, appearing in shows like \"Annie.\"\n\n6. Matthew Broderick: He has had a successful film career, but he got his start on Broadway and has won two Tony Awards.\n\n7. Neil Patrick Harris: He is known for his roles in TV shows, but he has also had a successful career on Broadway, starring in shows like \"Hedwig and the Angry Inch.\"\n\n8. Nathan Lane: He is a Broadway legend, known for his roles in \"The Producers,\" \"Guys and Dolls,\" and \"The Lion King.\"\n\n9. Lea Michele: Before her breakout role on \"Glee,\" she was a Broadway actress, starring in \"Spring Awakening.\"\n\n10. Bernadette Peters: She is one of Broadway's most critically acclaimed performers, known for her roles in \"Into the Woods,\" \"Sunday in the Park with George,\" and \"Gypsy.\"", "generator_1":"gpt4_1106_preview", "output_2":"Broadway has been a launching pad for many famous actors, some of whom include:\n\n1. Meryl Streep - She made her Broadway debut in 1975 in the play \"The Playboy of Seville.\"\n2. Al Pacino - He made his Broadway debut in 1969 in the play \"Does a Tiger Wear a Necktie?\"\n3. Robert De Niro - He made his Broadway debut in 1969 in the play \"The Playroom.\"\n4. 
Denzel Washington - He made his Broadway debut in 1988 in the play \"Checkmates.\"\n5. Philip Seymour Hoffman - He made his Broadway debut in 1990 in the play \"As Bees in Honey Drown.\"\n6. Scarlett Johansson - She made her Broadway debut in 2010 in the play \"A View from the Bridge.\"\n7. Kevin Kline - He made his Broadway debut in 1972 in the play \"The Pirates of Penzance.\"\n8. Glenn Close - She made her Broadway debut in 1974 in the play \"Love for Love.\"\n9. Nathan Lane - He made his Broadway debut in 1982 in the play \"Laughter on the 23rd Floor.\"\n10. Audra McDonald - She made her Broadway debut in 1994 in the play \"Carousel.\"\n\nThese are just a few examples of the many actors who have started their careers on Broadway. The theater has a long history of launching the careers of talented performers.", "generator_2":"mistral-7b-orpo", "annotator":"alpaca_eval_gpt4", "preference":2.0, "time_per_example":35.4098809306, "raw_completion":"[\n {'model': 'model_1', 'rank': 1},\n {'model': 'model_2', 'rank': 2}\n]", "price_per_example":0.0 },

Do I need to collect any logprobs to compute the AlpacaEval 2 output, or is this annotations.json file good enough? (The output has 805 entries, sometimes fewer due to invalid API calls, such as the 802 shown in the messages above. I'm also not sure whether we should ideally get 805 * 2 outputs, swapping the order of the two responses for judging as well.) Thank you!
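A quick way to sanity-check an annotations.json like the record above is to parse it directly. Note that raw_completion uses single quotes, so it is a Python literal rather than valid JSON; the sample below is a hypothetical two-record file mirroring the fields shown:

```python
import ast

# Hypothetical sample mirroring the annotations.json records shown above;
# only the fields needed here are kept.
annotations = [
    {"preference": 2.0,
     "raw_completion": "[\n {'model': 'model_1', 'rank': 2},\n {'model': 'model_2', 'rank': 1}\n]"},
    {"preference": 1.0,
     "raw_completion": "[\n {'model': 'model_1', 'rank': 1},\n {'model': 'model_2', 'rank': 2}\n]"},
]

def parse_ranking(raw: str):
    # raw_completion uses single quotes, so json.loads would fail;
    # ast.literal_eval parses it as a Python literal instead.
    return ast.literal_eval(raw)

def raw_win_rate(annots) -> float:
    # As I understand it, preference == 2.0 means output_2 (the candidate
    # model) was preferred over output_1 (the baseline).
    prefs = [a["preference"] for a in annots if a.get("preference")]
    return 100.0 * sum(p == 2.0 for p in prefs) / len(prefs)

print(parse_ranking(annotations[0]["raw_completion"]))
print(raw_win_rate(annotations))  # 50.0 for this two-record sample
```

Checking that len(annotations) is close to 805 and that the preference distribution is not wildly lopsided is a cheap first diagnostic before digging into annotator configs.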

QQ

qingquansong avatar May 12 '24 21:05 qingquansong

> This is very surprising indeed. Just to understand, why are you not using the default alpaca_eval 2?
>
> i.e. alpaca_eval evaluate_from_model --model_configs 'mistral-7b-orpo'
>
> Is the issue that you don't have access to logprobs? if so can you run
>
> alpaca_eval evaluate_from_model --model_configs 'mistral-7b-orpo' --annotators_config 'alpaca_eval_clf_gpt4_turbo'

Oh, got your point. It seems the temperature settings differ (0 vs. 1), as well as several other configs in the two files. Let me try that. Thank you!

One question I'm confused about: the clf config seems to require the logprobs of the 'm' and 'M' tokens for calibration, is that correct? So if I'm using this config with the logprob parser, I need to be able to get logprobs from the API even though they're not used later in the metric computation (besides deciding between 'm' and 'M'). Thank you!
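For readers wondering why the logprobs matter at all: the idea behind a logprob-based classifier annotator is to renormalize the probabilities of the two answer tokens into a soft, calibrated preference instead of a hard 0/1 vote. This is only a sketch of that idea, not alpaca_eval's actual implementation:

```python
import math

def soft_preference(logprob_m: float, logprob_M: float) -> float:
    """Renormalize the two token probabilities; returns P('M'), i.e. the
    probability that the second output is preferred. Sketch only, not the
    library's actual code."""
    p_m, p_M = math.exp(logprob_m), math.exp(logprob_M)
    return p_M / (p_m + p_M)

# With logprobs you get a continuous preference in (0, 1); without them,
# a parser can only fall back to a hard decision based on which token
# the judge actually sampled.
print(soft_preference(math.log(0.25), math.log(0.75)))  # ~0.75
```

This is why an API that hides logprobs forces you onto annotators that parse sampled text only.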

qingquansong avatar May 12 '24 21:05 qingquansong

Is there also a way to avoid the logprob API with alpaca_eval_clf_gpt4_turbo if we cannot access logprobs, e.g. by still using the alpaca_eval_gpt4 annotator? Thank you!

Best regards, QQ

qingquansong avatar May 13 '24 20:05 qingquansong

Hey @YannDubs, unfortunately I cannot adapt our API to access logprobs, and it seems that alpaca_eval evaluate_from_model --model_configs 'mistral-7b-orpo' --annotators_config 'alpaca_eval_clf_gpt4_turbo' also needs logprobs? Do you happen to know whether it's possible to avoid them, e.g. with something similar to the standard GPT-4 config for AlpacaEval 2.0? Or maybe that's not possible, and we have to fall back from 2.0 to 1.0. Thank you!

Best regards, QQ

qingquansong avatar May 15 '24 22:05 qingquansong

My bad @qingquansong , use

alpaca_eval evaluate_from_model --model_configs 'mistral-7b-orpo' --annotators_config 'alpaca_eval_gpt4_turbo_fn' which doesn't require logprobs

YannDubs avatar May 17 '24 05:05 YannDubs

Hi, may I check: when you fine-tuned Mistral 7B, were you using the mistralai/Mistral-7B-Instruct-v0.2 checkpoint? When I run alpaca_eval on the outputs of Mistral Instruct v0.2, I get a high win rate too.

length_controlled_winrate  win_rate  standard_error  n_total  avg_length
                    38.05     27.20            1.57      805        1594

Update:

I somehow managed to resolve the high-win-rate problem by using a different Azure deployment name and API key during evaluation, but I'm not sure why that works. When I check the generated annotations.json, both runs specify that the annotator model is gpt4-1106. Sorry for the confusion caused.

hungchiayu1 avatar May 29 '24 10:05 hungchiayu1

Hey @YannDubs, I'd like to reopen this issue. When I use alpaca_eval evaluate_from_model --model_configs 'mistral-7b-orpo' --annotators_config 'alpaca_eval_gpt4_turbo_fn' as suggested, the results are still high: ~31-36 length-controlled win rate and ~50 win rate (no matter whether I use the gpt4 turbo or gpt4 annotator, with the gpt4-1106 reference outputs). I'm using mistral-orpo-beta, which should score around 11-12. Do you have any other suggestions? Thanks.

qingquansong avatar May 31 '24 07:05 qingquansong

@hungchiayu1 that's very surprising, what are the two deployment names and how do they differ?

YannDubs avatar May 31 '24 19:05 YannDubs

@qingquansong are you using the OpenAI API directly?

My guess in all the above is that the issue comes from using the wrong models and API deployment. Please run it with the OpenAI API directly and let me know if you still have the issue.

YannDubs avatar May 31 '24 19:05 YannDubs

Hey @YannDubs, thank you very much for the response! I'm pretty sure we're using the OpenAI API (though an in-house version from Microsoft), and we tried both GPT-4 and GPT-4 Turbo; both give the same high results. Would you mind running mistral-orpo-beta with the command you suggested (alpaca_eval evaluate_from_model --model_configs 'mistral-7b-orpo' --annotators_config 'alpaca_eval_gpt4_turbo_fn') so we have an oracle win rate to reproduce against? Thank you!

qingquansong avatar May 31 '24 21:05 qingquansong

Hey @YannDubs, from what I understand the two deployments differ in that one of them is gpt4-32k and the other is gpt-4.

hungchiayu1 avatar Jun 01 '24 06:06 hungchiayu1