Evaluation specifics
Hi!
I'm trying to evaluate a Mistral-7B-based model with custom locality and portability data. For each of the 50 edits I have 6 locality prompts and 2 portability ones.
How should I arrange the dicts to feed them into the edit function in that case? Will the variable below, passed as portability_inputs, work as intended?
portability_inputs = {
    'english': {
        'prompt': df_port['question_en'].tolist(),
        'ground_truth': df_port['label_en'].tolist()
    },
    'polish': {
        'prompt': df_port['question_pl'].tolist(),
        'ground_truth': df_port['label_pl'].tolist()
    }
}
And a technical question: are the metrics calculated after each edit? If so, is there an option to evaluate everything on the final model after the 50 sequential edits?
Thank you :)
Q1:
- Your usage is correct; just ensure that the number of items in prompt and ground_truth under each dimension, such as "english" and "polish", is consistent. A small consistency check is sketched below.
- You can also check whether the number of metrics recorded in the logs matches the number of input prompts.
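For reference, a rough sketch of how a locality dict could mirror your portability_inputs, plus a length check. The df_loc DataFrame, the column names, and the dimension names here are hypothetical placeholders for your own locality data; the layout assumes one prompt per edit per dimension, so six locality dimensions of 50 prompts each.

import pandas as pd

# Placeholder locality data, arranged the same way as portability_inputs:
# one sub-dict per locality dimension, each with aligned prompt/ground_truth lists.
df_loc = pd.read_csv('locality_data.csv')  # hypothetical file
locality_inputs = {
    'neighborhood': {
        'prompt': df_loc['question_neighborhood'].tolist(),
        'ground_truth': df_loc['label_neighborhood'].tolist()
    },
    'distracting': {
        'prompt': df_loc['question_distracting'].tolist(),
        'ground_truth': df_loc['label_distracting'].tolist()
    }
    # ... and so on for the remaining locality dimensions.
}

# Sanity check: every prompt list must pair with a ground_truth list of the
# same length, and both should match the number of edits (50 here).
def check_lengths(inputs, n_edits=50):
    for dim, data in inputs.items():
        assert len(data['prompt']) == len(data['ground_truth']), \
            f"{dim}: prompt/ground_truth length mismatch"
        assert len(data['prompt']) == n_edits, \
            f"{dim}: expected {n_edits} items, got {len(data['prompt'])}"

check_lengths(portability_inputs)
check_lengths(locality_inputs)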
Q2:
- This feature, a unified evaluation after all edits are complete, hasn't been implemented yet, but you can refer to the pseudocode in #220; a rough sketch of the idea is also given below. I will improve this feature in the next version. Thank you!
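In the meantime, here is a minimal sketch of the idea, not the #220 pseudocode: apply all 50 edits sequentially, keep only the final edited model, then run your own evaluation loop over every locality and portability prompt. The edited_model name is a placeholder for whatever your sequential-editing code returns, and the string-containment check is a crude stand-in for the library's own metrics; only the Hugging Face tokenize/generate/decode calls are concrete.

import torch
from transformers import AutoTokenizer

def evaluate_final_model(model, tokenizer, inputs, max_new_tokens=30):
    """Run every prompt through the final edited model and score it against
    ground truth. `inputs` uses the same layout as portability_inputs /
    locality_inputs above."""
    results = {}
    model.eval()
    for dim, data in inputs.items():
        hits = 0
        for prompt, truth in zip(data['prompt'], data['ground_truth']):
            enc = tokenizer(prompt, return_tensors='pt').to(model.device)
            with torch.no_grad():
                out = model.generate(**enc, max_new_tokens=max_new_tokens,
                                     do_sample=False)
            # Decode only the newly generated tokens, not the prompt itself.
            gen = tokenizer.decode(out[0, enc['input_ids'].shape[1]:],
                                   skip_special_tokens=True)
            # Simple containment check; swap in your preferred metric here.
            hits += int(truth.strip().lower() in gen.strip().lower())
        results[dim] = hits / len(data['prompt'])
    return results

# edited_model = <final model after the 50 sequential edits>  (placeholder)
# tokenizer = AutoTokenizer.from_pretrained(<your Mistral checkpoint>)
# print(evaluate_final_model(edited_model, tokenizer, portability_inputs))
# print(evaluate_final_model(edited_model, tokenizer, locality_inputs))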
Hi, do you have any further questions?
Nothing as of now, thanks :)