
Unify llm judges into a single prepare file

Open martinscooper opened this issue 9 months ago • 3 comments

This PR moves the judges in prepare/metrics/llm_as_judge/direct/llama_3_3_70b_instruct_adherence_completeness.py to prepare/metrics/llm_as_judge/llm_as_judge.py so that:

  • judges and the underlying inference engine are created with the same inference engine/judge parameter set (for example, temperature = 0)
  • all new LLM judge approaches are created in the same file (it is a bit cumbersome to have to run multiple files when making changes to the artifacts).

It also moves the criteria definitions adherence_with_format and answer_completeness to llm_as_judge_constants.py, and uses the criteria's catalog names instead of the objects.

@lilacheden the default context_fields of the adherence metric seem a bit too specific, especially the instructions entry:

"context_fields": {
    "question": "question",
    "instructions": "metadata/template/instruction",
}

Do you think we could simplify it?

@elronbandel I tried setting those context field values using the square bracket notation, but it says the expression is malformed. Could you remind me whether dictionaries are supported there?
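For reference, a dict-valued context_fields as shown above maps a judge-facing name to a slash-separated path into the task data. The following is a minimal plain-Python sketch of that lookup, not unitxt's actual implementation:

```python
# Sketch (not unitxt's implementation): resolve a dict-valued
# "context_fields" against a nested task instance.
def resolve_path(data, path):
    """Follow a slash-separated path like 'metadata/template/instruction'."""
    value = data
    for key in path.split("/"):
        value = value[key]
    return value

def build_context(task_instance, context_fields):
    """Map each judge-facing name to the value found at its path."""
    return {name: resolve_path(task_instance, path)
            for name, path in context_fields.items()}

instance = {
    "question": "What is the capital of France?",
    "metadata": {"template": {"instruction": "Answer concisely."}},
}
context_fields = {
    "question": "question",
    "instructions": "metadata/template/instruction",
}
print(build_context(instance, context_fields))
# {'question': 'What is the capital of France?', 'instructions': 'Answer concisely.'}
```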

martinscooper — Mar 21 '25 16:03


Hi @martinscooper ,

  1. What do you mean by "too specific"? The judge requires the instructions of the original prompt, and this is where they can be found (at least for the relevant task). I don't know of any general way to get them.

  2. It makes sense to always create the judges from the catalog. From now on I will always use the registered LLM judge and just override the criteria, context fields, or any other desired attributes instead of creating a new judge.

  3. However, I'm not sure all LLM judges and criteria should be prepared and stored together. Maybe it's better to have a public catalog for everyone with the suggested criteria, and a private catalog (with separate preparation scripts) where each user (like myself) can create their own esoteric criteria and judges, just as users can create them on the fly. That would help, for example, if someone wants to use a criterion similar to one in the public catalog but described differently for their own use case.

How does that sound to you?
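Point 2 above, reusing a registered judge and overriding only the attributes that differ, could be sketched in plain Python. This is not the unitxt API; the class, catalog keys, and attribute names are all illustrative:

```python
# Sketch of "fetch the registered judge and override attributes"
# vs. constructing a new judge in each preparation file.
from dataclasses import dataclass, field, replace

@dataclass
class Judge:
    criteria: str
    context_fields: dict = field(default_factory=dict)
    temperature: float = 0.0  # shared inference-engine parameter

# Hypothetical "catalog" of registered judges.
CATALOG = {
    "llm_as_judge.direct.llama_3_3_70b": Judge(
        criteria="answer_relevance",
        context_fields={"question": "question"},
    ),
}

# Copy the registered judge and override only what differs; the shared
# inference parameters (e.g. temperature) are inherited unchanged.
base = CATALOG["llm_as_judge.direct.llama_3_3_70b"]
adherence_judge = replace(
    base,
    criteria="adherence_with_format",
    context_fields={"question": "question",
                    "instructions": "metadata/template/instruction"},
)
print(adherence_judge.criteria)     # adherence_with_format
print(adherence_judge.temperature)  # 0.0 (inherited from the base judge)
```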

lilacheden — Mar 24 '25 15:03


@lilacheden

  1. By "specific" I mean that the instructions entry of the context fields should have a simpler field name by default. I would define context_fields simply as a list, which most users probably wouldn't have to change:
{
  ...,
  "context_fields":  ["question", "instruction"],
  ...,
}

Then, if a user needs a context field to be taken from a more specific source, they could specify it manually for their use case.

  2. Agree.

  3. Sounds good. It is true that it is not that important to have all registered judges in the same file, so I would move the judges back to their original files. Do you agree on calling get_evaluator_metadata() so that the params are consistent across all the judges?
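The list proposal in point 1 could be treated as shorthand for an identity mapping and normalized internally. A sketch under that assumption, not unitxt's actual behavior:

```python
# Sketch: accept either a list or a dict for context_fields,
# treating a list as shorthand for {name: name}.
def normalize_context_fields(context_fields):
    if isinstance(context_fields, list):
        return {name: name for name in context_fields}
    return dict(context_fields)

print(normalize_context_fields(["question", "instruction"]))
# {'question': 'question', 'instruction': 'instruction'}
```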

martinscooper — Apr 01 '25 16:04

@yoavkatz @elronbandel I applied the fix.

martinscooper — Apr 09 '25 12:04