
[Bug]: Custom Models not working in LLM as a Judge Evaluation technique.

Open Komal-99 opened this issue 9 months ago • 7 comments

What component(s) are affected?

  • [x] Python SDK
  • [ ] Opik UI
  • [ ] Opik Server
  • [x] Documentation

Opik version

  • Opik version: 1.4.8

Describe the problem

The LLM-as-a-Judge metrics for Hallucination and AnswerRelevance work fine with the direct OpenAI integration, but when I try to use them with a custom LLM from Hugging Face or Ollama, they throw an error and fail to calculate the metric score. This is how I load a custom model and, as described in the documentation, pass it to the metric functions. Image

Below is the error I always receive.

Image

Reproduction steps

Run an experiment evaluation on any Opik project with a custom LLM as the LLM-as-a-Judge model. I have tried various models and hit the same issue every time.

from datetime import datetime

import backoff
from fastapi import FastAPI, HTTPException
from openai import APIConnectionError  # or whichever connection error type your app uses
from opik.evaluation import evaluate, models
from opik.evaluation.metrics import AnswerRelevance, Hallucination

app = FastAPI()
# `dataset`, `evaluation_task`, and `model` are defined elsewhere in the application.

# Judge model served locally via Ollama
model1 = models.LiteLLMChatModel(model_name="ollama/gemma2:2b", base_url="http://localhost:11434")

@app.post("/run_evaluation/")
@backoff.on_exception(backoff.expo, (APIConnectionError, Exception), max_tries=3, max_time=300)
def run_evaluation():
    experiment_name = f"Deepseek_{dataset.name}_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
    metrics = [Hallucination(model=model1), AnswerRelevance(model=model1)]
    try:
        evaluate(
            experiment_name=experiment_name,
            dataset=dataset,
            task=evaluation_task,
            scoring_metrics=metrics,
            experiment_config={"model": model},
            task_threads=2,
        )
        return {"message": "Evaluation completed successfully"}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Komal-99 avatar Feb 23 '25 11:02 Komal-99

Hi @Komal-99! We're using LiteLLMChatModel with the response_format argument when the model supports it. For example, for hallucination the response format is:

from typing import List

import pydantic

class HallucinationResponseFormat(pydantic.BaseModel):
    score: float
    reason: List[str]
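
For context, this is roughly what passing such a structured output format through LiteLLM looks like (a sketch for illustration, not Opik's exact internals):

import litellm

# Ask LiteLLM to enforce the Pydantic schema via structured outputs.
# This only works when the provider/model actually supports response_format.
response = litellm.completion(
    model="ollama/gemma2:2b",
    messages=[{"role": "user", "content": "Judge the answer and return a score and reasons."}],
    response_format=HallucinationResponseFormat,
)
print(response.choices[0].message.content)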

In addition, we also ask the model in the prompt to provide the answer in the following format:

It is crucial that you provide your answer in the following JSON format:
{{
    "score": <your score between 0.0 and 1.0>,
    "reason": ["some reason 1", "some reason 2"]
}}

If your model doesn't support structured outputs (i.e. the response_format argument), it may ignore the instruction in the prompt (a classic LLM issue :). If that's the case, I'd recommend implementing your own custom metric by inheriting from BaseMetric. In your metric you can reuse our templates but relax the output format to fit your needs.
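
Here is a minimal sketch of what such a custom metric could look like (my own illustration, not an official Opik recipe): it calls the judge model through LiteLLMChatModel's generate_string helper and parses the JSON leniently instead of relying on response_format; the class name and prompt are placeholders.

import json
from typing import Any

from opik.evaluation import models
from opik.evaluation.metrics import base_metric, score_result

class LenientJudgeMetric(base_metric.BaseMetric):
    """Judge metric that parses free-form JSON instead of requiring response_format."""

    def __init__(self, model: models.LiteLLMChatModel, name: str = "lenient_judge"):
        self.name = name
        self._model = model

    def score(self, input: str, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
        prompt = (
            "Rate how well the OUTPUT answers the INPUT on a 0.0-1.0 scale.\n"
            f"INPUT: {input}\nOUTPUT: {output}\n"
            'Reply with JSON only: {"score": <0.0-1.0>, "reason": "<short reason>"}'
        )
        raw = self._model.generate_string(input=prompt)
        try:
            # Tolerate extra text around the JSON object in the model's reply.
            parsed = json.loads(raw[raw.index("{"): raw.rindex("}") + 1])
            return score_result.ScoreResult(
                value=float(parsed["score"]), name=self.name, reason=str(parsed.get("reason", ""))
            )
        except (ValueError, KeyError):
            return score_result.ScoreResult(
                value=0.0, name=self.name, reason=f"Could not parse judge output: {raw[:200]}"
            )

A metric like this can then be passed in scoring_metrics just like the built-in ones, e.g. LenientJudgeMetric(model=model1).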

alexkuzmik avatar Feb 24 '25 14:02 alexkuzmik

Can you share which custom models are supported for this metric, so I can try one and see whether it works?


Komal-99 avatar Feb 24 '25 15:02 Komal-99

In an ideal world, if "response_format" in litellm.get_supported_openai_params("ollama/gemma2:2b") is True, then the model supports structured outputs and the metric should work fine. But sometimes that result is misleading and the parameter is in fact not supported. I'd recommend opening an issue in the litellm repository for that; this is a pretty new model, so it may be fixed soon.

The LiteLLM documentation lists the providers and models that support structured outputs.
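
A quick way to run that check locally (as noted above, the result can still be misleading for some providers):

import litellm

# List the OpenAI-compatible parameters LiteLLM believes this model accepts.
supported = litellm.get_supported_openai_params("ollama/gemma2:2b") or []
print("response_format" in supported)  # True does not guarantee the provider honors it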

alexkuzmik avatar Feb 24 '25 16:02 alexkuzmik

I’m having the same issue. @Komal-99, were you able to resolve it?

camucamulemon7 avatar Jun 11 '25 13:06 camucamulemon7

I am having the same issue. I can get Ollama to work using a proxy for the Playground but not as an Online Evaluation provider.

stugorf avatar Jun 12 '25 10:06 stugorf

@alexkuzmik

camucamulemon7 avatar Jun 23 '25 15:06 camucamulemon7

Possible duplicate of https://github.com/comet-ml/opik/issues/3220

vincentkoc avatar Nov 05 '25 06:11 vincentkoc