
[Bug]: Custom Models not working in LLM as a Judge Evaluation technique.

Open Komal-99 opened this issue 9 months ago • 7 comments

What component(s) are affected?

  • [x] Python SDK
  • [ ] Opik UI
  • [ ] Opik Server
  • [x] Documentation

Opik version

  • Opik version: 1.4.8

Describe the problem

The LLM-as-a-Judge metrics for Hallucination and AnswerRelevance work fine with the direct OpenAI integration, but when I try to use them with a custom LLM from Hugging Face or Ollama, they throw an error and fail to calculate the metric score. This is how I load a custom model and, as described in the documentation, pass it to the metric functions. Image

Below is the error I always receive.

Image

Reproduction steps

Run an experiment evaluation on any Opik project with a custom LLM as the LLM-as-a-Judge model. I have tried various models and hit the same issue every time.

from datetime import datetime

import backoff
from fastapi import FastAPI, HTTPException
from openai import APIConnectionError  # or whichever connection error type your app uses
from opik.evaluation import evaluate, models
from opik.evaluation.metrics import AnswerRelevance, Hallucination

app = FastAPI()
# `dataset`, `evaluation_task`, and `model` are defined elsewhere in the application.

# Judge model served locally via Ollama
model1 = models.LiteLLMChatModel(model_name="ollama/gemma2:2b", base_url="http://localhost:11434")

@app.post("/run_evaluation/")
@backoff.on_exception(backoff.expo, (APIConnectionError, Exception), max_tries=3, max_time=300)
def run_evaluation():
    experiment_name = f"Deepseek_{dataset.name}_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
    metrics = [Hallucination(model=model1), AnswerRelevance(model=model1)]
    try:
        evaluate(
            experiment_name=experiment_name,
            dataset=dataset,
            task=evaluation_task,
            scoring_metrics=metrics,
            experiment_config={"model": model},
            task_threads=2,
        )
        return {"message": "Evaluation completed successfully"}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Komal-99 avatar Feb 23 '25 11:02 Komal-99

Hi @Komal-99! We're using LiteLLMChatModel with the response_format argument when the model supports it. For example, for hallucination the response format is:

from typing import List

import pydantic

class HallucinationResponseFormat(pydantic.BaseModel):
    score: float
    reason: List[str]
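
For context, this is roughly what passing such a structured output format through LiteLLM looks like (a sketch for illustration, not Opik's exact internals):

import litellm

# Ask LiteLLM to enforce the Pydantic schema via structured outputs.
# This only works when the provider/model actually supports response_format.
response = litellm.completion(
    model="ollama/gemma2:2b",
    messages=[{"role": "user", "content": "Judge the answer and return a score and reasons."}],
    response_format=HallucinationResponseFormat,
)
print(response.choices[0].message.content)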

In addition, we also ask the model in the prompt to provide the answer in the following format:

It is crucial that you provide your answer in the following JSON format:
{{
    "score": <your score between 0.0 and 1.0>,
    "reason": ["some reason 1", "some reason 2"]
}}

If your model doesn't support structured outputs (i.e. the response_format argument), it may ignore the instruction in the prompt (a classic LLM issue :). If that's the case, I'd recommend implementing your own custom metric by inheriting from BaseMetric. In your metric you can reuse our templates but relax the output format to fit your needs.
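
Here is a minimal sketch of what such a custom metric could look like (my own illustration, not an official Opik recipe): it calls the judge model through LiteLLMChatModel's generate_string helper and parses the JSON leniently instead of relying on response_format; the class name and prompt are placeholders.

import json
from typing import Any

from opik.evaluation import models
from opik.evaluation.metrics import base_metric, score_result

class LenientJudgeMetric(base_metric.BaseMetric):
    """Judge metric that parses free-form JSON instead of requiring response_format."""

    def __init__(self, model: models.LiteLLMChatModel, name: str = "lenient_judge"):
        self.name = name
        self._model = model

    def score(self, input: str, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
        prompt = (
            "Rate how well the OUTPUT answers the INPUT on a 0.0-1.0 scale.\n"
            f"INPUT: {input}\nOUTPUT: {output}\n"
            'Reply with JSON only: {"score": <0.0-1.0>, "reason": "<short reason>"}'
        )
        raw = self._model.generate_string(input=prompt)
        try:
            # Tolerate extra text around the JSON object in the model's reply.
            parsed = json.loads(raw[raw.index("{"): raw.rindex("}") + 1])
            return score_result.ScoreResult(
                value=float(parsed["score"]), name=self.name, reason=str(parsed.get("reason", ""))
            )
        except (ValueError, KeyError):
            return score_result.ScoreResult(
                value=0.0, name=self.name, reason=f"Could not parse judge output: {raw[:200]}"
            )

A metric like this can then be passed in scoring_metrics just like the built-in ones, e.g. LenientJudgeMetric(model=model1).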

alexkuzmik avatar Feb 24 '25 14:02 alexkuzmik

Can you share which custom models are supported for this metric, so I can try one and see whether it works?


Komal-99 avatar Feb 24 '25 15:02 Komal-99

In an ideal world, if "response_format" in litellm.get_supported_openai_params("ollama/gemma2:2b") is True, then the model supports structured outputs and the metric should work fine. But sometimes that result is misleading and the parameter is in fact not supported. I'd recommend opening an issue in the litellm repository for that; this is a pretty new model, so it may be fixed soon.

The LiteLLM documentation lists the providers and models that support structured outputs.
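
A quick way to run that check locally (as noted above, the result can still be misleading for some providers):

import litellm

# List the OpenAI-compatible parameters LiteLLM believes this model accepts.
supported = litellm.get_supported_openai_params("ollama/gemma2:2b") or []
print("response_format" in supported)  # True does not guarantee the provider honors it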

alexkuzmik avatar Feb 24 '25 16:02 alexkuzmik

I’m having the same issue. @Komal-99, were you able to resolve it?

camucamulemon7 avatar Jun 11 '25 13:06 camucamulemon7

I am having the same issue. I can get Ollama to work using a proxy for the Playground but not as an Online Evaluation provider.

stugorf avatar Jun 12 '25 10:06 stugorf

@alexkuzmik

camucamulemon7 avatar Jun 23 '25 15:06 camucamulemon7

Possible duplicate of https://github.com/comet-ml/opik/issues/3220

vincentkoc avatar Nov 05 '25 06:11 vincentkoc