[Bug]: Custom Models not working in LLM as a Judge Evaluation technique.
What component(s) are affected?
- [x] Python SDK
- [ ] Opik UI
- [ ] Opik Server
- [x] Documentation
Opik version
- Opik version: 1.4.8
Describe the problem
The LLM-as-a-Judge metrics for Hallucination and AnswerRelevance work fine with the direct OpenAI integration, but when I try to use them with a custom LLM from Hugging Face or Ollama, they throw an error and fail to calculate the metric score.
This is how I load the custom model and, as per the documentation, pass it to the metric functions; the error I always receive is shown below.
Reproduction steps
Run an experiment evaluation on any Opik project with a custom LLM model as the LLM-as-a-Judge model. I have tried it with various different models, but I hit the same issue every time.
```python
import backoff
from datetime import datetime
from fastapi import FastAPI, HTTPException
from openai import APIConnectionError  # or wherever APIConnectionError comes from in your stack
from opik.evaluation import evaluate, models
from opik.evaluation.metrics import AnswerRelevance, Hallucination

app = FastAPI()

# dataset, evaluation_task and model are defined elsewhere in the application.
model1 = models.LiteLLMChatModel(model_name="ollama/gemma2:2b", base_url="http://localhost:11434")


@app.post("/run_evaluation/")
@backoff.on_exception(backoff.expo, (APIConnectionError, Exception), max_tries=3, max_time=300)
def run_evaluation():
    experiment_name = f"Deepseek_{dataset.name}_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
    metrics = [Hallucination(model=model1), AnswerRelevance(model=model1)]
    try:
        evaluate(
            experiment_name=experiment_name,
            dataset=dataset,
            task=evaluation_task,
            scoring_metrics=metrics,
            experiment_config={"model": model},
            task_threads=2,
        )
        return {"message": "Evaluation completed successfully"}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
Hi @Komal-99! We're using LiteLLMChatModel with the response_format argument when the model supports it.
For example, for hallucination the response format is:
```python
class HallucinationResponseFormat(pydantic.BaseModel):
    score: float
    reason: List[str]
```
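(For illustration only, not Opik's actual internals: roughly how such a pydantic response format gets passed through litellm when the provider supports structured outputs. The model name and prompt below are placeholders.)

```python
import litellm

# HallucinationResponseFormat is the pydantic model defined above.
# response_format only takes effect for providers/models that actually support
# structured outputs (json_schema); otherwise it may be ignored or rejected.
response = litellm.completion(
    model="ollama/gemma2:2b",
    messages=[{"role": "user", "content": "Score the hallucination of the given answer."}],
    response_format=HallucinationResponseFormat,
)
print(response.choices[0].message.content)  # expected to be JSON matching the schema
```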
In addition, we also ask the model in the prompt to provide the answer in the following format:
It is crucial that you provide your answer in the following JSON format:
{{
"score": <your score between 0.0 and 1.0>,
"reason": ["some reason 1", "some reason 2"]
}}
If your model doesn't support structured outputs (i.e. the response_format argument), it may ignore the instruction in the prompt (a classic LLM issue :). If that's the case, I'd recommend implementing your own custom metric by inheriting from BaseMetric. In your metric you can reuse our templates but relax the output format to suit your needs.
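For example, here is a minimal sketch of such a custom metric, assuming the BaseMetric / ScoreResult interface from the Opik docs; the prompt, the direct litellm.completion call and the lenient number parsing are illustrative choices, not Opik's built-in implementation:

```python
import re
from typing import Any

import litellm
from opik.evaluation.metrics import base_metric, score_result


class LenientHallucination(base_metric.BaseMetric):
    """Judge metric that tolerates models without structured-output support."""

    def __init__(self, model_name: str = "ollama/gemma2:2b", name: str = "lenient_hallucination"):
        self.name = name
        self._model_name = model_name

    def score(self, input: str, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Ask for a bare number instead of strict JSON, since the model may
        # ignore response_format / JSON-format instructions.
        prompt = (
            "Rate how much the ANSWER hallucinates relative to the QUESTION "
            "on a scale from 0.0 to 1.0. Reply with just the number.\n"
            f"QUESTION: {input}\nANSWER: {output}"
        )
        response = litellm.completion(
            model=self._model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.choices[0].message.content or ""
        # Lenient parsing: grab the first float-looking token.
        match = re.search(r"\d+(?:\.\d+)?", text)
        value = min(max(float(match.group()), 0.0), 1.0) if match else 0.0
        return score_result.ScoreResult(value=value, name=self.name, reason=text.strip())
```

An instance of it can then be passed to evaluate() via scoring_metrics just like the built-in metrics.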
Can you share which custom models are supported for this metric, so that I can check whether the one I'm using is supported?
In an ideal world, if `"response_format" in litellm.get_supported_openai_params("ollama/gemma2:2b")` is True, then the model supports structured outputs and the metric should work fine.
But sometimes that output is misleading and the parameter is in fact not supported. I'd recommend opening an issue in the litellm repository for that; this is a pretty new model, so it may be fixed soon.
On that page you can find information about the providers and models that support structured outputs.
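For reference, the check looks roughly like this (a sketch; as noted above, the result can be misleading for some providers):

```python
import litellm

# List the OpenAI-style parameters litellm believes this model accepts.
supported_params = litellm.get_supported_openai_params("ollama/gemma2:2b") or []

# True should mean structured outputs work; in practice, also verify against the
# litellm provider docs and by testing the model directly.
print("response_format" in supported_params)
```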
I’m having the same issue. @Komal-99, were you able to resolve it?
I am having the same issue. I can get Ollama to work using a proxy for the Playground but not as an Online Evaluation provider.
@alexkuzmik
Possible duplicate of https://github.com/comet-ml/opik/issues/3220