[BUG] When using Claude as an LLM-as-a-judge model via Bedrock, the response is NaN
Issues Policy acknowledgement
- [X] I have read and agree to submit bug reports in accordance with the issues policy
Where did you encounter this bug?
Local machine
Willingness to contribute
Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
MLflow version
- Client: 2.11.0
- Tracking server: 2.11.0
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 14.3.1
- Python version: 3.10.0
- yarn version, if running the dev UI:
Describe the problem
I am trying to use a custom LLM as a judge model, but using endpoints:/completions doesn't work. The resulting metrics are NaN.
We get the following error when using any of the default metrics defined under mlflow.metrics.genai.metric_definitions:
'Failed to score model on payload. Error: 422 Client Error: Unprocessable Entity for url: http://127.0.0.1:5000/endpoints/completions/invocations. Response text: {"detail":"Cannot set both 'temperature' and 'top_p' parameters. Please use only the temperature parameter for your query."}'
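For what it's worth, the same 422 can be reproduced outside of mlflow.evaluate by posting to the invocations route directly with both parameters set (a minimal sketch, assuming the deployments server is running locally as configured below):

import requests

# Sketch: send both 'temperature' and 'top_p', which is what the judge metric
# does with its hardcoded defaults; the server should answer with the same 422.
resp = requests.post(
    "http://127.0.0.1:5000/endpoints/completions/invocations",
    json={"prompt": "Say hello", "temperature": 0.0, "top_p": 1.0, "max_tokens": 200},
)
print(resp.status_code, resp.text)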
I believe this is due to mlflow.metrics.genai.prompts.v1.default_parameters being hardcoded as:

default_parameters = {
    "temperature": 0.0,
    "max_tokens": 200,
    "top_p": 1.0,
}
This does not seem to be compatible with Claude models provided via AWS Bedrock.
Changing this to the following does work:
default_parameters = { "temperature": 0.0, "max_tokens": 200, }
However, there is currently no friendly way to modify these default parameters at runtime through the API when initialising a metric function under mlflow.metrics.genai.metric_definitions.
Making this possible, e.g. by adding a parameters argument to the API as follows, could fix it:
from mlflow.metrics.genai import relevance
relevance_metric = relevance(model="endpoints:/completions", parameters={"temperature": 0.0})
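Until something like that exists, the closest workaround through the public API that I can see is rebuilding the metric with mlflow.metrics.genai.make_genai_metric, which already accepts a parameters argument. A rough sketch (it reuses internal attributes of the v1 RelevanceMetric prompt class, so treat the attribute names as assumptions that may change between releases):

from mlflow.metrics.genai import make_genai_metric
from mlflow.metrics.genai.prompts.v1 import RelevanceMetric

# Sketch: rebuild the built-in relevance metric so we can supply our own
# parameters dict (without top_p). The RelevanceMetric attribute names below
# are taken from the MLflow 2.11 source and are not a stable public API.
relevance_metric = make_genai_metric(
    name="relevance",
    definition=RelevanceMetric.definition,
    grading_prompt=RelevanceMetric.grading_prompt,
    examples=RelevanceMetric.default_examples,
    version="v1",
    model="endpoints:/completions",
    grading_context_columns=RelevanceMetric.grading_context_columns,
    parameters={"temperature": 0.0, "max_tokens": 200},  # no top_p
    greater_is_better=True,
)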
I've set up my config.yaml as follows:
endpoints:
  - name: completions
    endpoint_type: llm/v1/completions
    model:
      provider: bedrock
      name: anthropic.claude-v2
      config:
        aws_config:
          aws_secret_access_key: <redacted>
          aws_access_key_id: <redacted>
          aws_region: us-east-1
    limit:
      renewal_period: minute
      calls: 10
Deployed the server using:
mlflow deployments start-server --config-path config.yaml
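Querying the endpoint directly through the deployments client, with only temperature and no top_p, can be used to confirm the endpoint itself is healthy (a sketch against the same local server):

from mlflow.deployments import get_deploy_client

# Sketch: call the Bedrock-backed completions endpoint with only 'temperature'.
# This request should be accepted because 'top_p' is not included.
client = get_deploy_client("http://127.0.0.1:5000")
response = client.predict(
    endpoint="completions",
    inputs={"prompt": "What does the useEffect() hook do?", "temperature": 0.0, "max_tokens": 200},
)
print(response)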
Tracking information
REPLACE_ME
Code to reproduce issue
import mlflow
from mlflow.deployments import set_deployments_target
import pandas as pd
from mlflow.metrics.genai import relevance

set_deployments_target("http://127.0.0.1:5000/")

from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.llms.bedrock import Bedrock

template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

llm = Bedrock(
    region_name="us-east-1",
    model_kwargs={
        "max_tokens_to_sample": 300,
        "temperature": 0.1,
        "top_k": 250,
        "top_p": 0.1,
    },
    model_id="anthropic.claude-v2:1",
    streaming=False,
)

chain = (
    RunnablePassthrough()
    | prompt
    | llm
    | StrOutputParser()
)

def model(input_df):
    answers = []
    for index, row in input_df.iterrows():
        answer = chain.invoke({"question": row["questions"], "context": row["context"]})
        answers.append(answer)
    return answers

eval_df = pd.DataFrame(
    {
        "questions": [
            "How does useEffect() work?",
            "What does the static keyword in a function mean?",
            "What does the 'finally' block in Python do?",
            "What is the difference between multiprocessing and multithreading?",
        ],
        "context": [
            "The useEffect() hook tells React that your component needs to do something after render. React will remember the function you passed (we’ll refer to it as our “effect”), and call it later after performing the DOM updates.",
            "Static members belongs to the class, rather than a specific instance. This means that only one instance of a static member exists, even if you create multiple objects of the class, or if you don't create any. It will be shared by all objects.",
            "'Finally' defines a block of code to run when the try... except...else block is final. The finally block will be executed no matter if the try block raises an error or not.",
            "Multithreading refers to the ability of a processor to execute multiple threads concurrently, where each thread runs a process. Whereas multiprocessing refers to the ability of a system to run multiple processors in parallel, where each processor can run one or more threads.",
        ],
        "ground_truth": [
            "The useEffect() hook tells React that your component needs to do something after render. React will remember the function you passed (we’ll refer to it as our “effect”), and call it later after performing the DOM updates.",
            "Static members belongs to the class, rather than a specific instance. This means that only one instance of a static member exists, even if you create multiple objects of the class, or if you don't create any. It will be shared by all objects.",
            "'Finally' defines a block of code to run when the try... except...else block is final. The finally block will be executed no matter if the try block raises an error or not.",
            "Multithreading refers to the ability of a processor to execute multiple threads concurrently, where each thread runs a process. Whereas multiprocessing refers to the ability of a system to run multiple processors in parallel, where each processor can run one or more threads.",
        ],
    }
)

MODEL = "endpoints:/completions"
relevance_metric = relevance(model=MODEL)

results = mlflow.evaluate(
    model,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    predictions="result",
    extra_metrics=[relevance_metric, mlflow.metrics.latency()],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "context",
        }
    },
)
Stack trace
> results.tables["eval_results_table"]["relevance/v1/score"][0]
nan
> results.tables["eval_results_table"]["relevance/v1/justification"][0]
'Failed to score model on payload. Error: 422 Client Error: Unprocessable Entity for url: http://127.0.0.1:5000/endpoints/completions/invocations. Response text: {"detail":"Cannot set both \'temperature\' and \'top_p\' parameters. Please use only the temperature parameter for your query."}'
Other info / logs
What component(s) does this bug affect?
- [ ] area/artifacts: Artifact stores and artifact logging
- [ ] area/build: Build and test infrastructure for MLflow
- [X] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
- [ ] area/docs: MLflow documentation pages
- [X] area/examples: Example code
- [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- [ ] area/models: MLmodel format, model serialization/deserialization, flavors
- [ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- [ ] area/projects: MLproject format, project running backends
- [X] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- [ ] area/server-infra: MLflow Tracking server backend
- [ ] area/tracking: Tracking Service, tracking client APIs, autologging
What interface(s) does this bug affect?
- [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- [ ] area/windows: Windows support
What language(s) does this bug affect?
- [ ] language/r: R APIs and clients
- [ ] language/java: Java APIs and clients
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] integrations/azure: Azure and Azure ML integrations
- [ ] integrations/sagemaker: SageMaker integrations
- [ ] integrations/databricks: Databricks integrations
@daanalfa Thanks for reporting this issue! Great write-up! Adding a parameters argument sounds good to me. @BenWilson2 What do you think?
Would the parameters arg update the defaults or replace them?
I think they should replace; otherwise we can't work around the issue.
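To illustrate the difference with a hypothetical user override (not actual MLflow code): a merge would keep the offending key, while a replace drops it:

default_parameters = {"temperature": 0.0, "max_tokens": 200, "top_p": 1.0}
user_parameters = {"temperature": 0.0, "max_tokens": 200}

merged = {**default_parameters, **user_parameters}  # top_p survives -> Bedrock still rejects the call
replaced = user_parameters                          # top_p is gone -> the call can go through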
Right, and that would behave completely differently from how other defaults are overridden in MLflow's GenAI interfaces. The underlying configuration issue that is causing the bug with Cohere should be fixed first, and then we could expose parameter-update logic as a separate PR to enable customization :)
@BenWilson2 what's the bug with Cohere? Did you mean Claude?
The only way to solve the configuration issue without parameter-update logic would be to modify the default parameters for the judge LLM, but this would materially change the default judge LLM implementation for OpenAI models.
@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
Hey everyone, is there an update on this? We have hit the same issue with other Bedrock models as well, so it's not just a problem with Claude (we are using Titan Express for the time being).
Hi everyone, I am encountering a similar issue when trying to reference an Amazon Bedrock endpoint as my LLM judge, as described on this MLflow documentation page, regardless of the underlying model provider. Have you had any update on this matter? Many thanks
Hello!
It's also happening for me. An update on this issue (or a workaround in the meantime) would be much appreciated.
Thank you, Laura
Hi, any updates? We have the same issue. You say that Amazon Bedrock is supported here https://mlflow.org/docs/latest/llms/deployments/index.html#supported-provider-models:~:text=Supported%20Provider%20Models but it doesn't seem to be true :(
OK, here is a workaround (very hacky, but it works, at least for now):
import mlflow.metrics.genai.model_utils as mutils

orig_score_model_on_payload = mutils.score_model_on_payload

def wrapper_score_model_on_payload(model_uri, payload, eval_parameters=None):
    # Override the parameters passed as the third argument with Bedrock-friendly ones.
    return orig_score_model_on_payload(model_uri, payload, {"temperature": 0.0, "max_tokens": 200})

mutils.score_model_on_payload = wrapper_score_model_on_payload
Not for production, but anyway
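A slightly less brittle variant of the same idea (my own tweak, untested): instead of hardcoding the parameter dict, strip only the key that Bedrock rejects and pass everything else through:

# Reuses mutils and orig_score_model_on_payload from the snippet above.
def wrapper_score_model_on_payload(model_uri, payload, eval_parameters=None):
    # Drop only 'top_p'; keep whatever other defaults the metric passed in.
    eval_parameters = {k: v for k, v in (eval_parameters or {}).items() if k != "top_p"}
    return orig_score_model_on_payload(model_uri, payload, eval_parameters)

mutils.score_model_on_payload = wrapper_score_model_on_payload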