
[BUG] When using Claude as an LLM-as-a-judge model via Bedrock, the response is NaN

Open daanalfa opened this issue 11 months ago • 8 comments

Issues Policy acknowledgement

  • [X] I have read and agree to submit bug reports in accordance with the issues policy

Where did you encounter this bug?

Local machine

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

  • Client: 2.11.0
  • Tracking server: 2.11.0

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 14.3.1
  • Python version: 3.10.0
  • yarn version, if running the dev UI:

Describe the problem

I am trying to use a custom LLM as the judge model, but using endpoints:/completions doesn't work: the resulting metrics are NaN.

We get the following error when using any of the default metrics defined under mlflow.metrics.genai.metric_definitions:

'Failed to score model on payload. Error: 422 Client Error: Unprocessable Entity for url: http://127.0.0.1:5000/endpoints/completions/invocations. Response text: {"detail":"Cannot set both 'temperature' and 'top_p' parameters. Please use only the temperature parameter for your query."}'

I believe this is due to mlflow.metrics.genai.prompts.v1.default_parameters being hardcoded as:

default_parameters = {
    "temperature": 0.0,
    "max_tokens": 200,
    "top_p": 1.0,
}

This does not seem to be compatible with Claude models provided via AWS Bedrock. Changing it to the following does work:

default_parameters = {
    "temperature": 0.0,
    "max_tokens": 200,
}

However, there is currently no friendly way to modify these default parameters at runtime through the API when initialising a metric function under mlflow.metrics.genai.metric_definitions.

Making this possible, e.g. by adding a parameters argument to the API as follows, could fix this:

from mlflow.metrics.genai import relevance
relevance_metric = relevance(model="endpoints:/completions", parameters={"temperature": 0.0})
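In the meantime, if I read the code correctly, make_genai_metric already accepts a parameters argument, so a custom judge metric with explicit parameters could serve as an interim workaround. A rough sketch (the definition and grading prompt below are only illustrative placeholders, not the built-in relevance prompt):

from mlflow.metrics.genai import make_genai_metric

# Sketch only: definition/grading_prompt are placeholders, not MLflow's built-in relevance prompt
relevance_like_metric = make_genai_metric(
    name="relevance",
    definition="How relevant is the output to the question, given the provided context?",
    grading_prompt="Give a score from 1 to 5, where 5 means the output fully answers the question using the context.",
    model="endpoints:/completions",
    grading_context_columns=["context"],
    parameters={"temperature": 0.0, "max_tokens": 200},  # no 'top_p', so the Bedrock/Claude endpoint accepts it
    aggregations=["mean"],
    greater_is_better=True,
)

This loses the curated built-in grading prompt, which is why a parameters argument on the prebuilt metric factories would still be preferable.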

I've set up my config.yaml as follows:

endpoints:
  - name: completions
    endpoint_type: llm/v1/completions
    model:
      provider: bedrock
      name: anthropic.claude-v2
      config:
        aws_config: 
          aws_secret_access_key: <redacted>
          aws_access_key_id: <redacted>
          aws_region: us-east-1
    limit:
      renewal_period: minute
      calls: 10

Deployed the server using:

mlflow deployments start-server --config-path config.yaml
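
As a sanity check that the gateway endpoint itself is configured correctly, it should be possible to query it directly with only temperature set (no top_p). A minimal sketch, assuming the standard deployments client API and the server address above:

from mlflow.deployments import get_deploy_client

client = get_deploy_client("http://127.0.0.1:5000")
# Query the completions endpoint directly; only 'temperature' is set, no 'top_p'
response = client.predict(
    endpoint="completions",
    inputs={"prompt": "Say hello.", "temperature": 0.0, "max_tokens": 50},
)
print(response)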

Tracking information


Code to reproduce issue

import mlflow
from mlflow.deployments import set_deployments_target
import pandas as pd
from mlflow.metrics.genai import relevance

set_deployments_target("http://127.0.0.1:5000/")


from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.llms.bedrock import Bedrock


template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

llm = Bedrock(
    region_name="us-east-1",
    model_kwargs={
        "max_tokens_to_sample": 300,
        "temperature": 0.1,
        "top_k": 250,
        "top_p": 0.1,
    },
    model_id="anthropic.claude-v2:1",
    streaming=False,
)
chain = (
    RunnablePassthrough()
    | prompt 
    | llm
    | StrOutputParser()
)

def model(input_df):
    # Answer each evaluation question with the RAG chain, row by row
    answers = []
    for index, row in input_df.iterrows():
        answer = chain.invoke({"question": row["questions"], "context": row["context"]})
        answers.append(answer)
    return answers
    
eval_df = pd.DataFrame(
    {
        "questions": [
            "How does useEffect() work?",
            "What does the static keyword in a function mean?",
            "What does the 'finally' block in Python do?",
            "What is the difference between multiprocessing and multithreading?",
        ],
        "context": [
            "The useEffect() hook tells React that your component needs to do something after render. React will remember the function you passed (we’ll refer to it as our “effect”), and call it later after performing the DOM updates.",
            "Static members belongs to the class, rather than a specific instance. This means that only one instance of a static member exists, even if you create multiple objects of the class, or if you don't create any. It will be shared by all objects.",
            "'Finally' defines a block of code to run when the try... except...else block is final. The finally block will be executed no matter if the try block raises an error or not.",
            "Multithreading refers to the ability of a processor to execute multiple threads concurrently, where each thread runs a process. Whereas multiprocessing refers to the ability of a system to run multiple processors in parallel, where each processor can run one or more threads.",
        ],
        "ground_truth": [
            "The useEffect() hook tells React that your component needs to do something after render. React will remember the function you passed (we’ll refer to it as our “effect”), and call it later after performing the DOM updates.",
            "Static members belongs to the class, rather than a specific instance. This means that only one instance of a static member exists, even if you create multiple objects of the class, or if you don't create any. It will be shared by all objects.",
            "'Finally' defines a block of code to run when the try... except...else block is final. The finally block will be executed no matter if the try block raises an error or not.",
            "Multithreading refers to the ability of a processor to execute multiple threads concurrently, where each thread runs a process. Whereas multiprocessing refers to the ability of a system to run multiple processors in parallel, where each processor can run one or more threads.",
        ],
    }
)

MODEL = "endpoints:/completions"

relevance_metric = relevance(model=MODEL)

results = mlflow.evaluate(
    model,
    eval_df,
    model_type="question-answering",
    evaluators="default",
    predictions="result",
    extra_metrics=[relevance_metric, mlflow.metrics.latency()],
    evaluator_config={
        "col_mapping": {
            "inputs": "questions",
            "context": "context",
        }
    },
)

Stack trace

> results.tables["eval_results_table"]["relevance/v1/score"][0]
nan

> results.tables["eval_results_table"]["relevance/v1/justification"][0]
'Failed to score model on payload. Error: 422 Client Error: Unprocessable Entity for url: http://127.0.0.1:5000/endpoints/completions/invocations. Response text: {"detail":"Cannot set both \'temperature\' and \'top_p\' parameters. Please use only the temperature parameter for your query."}'

Other info / logs

What component(s) does this bug affect?

  • [ ] area/artifacts: Artifact stores and artifact logging
  • [ ] area/build: Build and test infrastructure for MLflow
  • [X] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • [ ] area/docs: MLflow documentation pages
  • [X] area/examples: Example code
  • [ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • [ ] area/models: MLmodel format, model serialization/deserialization, flavors
  • [ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • [ ] area/projects: MLproject format, project running backends
  • [X] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • [ ] area/server-infra: MLflow Tracking server backend
  • [ ] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • [ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • [ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • [ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • [ ] area/windows: Windows support

What language(s) does this bug affect?

  • [ ] language/r: R APIs and clients
  • [ ] language/java: Java APIs and clients
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/azure: Azure and Azure ML integrations
  • [ ] integrations/sagemaker: SageMaker integrations
  • [ ] integrations/databricks: Databricks integrations

daanalfa avatar Mar 04 '24 10:03 daanalfa

@daanalfa Thanks for reporting this issue! Great write-up! Adding a parameters argument sounds good to me. @BenWilson2 What do you think?

harupy avatar Mar 05 '24 05:03 harupy

Would the parameters arg update the defaults or replace them?

BenWilson2 avatar Mar 05 '24 22:03 BenWilson2

I think they should replace them; otherwise we can't work around the issue.

daniellok-db avatar Mar 06 '24 00:03 daniellok-db

Right, and that would behave completely differently from how other defaults are overridden with GenAI interfaces in MLflow. The underlying configuration issue that is causing the bug with Cohere should be fixed, and then we could expose parameter update logic as a separate PR to enable customization :)

BenWilson2 avatar Mar 06 '24 00:03 BenWilson2

Right, and that would behave completely differently from how other defaults are overridden with GenAI interfaces in MLflow. The underlying configuration issue that is causing the bug with Cohere should be fixed, and then we could expose parameter update logic as a separate PR to enable customization :)

@BenWilson2 what's the bug with Cohere? Did you mean Claude?

The only way to solve the configuration issue without parameter update logic would be to modify the default parameters for the judge llm, but this would materially change the default judge LLM implementation for OpenAI models.
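
For illustration only, "modifying the default parameters" would amount to something like the sketch below. It reaches into a private module of MLflow 2.11, and whether the change actually propagates depends on how the metric classes reference that dict, so it is not a real fix:

# Illustrative only: mutate the private defaults dict the judge metrics read from
# (internal layout of mlflow 2.11, may change without notice)
import mlflow.metrics.genai.prompts.v1 as v1

v1.default_parameters.pop("top_p", None)  # drop the key the Bedrock/Claude endpoint rejects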

daanalfa avatar Mar 06 '24 09:03 daanalfa

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

github-actions[bot] avatar Mar 12 '24 00:03 github-actions[bot]

Hey everyone, is there an update on this? We have hit the same issue with other Bedrock models as well, so it's not just a problem with Claude (we are using Titan Express for the time being).

revolutionisme avatar Apr 17 '24 15:04 revolutionisme

Hi everyone, I am encountering a similar issue when trying to reference an Amazon Bedrock endpoint as my LLM judge, as described on this MLflow documentation page, regardless of the underlying model provider. Have you had any update on this matter? Many thanks

JahedZ avatar Jun 18 '24 15:06 JahedZ

Hello!

It's also happening for me. An update on this issue (or a workaround in the meantime) would be much appreciated.

Thank you, Laura

laurazpm avatar Jul 02 '24 11:07 laurazpm

Hi, any updates? We have the same issue. You say that Amazon Bedrock is supported here https://mlflow.org/docs/latest/llms/deployments/index.html#supported-provider-models:~:text=Supported%20Provider%20Models but it seems that is not the case :(

livanitskyi avatar Aug 29 '24 08:08 livanitskyi

Hi, any updates? We have the same issue. You say that Amazon Bedrock is supported here https://mlflow.org/docs/latest/llms/deployments/index.html#supported-provider-models:~:text=Supported%20Provider%20Models but it seems that is not the case :(

OK, here is a workaround (very hacky, but working, at least for now):

import mlflow.metrics.genai.model_utils as mutils

orig_score_model_on_payload = mutils.score_model_on_payload

def wrapper_score_model_on_payload(model_uri, payload, eval_parameters=None):
    # Ignore the hardcoded defaults and pass Bedrock-compatible parameters (no 'top_p') instead
    return orig_score_model_on_payload(model_uri, payload, {"temperature": 0.0, "max_tokens": 200})

mutils.score_model_on_payload = wrapper_score_model_on_payload

Not for production, but anyway
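
A slightly less brittle variation on the same idea, assuming score_model_on_payload keeps this three-argument signature, is to preserve whatever parameters the metric passes in and only drop the rejected top_p key:

import mlflow.metrics.genai.model_utils as mutils

_orig_score_model_on_payload = mutils.score_model_on_payload

def _patched_score_model_on_payload(model_uri, payload, eval_parameters=None):
    # Keep the metric's own parameters, but remove 'top_p' so the
    # Bedrock/Anthropic endpoint no longer rejects the request
    params = dict(eval_parameters or {})
    params.pop("top_p", None)
    return _orig_score_model_on_payload(model_uri, payload, params)

mutils.score_model_on_payload = _patched_score_model_on_payload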

livanitskyi avatar Aug 30 '24 07:08 livanitskyi