error: RagasEvaluatorChain with non-OpenAI models
Describe the bug
Cannot get the evaluation to work with LangChain QA chains when using non-OpenAI models.
Ragas version: 0.0.21
Python version: 3.10.0
Code to Reproduce
# list of metrics we're going to use
metrics = [
    answer_relevancy,
    faithfulness,
    context_precision,
    context_recall,
]
for m in metrics:
    m.__setattr__("llm", bedrock_llm)

answer_relevancy_chain = RagasEvaluatorChain(metric=metrics[0])
eval_result = answer_relevancy_chain(result)
eval_result["answer_relevancy_score"]
Error trace
>>> eval_result = answer_relevancy_chain(result)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/python3.10/site-packages/langchain/chains/base.py", line 312, in __call__
raise e
File "/python3.10/site-packages/langchain/chains/base.py", line 306, in __call__
self._call(inputs, run_manager=run_manager)
File "/python3.10/site-packages/ragas/langchain/evalchain.py", line 70, in _call
score = self.metric.score_single(
File "/python3.10/site-packages/ragas/metrics/base.py", line 101, in score_single
score = self._score_batch(
File "/python3.10/site-packages/ragas/metrics/answer_relevance.py", line 87, in _score_batch
results = self.llm.generate(
File "/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 370, in generate
raise e
File "/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 360, in generate
self._generate_with_cache(
File "/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 514, in _generate_with_cache
return self._generate(
File "/python3.10/site-packages/langchain/chat_models/bedrock.py", line 99, in _generate
prompt = ChatPromptAdapter.convert_messages_to_prompt(
File "/python3.10/site-packages/langchain/chat_models/bedrock.py", line 30, in convert_messages_to_prompt
prompt = convert_messages_to_prompt_anthropic(messages=messages)
File "/python3.10/site-packages/langchain/chat_models/anthropic.py", line 64, in convert_messages_to_prompt_anthropic
text = "".join(
File "/python3.10/site-packages/langchain/chat_models/anthropic.py", line 65, in <genexpr>
_convert_one_message_to_text(message, human_prompt, ai_prompt)
File "/python3.10/site-packages/langchain/chat_models/anthropic.py", line 31, in _convert_one_message_to_text
content = cast(str, message.content)
AttributeError: 'tuple' object has no attribute 'content'
Expected behavior
It should execute all Ragas metrics without raising an error.
Is this working for other metrics, or is everything else failing as well?
I think the RagasEvaluatorChain has fallen out of sync. I'll look into this more.
Thanks! It is failing with all the metrics.
Will fix this, but is this blocking your work right now?
It can wait for a few days. It would be more helpful to add custom prompts to the Ragas framework, especially for the Claude version.
Hi @arm-diaz, thanks for your patience. For support across models like Claude, we recently changed the prompts so that they work fine irrespective of the model. Some of the PRs are yet to be merged, but we would love to get your feedback.
Only answer_relevancy is running okay for me with "pip install git+https://github.com/explodinggradients/ragas.git".
If I install ragas==0.0.21, the code breaks when running the context_recall metric, but the other metrics look fine, though I cannot tell whether there is an error in the results; I would need to analyze more carefully.
I had similar problems with the context_recall metric as described earlier. I may have some pointers as to what may be going wrong. In the definition of the ContextRecall class (in /ragas/metrics/_context_recall.py), the output of the LLM seems to be truncated, so the result is not valid JSON. In addition, Claude seems to prepend the string "Here are the classifications for the sentences in the answer", which breaks the eval(response[0]) call.
I was able to make it work without errors (but with a NaN result for context_recall) by changing the loop at line 120 in the following manner. This is brittle and will probably have other issues with generated content, but some things seem like an oversight, so I'm pointing them out here.
for response in responses:
    # extract the JSON array even when the model adds a preamble around it
    pattern = r"\[\s*\{.*?\}(\s*,\s*\{.*?\})*\s*\]"
    match = re.search(pattern, response[0].replace("\n", ""))
    if match:
        # response = eval(response[0])
        response = eval(match.group(0))
        denom = len(response)
        numerator = sum(
            item.get("Attributed", item.get("attributed")).lower() == "yes"
            for item in response
        )
        scores.append(numerator / denom)
    else:
        scores.append(np.nan)
The existence of the pattern variable seems to point to this being an attempt to pull the JSON out of the response even when the LLM did not return only the JSON as expected.
I also noticed that the JSON sometimes contains the key "Attributed" and sometimes "attributed", so I added a second check.
Finally, in many cases I noticed that the response is truncated, i.e. the LLM stops generating characters before it closes the JSON response.
Steps to repeat my observations:
- Python 3.11.4
- Ragas 0.0.21, LangChain 0.0.348, boto3 1.33.11
- LLM is anthropic-claude-v2 on AWS Bedrock
- Trying to replicate code from https://docs.ragas.io/en/latest/howtos/customisations/aws-bedrock.html
I think the prompt might need to be updated, or BedrockChat.generate (part of LangChain) configured to output more characters or use a stop sequence. Also, my understanding is that Claude is better at outputting formatted XML, so I'm not sure whether that might work better here.
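One way to attempt the former with LangChain, I believe, is to pass Anthropic parameters through model_kwargs on BedrockChat; a rough sketch (the values are illustrative, and I have not verified that this alone fixes the truncation):

```python
# Illustrative only: raise Claude's output budget and/or add a stop sequence
# via BedrockChat's model_kwargs; not a verified fix for the truncated JSON.
from langchain.chat_models import BedrockChat

bedrock_llm = BedrockChat(
    model_id="anthropic.claude-v2",
    region_name="us-east-1",  # placeholder
    model_kwargs={
        "max_tokens_to_sample": 2048,  # the LangChain default is small (256, I believe)
        "temperature": 0.0,
        # "stop_sequences": ["\n\nHuman:"],  # optionally add an explicit stop sequence
    },
)
```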
Hi @sujitpal, thanks for sharing your observations. I did some experiments and raised a fix for the JSON formatting issue (tested with Claude) in #364. Can you install ragas from source and try again?
Thank you for the quick turnaround @shahules786, much appreciated. With ragas installed from source, context_recall runs without error and returns a floating-point number (not a NaN). However, faithfulness and answer_relevancy are now broken, at least with my setup as detailed in my previous message. The other two metrics, context_precision and harmfulness, also return correctly.
FWIW, the errors I get from faithfulness are as follows:
File ~/Projects/apollo-prototype/external/ragas/src/ragas/metrics/_faithfulness.py:178, in <genexpr>(.0)
    175 output = json_loader.safe_load(output[0].text, self.llm)
    176 output = output if output else []
    177 faithful_statements = sum(
--> 178     verdict_score_map.get(dict.get("verdict", "").lower(), np.nan)
    179     for dict in output
    180 )
    181 num_statements = len(output)
    182 if num_statements:
AttributeError: 'str' object has no attribute 'get'
and the errors I get from answer_relevancy:
File ~/.virtualenvs/apollo/lib/python3.11/site-packages/langchain/embeddings/bedrock.py:146, in BedrockEmbeddings._embedding_func(self, text)
    144         return response_body.get("embedding")
    145     except Exception as e:
--> 146         raise ValueError(f"Error raised by inference endpoint: {e}")
ValueError: Error raised by inference endpoint: An error occurred (ValidationException) when calling the InvokeModel operation: Malformed input request: expected minLength: 1, actual: 0, please reformat your input and try again
In any case, I am going to try using GPT instead, as I understand it may be better supported. I had chosen Claude-v2 as that is what our application uses, but it may make sense to have it evaluated by something else.
Hi @sujitpal, I just raised a PR to resolve this issue: #374. Sorry for the inconvenience.
Related to Claude, I noticed a strange issue where it randomly ends the generation during longer predictions. Please let me know if you face any such issues. Other than that, I have verified that the metrics are working for Claude.
No worries, and many thanks for looking into it. Unfortunately, it is still failing on answer_relevancy with the following error:
ValueError: Error raised by inference endpoint: An error occurred (ValidationException) when calling the InvokeModel operation: Malformed input request: expected minLength: 1, actual: 0, please reformat your input and try again.
I have attached my test code in case it is useful; it is mostly adapted from the Ragas documentation pages. I was hoping that if this works, I can switch out the fiqa dataset for my own (already available in JSON with minor tweaks), convert it to an HF dataset using load_dataset, and have the beginnings of an automated evaluation framework thanks to Ragas :-).
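For the dataset swap, I am planning something along these lines (the file name and column names are placeholders):

```python
# Rough plan for loading my own eval data as an HF dataset; the file name and
# expected columns are placeholders for whatever my JSON contains after tweaks.
from datasets import load_dataset

my_eval_ds = load_dataset("json", data_files="my_eval_data.json", split="train")
# ragas expects columns along the lines of: question, answer, contexts, ground_truths
```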
WRT the issue you mention, I saw it too, as I mentioned earlier. I have only used Claude extensively for prompting, but I found that if you specify a stop sequence without specifying a max output length, it does not terminate mid-generation. Although, TBH, my prompts were not as large and detailed as the ones used in Ragas. Here is some detail about this in case it is useful. In my prompts, I would end them with something like this (last line of the prompt):
classification: {
and specify "stop_sequence": "}". (I used classification: <response> and "stop_sequence": "</response>" in our application since we observed that Claude handles XML well).
import json
import os

import boto3

claude = boto3.client(
    service_name="bedrock",
    region_name=os.environ["AWS_REGION_NAME"],
    endpoint_url=os.environ["AWS_BEDROCK_ENDPOINT_URL"],
)
request_body = {
    "prompt": prompt,
    "max_tokens_to_sample": 1000,
    "temperature": 0.0,
    "stop_sequences": ["}"],
}
params = {
    "modelId": model,
    "contentType": "application/json",
    "accept": "*/*",
    "body": json.dumps(request_body),
}
response = claude.invoke_model(**params)
response_body = response["body"].read().decode("utf-8")
Then we would add "{" and "}" to the response before calling eval on it. (Again, in our case we used XML, so we would add <response> and </response> and pass the output to an XML parser.)
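Putting it together, the post-processing looked roughly like this (a simplified sketch of our approach, not Ragas code):

```python
import json

# Bedrock wraps Claude's text in a JSON envelope with a "completion" field.
body = json.loads(response_body)
completion = body.get("completion", "")

# Re-add the brace from the prompt suffix ("classification: {") and the one
# consumed by the "}" stop sequence, then parse. We used eval; json.loads
# would be the safer choice if the output is strict JSON.
classification = eval("{" + completion + "}")
```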
@sujitpal I was getting the same error as in the above comment. I changed the BedrockEmbeddings model_id to cohere.embed-english-v3 and it's working. So the issue is with the Amazon Titan embedding endpoint (which is the default).
On a side note, it's worth digging into internally what prompt is being used that causes this behavior with the Titan endpoint.
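For reference, the change is just the model_id on BedrockEmbeddings; a minimal sketch (region is a placeholder):

```python
# Swap the default Titan embedding model for the Cohere model on Bedrock;
# region_name is a placeholder.
from langchain.embeddings import BedrockEmbeddings

bedrock_embeddings = BedrockEmbeddings(
    model_id="cohere.embed-english-v3",
    region_name="us-east-1",
)
# The metrics that need embeddings (e.g. answer_relevancy) would then have to
# use these instead of the default; see the Bedrock how-to linked above.
```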
@pka32 thank you, I will try using the Cohere embedding instead. I was using Claude, but it looks like the ragas code is tuned more for OpenAI's JSON output-formatting skills, and Claude seems somewhat deficient there (it's better with XML, based on my experience).
@sujitpal Oh, actually Anthropic does not even provide embeddings as of now. Check out the "Can Claude do embeddings?" FAQ at https://www.anthropic.com/product. So for now I have found Cohere to be the best alternative on Bedrock. Hopefully Bedrock gets some good open-source alternatives for embeddings.
Good point, sorry for the misunderstanding; I was assuming that it was just the base LLM (Claude in my case) that was returning the error. It definitely makes sense now that it would come up when we ask it to generate embeddings. I will try Cohere as you suggested.