error: RagasEvaluatorChain with non-OpenAI models
Describe the bug
Cannot get the evaluation to work with LangChain QA chains when using non-OpenAI models.
Ragas version: 0.0.21
Python version: 3.10.0
Code to Reproduce
# list of metrics we're going to use
metrics = [
    answer_relevancy,
    faithfulness,
    context_precision,
    context_recall,
]
for m in metrics:
    m.__setattr__("llm", bedrock_llm)

answer_relevancy_chain = RagasEvaluatorChain(metric=metrics[0])
eval_result = answer_relevancy_chain(result)
eval_result["answer_relevancy_score"]
Error trace
>>> eval_result = answer_relevancy_chain(result)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/python3.10/site-packages/langchain/chains/base.py", line 312, in __call__
raise e
File "/python3.10/site-packages/langchain/chains/base.py", line 306, in __call__
self._call(inputs, run_manager=run_manager)
File "/python3.10/site-packages/ragas/langchain/evalchain.py", line 70, in _call
score = self.metric.score_single(
File "/python3.10/site-packages/ragas/metrics/base.py", line 101, in score_single
score = self._score_batch(
File "/python3.10/site-packages/ragas/metrics/answer_relevance.py", line 87, in _score_batch
results = self.llm.generate(
File "/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 370, in generate
raise e
File "/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 360, in generate
self._generate_with_cache(
File "/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 514, in _generate_with_cache
return self._generate(
File "/python3.10/site-packages/langchain/chat_models/bedrock.py", line 99, in _generate
prompt = ChatPromptAdapter.convert_messages_to_prompt(
File "/python3.10/site-packages/langchain/chat_models/bedrock.py", line 30, in convert_messages_to_prompt
prompt = convert_messages_to_prompt_anthropic(messages=messages)
File "/python3.10/site-packages/langchain/chat_models/anthropic.py", line 64, in convert_messages_to_prompt_anthropic
text = "".join(
File "/python3.10/site-packages/langchain/chat_models/anthropic.py", line 65, in <genexpr>
_convert_one_message_to_text(message, human_prompt, ai_prompt)
File "/python3.10/site-packages/langchain/chat_models/anthropic.py", line 31, in _convert_one_message_to_text
content = cast(str, message.content)
AttributeError: 'tuple' object has no attribute 'content'
Expected behavior
It should execute all Ragas metrics without raising an error.
Is this working for other metrics, or is everything else failing as well?
I think the RagasEvaluatorChain has fallen out of sync. I'll look into this more.
Thanks! It is failing with all the metrics.
Will fix this, but is this blocking your work right now?
It can wait for a few days. It would be more helpful to add custom prompts to the Ragas framework, especially for the Claude version.
Hi @arm-diaz, thanks for your patience. For support across models like Claude, we recently changed the prompts so that they work fine irrespective of the model. Some of the PRs are yet to be merged, but we would love to get your feedback.
Only answer_relevancy is running okay for me with "pip install git+https://github.com/explodinggradients/ragas.git".
If I install ragas==0.0.21, the code breaks when running the context_recall metric, but the other metrics look fine, though I cannot tell whether there is an error in the results; I would need to analyze more carefully.
I had similar problems with the context_recall metric as described earlier. I may have some pointers as to what may be going wrong. In the definition of the ContextRecall class (in /ragas/metrics/_context_recall.py), the output of the LLM seems to be truncated, so the result is not valid JSON. In addition, Claude seems to prepend the string "Here are the classifications for the sentences in the answer", which breaks the eval(response[0]) call.
I was able to make it work without errors (but with a NaN result for context_recall) by changing the loop at line 120 in the following manner. This is brittle and will probably have other issues with generated content, but some things seem like an oversight, so I'm pointing them out here.
for response in responses:
    # extract the JSON array even when the model adds a preamble around it
    pattern = r"\[\s*\{.*?\}(\s*,\s*\{.*?\})*\s*\]"
    match = re.search(pattern, response[0].replace("\n", ""))
    if match:
        # response = eval(response[0])
        response = eval(match.group(0))
        denom = len(response)
        numerator = sum(
            item.get("Attributed", item.get("attributed")).lower() == "yes"
            for item in response
        )
        scores.append(numerator / denom)
    else:
        scores.append(np.nan)
The existence of the pattern variable seems to point to this being an attempt to pull the JSON out of the response even when the LLM did not return only the JSON as expected.
I also noticed that the JSON sometimes contains the key "Attributed" and sometimes "attributed", so I added a second check.
Finally, in many cases I noticed that the response is truncated, i.e. the LLM stops generating characters before it closes the JSON response.
Steps to repeat my observations:
- Python 3.11.4
- Ragas 0.0.21, LangChain 0.0.348, boto3 1.33.11
- LLM is anthropic-claude-v2 on AWS Bedrock
- Trying to replicate code from https://docs.ragas.io/en/latest/howtos/customisations/aws-bedrock.html
I think the prompt might need to be updated, or BedrockChat.generate (part of LangChain) configured to output more characters or use a stop sequence. Also, my understanding is that Claude is better at outputting formatted XML, so I'm not sure whether that might work better here.
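One way to attempt the former with LangChain, I believe, is to pass Anthropic parameters through model_kwargs on BedrockChat; a rough sketch (the values are illustrative, and I have not verified that this alone fixes the truncation):

```python
# Illustrative only: raise Claude's output budget and/or add a stop sequence
# via BedrockChat's model_kwargs; not a verified fix for the truncated JSON.
from langchain.chat_models import BedrockChat

bedrock_llm = BedrockChat(
    model_id="anthropic.claude-v2",
    region_name="us-east-1",  # placeholder
    model_kwargs={
        "max_tokens_to_sample": 2048,  # the LangChain default is small (256, I believe)
        "temperature": 0.0,
        # "stop_sequences": ["\n\nHuman:"],  # optionally add an explicit stop sequence
    },
)
```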
Hi @sujitpal, thanks for sharing your observations. I did some experiments and raised a fix for the JSON formatting issue (tested with Claude) in #364. Can you install ragas from source and try again?
Thank you for the quick turnaround @shahules786, much appreciated. With ragas installed from source, context_recall runs without error and returns a floating-point number (not a NaN). However, faithfulness and answer_relevancy are now broken, at least with my setup as detailed in my previous message. The other two metrics, context_precision and harmfulness, also return correctly.
FWIW, the errors I get from faithfulness are as follows:
File ~/Projects/apollo-prototype/external/ragas/src/ragas/metrics/_faithfulness.py:178, in <genexpr>(.0)
    175 output = json_loader.safe_load(output[0].text, self.llm)
    176 output = output if output else []
    177 faithful_statements = sum(
--> 178     verdict_score_map.get(dict.get("verdict", "").lower(), np.nan)
    179     for dict in output
    180 )
    181 num_statements = len(output)
    182 if num_statements:
AttributeError: 'str' object has no attribute 'get'
and the errors I get from answer_relevancy:
File ~/.virtualenvs/apollo/lib/python3.11/site-packages/langchain/embeddings/bedrock.py:146, in BedrockEmbeddings._embedding_func(self, text)
    144         return response_body.get("embedding")
    145     except Exception as e:
--> 146         raise ValueError(f"Error raised by inference endpoint: {e}")
ValueError: Error raised by inference endpoint: An error occurred (ValidationException) when calling the InvokeModel operation: Malformed input request: expected minLength: 1, actual: 0, please reformat your input and try again
In any case, I am going to try using GPT instead, as I understand it may be better supported. I had chosen Claude-v2 as that is what our application uses, but it may make sense to have it evaluated by something else.
Hi @sujitpal, I just raised a PR to resolve this issue: #374. Sorry for the inconvenience.
Related to Claude, I noticed a strange issue where it randomly ends the generation during longer predictions. Please let me know if you face any such issues. Other than that, I have verified that the metrics are working for Claude.
No worries, and many thanks for looking into it. Unfortunately, it is still failing on answer_relevancy with the following error:
ValueError: Error raised by inference endpoint: An error occurred (ValidationException) when calling the InvokeModel operation: Malformed input request: expected minLength: 1, actual: 0, please reformat your input and try again.
I have attached my test code in case it is useful; it is mostly adapted from the Ragas documentation pages. I was hoping that if this works, I can switch out the fiqa dataset for my own (already available in JSON with minor tweaks), convert it to an HF dataset using load_dataset, and have the beginnings of an automated evaluation framework thanks to Ragas :-).
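For the dataset swap, I am planning something along these lines (the file name and column names are placeholders):

```python
# Rough plan for loading my own eval data as an HF dataset; the file name and
# expected columns are placeholders for whatever my JSON contains after tweaks.
from datasets import load_dataset

my_eval_ds = load_dataset("json", data_files="my_eval_data.json", split="train")
# ragas expects columns along the lines of: question, answer, contexts, ground_truths
```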
WRT the issue you mention, I saw it too, as I mentioned earlier. I have only used Claude extensively for prompting, but I found that if you specify a stop sequence without specifying a max output length, it does not terminate mid-generation. Although, TBH, my prompts were not as large and detailed as the ones used in Ragas. Here is some detail about this in case it is useful. In my prompts, I would end them with something like this (last line of the prompt):
classification: {
and specify "stop_sequence": "}". (I used classification: <response> and "stop_sequence": "</response>" in our application since we observed that Claude handles XML well).
import json
import os

import boto3

claude = boto3.client(
    service_name="bedrock",
    region_name=os.environ["AWS_REGION_NAME"],
    endpoint_url=os.environ["AWS_BEDROCK_ENDPOINT_URL"],
)
request_body = {
    "prompt": prompt,
    "max_tokens_to_sample": 1000,
    "temperature": 0.0,
    "stop_sequences": ["}"],
}
params = {
    "modelId": model,
    "contentType": "application/json",
    "accept": "*/*",
    "body": json.dumps(request_body),
}
response = claude.invoke_model(**params)
response_body = response["body"].read().decode("utf-8")
Then we would add "{" and "}" to the response before calling eval on it. (Again, in our case we used XML, so we would add <response> and </response> and pass the output to an XML parser.)
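Putting it together, the post-processing looked roughly like this (a simplified sketch of our approach, not Ragas code):

```python
import json

# Bedrock wraps Claude's text in a JSON envelope with a "completion" field.
body = json.loads(response_body)
completion = body.get("completion", "")

# Re-add the brace from the prompt suffix ("classification: {") and the one
# consumed by the "}" stop sequence, then parse. We used eval; json.loads
# would be the safer choice if the output is strict JSON.
classification = eval("{" + completion + "}")
```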
@sujitpal I was getting the same error as in the above comment. I changed the BedrockEmbeddings model_id to cohere.embed-english-v3 and it's working. So the issue is with the Amazon Titan embedding endpoint (which is the default).
On a side note, it's worth digging into internally what prompt is being used that causes this behavior with the Titan endpoint.
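For reference, the change is just the model_id on BedrockEmbeddings; a minimal sketch (region is a placeholder):

```python
# Swap the default Titan embedding model for the Cohere model on Bedrock;
# region_name is a placeholder.
from langchain.embeddings import BedrockEmbeddings

bedrock_embeddings = BedrockEmbeddings(
    model_id="cohere.embed-english-v3",
    region_name="us-east-1",
)
# The metrics that need embeddings (e.g. answer_relevancy) would then have to
# use these instead of the default; see the Bedrock how-to linked above.
```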
@pka32 thank you, I will try using the Cohere embedding instead. I was using Claude, but it looks like the ragas code is tuned more for OpenAI's JSON output-formatting skills, and Claude seems somewhat deficient there (it's better with XML, based on my experience).
@sujitpal Oh, actually Anthropic does not even provide embeddings as of now. Check out the "Can Claude do embeddings?" FAQ at https://www.anthropic.com/product. So for now I have found Cohere to be the best alternative on Bedrock. Hopefully Bedrock gets some good open-source alternatives for embeddings.
Good point, sorry for the misunderstanding; I was assuming that it was just the base LLM (Claude in my case) that was returning the error. It definitely makes sense now that it would come up when we ask it to generate embeddings. I will try Cohere as you suggested.