faithfulness metric scoring, PromptValue has no len()
Question
Using an Azure OpenAI endpoint, I've been able to define the Azure model as in https://github.com/explodinggradients/ragas/blob/main/docs/howtos/customisations/azure-openai.ipynb. Q&A with chat completion works; however, when trying to evaluate faithfulness I get an error from a langchain-core module that I can't quite figure out. Any help is appreciated.
Code Examples
print(ragas_ds["eval"].features)
{'question': Value(dtype='string', id=None),
'answer': Value(dtype='string', id=None),
'ground_truths': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
# to be honest, I don't understand why these lengths are displayed as -1; ragas_ds["eval"]["ground_truths"] is not empty but a list[list] with about five lines of text in a single entry
'contexts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
'expected_contexts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
# Calculate the metric
eval["answer_context_faithfulness"] = faithfulness.score(ragas_ds["eval"])["faithfulness"][0]
Traceback (most recent call last):
File "/masked_part_of_file_path_for_security_reasons/./tests/evaluate_chat.py", line 333, in <module>
eval["answer_context_faithfulness"] = faithfulness.score(ragas_ds["eval"])[
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/site-packages/ragas/metrics/base.py", line 74, in score
raise e
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/site-packages/ragas/metrics/base.py", line 68, in score
score = asyncio.run(
^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/site-packages/ragas/metrics/_faithfulness.py", line 180, in _ascore
answer_result = await self.llm.generate(
^^^^^^^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 392, in generate
batch_size=len(messages),
^^^^^^^^^^^^^
TypeError: object of type 'PromptValue' has no len()
Additional context
ragas v0.1.0 (updating from v0.0.18 or 0.0.19; yeah, it's been a while since I was able to work on this...)
Hey @kpeters, @jjmachan will be looking into this issue shortly, but in the meantime you can easily use the same metrics with the evaluate function, as shown here: https://docs.ragas.io/en/stable/getstarted/evaluation.html
Now, with evaluate it's also easy to pass any custom llm / embeddings, as shown here: https://docs.ragas.io/en/stable/howtos/customisations/index.html
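For the Azure setup from the original question, here is a minimal sketch of that evaluate-based approach. It assumes the 0.1 evaluate API as described in the linked customisation docs; the deployment names are placeholders and the endpoint/API key are assumed to come from environment variables.

from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness

# placeholder Azure deployments; azure_endpoint, api_version and api_key are read from env vars
azure_llm = AzureChatOpenAI(azure_deployment="your-chat-deployment")
azure_embeddings = AzureOpenAIEmbeddings(azure_deployment="your-embedding-deployment")

# evaluate() runs the metrics over a datasets.Dataset with
# question / answer / contexts / ground_truths columns
result = evaluate(
    ragas_ds["eval"],
    metrics=[faithfulness],
    llm=azure_llm,
    embeddings=azure_embeddings,
)
print(result)  # aggregate score per metric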
Hey, it seems to me from reading your code that it is not adapted to the 0.1 version. If you need help with this, feel free to schedule a call here. @kpeters
Thanks @shahules786, I'll have another look at it, but will get in touch if I need more help.
It seems that the evaluate function is better suited to evaluating an entire dataset. What if I just want to score a single row? Should we just create a dataset with 1 row?
Thanks Shahules, rewriting with the evaluate function worked, although it does feel a bit bloated to wrap a single metric in the evaluate function.
@bryan-agicap that's what I'm currently doing, creating a dataset with 1 row
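For reference, a minimal sketch of that one-row workaround, assuming the 0.1 evaluate API; the example strings are placeholders.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# wrap a single record in a datasets.Dataset so evaluate() accepts it;
# contexts and ground_truths are lists of strings per row
one_row = Dataset.from_dict({
    "question": ["placeholder question"],
    "answer": ["placeholder answer"],
    "contexts": [["placeholder retrieved context"]],
    "ground_truths": [["placeholder reference answer"]],
})

# uses the default OpenAI LLM unless llm= / embeddings= are passed
result = evaluate(one_row, metrics=[faithfulness])
print(result)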
Hey @bryan-agicap and @kpeters, you could use the score() or ascore() function that is available for each metric.
I will document this shortly as I'm working on revamping the documentation, but here is the code section that defines the interface: https://github.com/explodinggradients/ragas/blob/f6a932ad5bb7998bb5632c5dd60db0aa3b13ea65/src/ragas/metrics/base.py#L63-L95
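Based on that snippet, a minimal sketch of calling a metric directly. It assumes the metric's llm attribute needs a ragas wrapper such as LangchainLLMWrapper rather than a raw LangChain model; the wrapper choice and the row keys are my assumptions, not spelled out in the linked code.

import asyncio

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# wrap the LangChain model so the metric receives a ragas-compatible LLM (assumption)
faithfulness.llm = LangchainLLMWrapper(ChatOpenAI())

row = {
    "question": "What are the implications of the new policy?",
    "contexts": ["The new policy could change economic conditions."],
    "answer": "The policy will improve the economy.",
}

score = faithfulness.score(row)                    # synchronous wrapper around _ascore()
# score = asyncio.run(faithfulness.ascore(row))    # async variant
print(score)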
let me know if you need any help
Amazing, thank you!
I'm seeing this same issue. It would be great to be able to easily call each metric and then compute the aggregate score with a different method. I'm especially interested in this because I want to see bad scores per row.
Minimal code to repro:
from ragas.metrics import faithfulness
from langchain_openai import ChatOpenAI

# assigning a raw LangChain chat model directly to the metric
faithfulness.llm = ChatOpenAI()

question = "What are the implications of the new policy?"
contexts = ["The new policy could change economic conditions."]
answer = "The policy will improve the economy."
ground_truths = ["The policy will improve the economy."]

# this call raises TypeError: object of type 'PromptValue' has no len()
faithfulness.score(row={"question": question, "contexts": contexts, "answer": answer, "ground_truths": ground_truths})
edit - for context: I'm trying to build an example for weave
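A hedged reading of the repro: in 0.1 the PromptValue error appears to be triggered by assigning the raw ChatOpenAI to faithfulness.llm. To also get per-row scores, one alternative is to run evaluate on the same data and inspect the resulting dataframe; passing the LangChain model via llm= relies on the customisation docs linked above, and the to_pandas() layout is an assumption.

from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness

data = Dataset.from_dict({
    "question": ["What are the implications of the new policy?"],
    "contexts": [["The new policy could change economic conditions."]],
    "answer": ["The policy will improve the economy."],
    "ground_truths": [["The policy will improve the economy."]],
})

# evaluate() wires the LLM into the metric itself, so no manual .llm assignment is needed
result = evaluate(data, metrics=[faithfulness], llm=ChatOpenAI())
print(result)               # aggregate score
print(result.to_pandas())   # per-row scores, useful for spotting bad rows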