faithfulness metric scoring, PromptValue has no len()
Question
Using an Azure OpenAI endpoint, I've been able to define the Azure model as in https://github.com/explodinggradients/ragas/blob/main/docs/howtos/customisations/azure-openai.ipynb. Q&A with chat completion works; however, when trying to evaluate faithfulness I get an error from a langchain-core module that I can't quite figure out. Any help is appreciated.
Code Examples
print(ragas_ds["eval"].features)
{'question': Value(dtype='string', id=None),
'answer': Value(dtype='string', id=None),
'ground_truths': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
# to be honest, I don't understand why these lengths are displayed as -1; ragas_ds["eval"]["ground_truths"] is not empty but a list[list] with about five lines of text in a single entry
'contexts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
'expected_contexts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
# Calculate the metric
eval["answer_context_faithfulness"] = faithfulness.score(ragas_ds["eval"])["faithfulness"][0]
Traceback (most recent call last):
File "/masked_part_of_file_path_for_security_reasons/./tests/evaluate_chat.py", line 333, in <module>
eval["answer_context_faithfulness"] = faithfulness.score(ragas_ds["eval"])[
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/site-packages/ragas/metrics/base.py", line 74, in score
raise e
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/site-packages/ragas/metrics/base.py", line 68, in score
score = asyncio.run(
^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/site-packages/ragas/metrics/_faithfulness.py", line 180, in _ascore
answer_result = await self.llm.generate(
^^^^^^^^^^^^^^^^^^
File "/anaconda/envs/py311_gpt_evaluate/lib/python3.11/site-packages/langchain_core/language_models/chat_models.py", line 392, in generate
batch_size=len(messages),
^^^^^^^^^^^^^
TypeError: object of type 'PromptValue' has no len()
Additional context
ragas v0.1.0 (updating from v0.0.18 or 0.0.19; yeah, it's been a while since I was able to work on this...)
Hey @kpeters, @jjmachan will be looking into this issue shortly, but in the meantime you can easily use the same metrics with the evaluate function, as shown here: https://docs.ragas.io/en/stable/getstarted/evaluation.html
Now, with evaluate it's also easy to pass any custom llm / embeddings, as shown here: https://docs.ragas.io/en/stable/howtos/customisations/index.html
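For the Azure setup from the original question, here is a minimal sketch of that evaluate-based approach. It assumes the 0.1 evaluate API as described in the linked customisation docs; the deployment names are placeholders and the endpoint/API key are assumed to come from environment variables.

from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness

# placeholder Azure deployments; azure_endpoint, api_version and api_key are read from env vars
azure_llm = AzureChatOpenAI(azure_deployment="your-chat-deployment")
azure_embeddings = AzureOpenAIEmbeddings(azure_deployment="your-embedding-deployment")

# evaluate() runs the metrics over a datasets.Dataset with
# question / answer / contexts / ground_truths columns
result = evaluate(
    ragas_ds["eval"],
    metrics=[faithfulness],
    llm=azure_llm,
    embeddings=azure_embeddings,
)
print(result)  # aggregate score per metric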
Hey, it seems to me from reading your code that it is not adapted to the 0.1 version. If you need help with this, feel free to schedule a call here. @kpeters
Thanks @shahules786, I'll have another look at it, but will get in touch if I need more help.
It seems that the evaluate function is better suited to evaluating an entire dataset. What if I just want to score a single row? Should we just create a dataset with 1 row?
Thanks Shahules, rewriting with the evaluate function worked, although it does feel a bit bloated to wrap a single metric in the evaluate function.
@bryan-agicap that's what I'm currently doing, creating a dataset with 1 row
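For reference, a minimal sketch of that one-row workaround, assuming the 0.1 evaluate API; the example strings are placeholders.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# wrap a single record in a datasets.Dataset so evaluate() accepts it;
# contexts and ground_truths are lists of strings per row
one_row = Dataset.from_dict({
    "question": ["placeholder question"],
    "answer": ["placeholder answer"],
    "contexts": [["placeholder retrieved context"]],
    "ground_truths": [["placeholder reference answer"]],
})

# uses the default OpenAI LLM unless llm= / embeddings= are passed
result = evaluate(one_row, metrics=[faithfulness])
print(result)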
Hey @bryan-agicap and @kpeters, you could use the score() or ascore() function that is available for each metric.
I will document this shortly as I'm working on revamping the documentation, but here is the code section that defines the interface: https://github.com/explodinggradients/ragas/blob/f6a932ad5bb7998bb5632c5dd60db0aa3b13ea65/src/ragas/metrics/base.py#L63-L95
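Based on that snippet, a minimal sketch of calling a metric directly. It assumes the metric's llm attribute needs a ragas wrapper such as LangchainLLMWrapper rather than a raw LangChain model; the wrapper choice and the row keys are my assumptions, not spelled out in the linked code.

import asyncio

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness

# wrap the LangChain model so the metric receives a ragas-compatible LLM (assumption)
faithfulness.llm = LangchainLLMWrapper(ChatOpenAI())

row = {
    "question": "What are the implications of the new policy?",
    "contexts": ["The new policy could change economic conditions."],
    "answer": "The policy will improve the economy.",
}

score = faithfulness.score(row)                    # synchronous wrapper around _ascore()
# score = asyncio.run(faithfulness.ascore(row))    # async variant
print(score)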
let me know if you need any help
Amazing, thank you!
I'm seeing this same issue. It would be great to be able to easily call each metric and then compute the aggregate score with a different method. I'm especially interested in this because I want to see bad scores per row.
Minimal code to repro:
from ragas.metrics import faithfulness
from langchain_openai import ChatOpenAI

# assigning a raw LangChain chat model directly to the metric
faithfulness.llm = ChatOpenAI()

question = "What are the implications of the new policy?"
contexts = ["The new policy could change economic conditions."]
answer = "The policy will improve the economy."
ground_truths = ["The policy will improve the economy."]

# this call raises TypeError: object of type 'PromptValue' has no len()
faithfulness.score(row={"question": question, "contexts": contexts, "answer": answer, "ground_truths": ground_truths})
edit - for context: I'm trying to build an example for weave
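A hedged reading of the repro: in 0.1 the PromptValue error appears to be triggered by assigning the raw ChatOpenAI to faithfulness.llm. To also get per-row scores, one alternative is to run evaluate on the same data and inspect the resulting dataframe; passing the LangChain model via llm= relies on the customisation docs linked above, and the to_pandas() layout is an assumption.

from datasets import Dataset
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness

data = Dataset.from_dict({
    "question": ["What are the implications of the new policy?"],
    "contexts": [["The new policy could change economic conditions."]],
    "answer": ["The policy will improve the economy."],
    "ground_truths": [["The policy will improve the economy."]],
})

# evaluate() wires the LLM into the metric itself, so no manual .llm assignment is needed
result = evaluate(data, metrics=[faithfulness], llm=ChatOpenAI())
print(result)               # aggregate score
print(result.to_pandas())   # per-row scores, useful for spotting bad rows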