dd-trace-py
chore(llmobs): implement non-skeleton code for ragas faithfulness
This PR adds the non-boilerplate code for the ragas faithfulness evaluator.
The majority of the LOC changes come from cassettes and requirements files. The main logic is in ddtrace/llmobs/_evaluators/ragas/faithfulness.py.
There are four important features of this PR:
1. Extracting the inputs to a ragas faithfulness eval from a span
A span event must contain data necessary for ragas evaluations - question, context, and answer.
The evaluator tries to extract this data by looking at the span event using the following logic:
question = input.prompt.variables.question OR input.messages[-1].content
context = input.prompt.variables.context
answer = output.messages[-1].content
Relevant tests...
test_ragas_faithfulness_submits_evaluation...
test_ragas_faithfulness_returns_none_if_inputs_extraction_fails
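The extraction rules above can be sketched roughly as follows. This is a hypothetical standalone sketch, not the PR's actual code: the function name and the dict-based span-event shape are assumptions based on the description, and the real evaluator may structure things differently.

```python
# Hypothetical sketch of the input-extraction logic described above.
# Field names follow the description: input.prompt.variables.{question,context},
# input.messages, and output.messages.
def extract_faithfulness_inputs(span_event):
    """Return (question, context, answer) from a span event, or None if any is missing."""
    meta = span_event.get("meta", {})
    span_input = meta.get("input", {})
    span_output = meta.get("output", {})

    prompt_variables = span_input.get("prompt", {}).get("variables", {})
    input_messages = span_input.get("messages", [])
    output_messages = span_output.get("messages", [])

    # question = input.prompt.variables.question OR input.messages[-1].content
    question = prompt_variables.get("question")
    if question is None and input_messages:
        question = input_messages[-1].get("content")

    # context = input.prompt.variables.context
    context = prompt_variables.get("context")

    # answer = output.messages[-1].content
    answer = output_messages[-1].get("content") if output_messages else None

    if not all((question, context, answer)):
        return None  # extraction failed; the evaluator returns None and skips the eval
    return question, context, answer
```

The `None` return corresponds to the `test_ragas_faithfulness_returns_none_if_inputs_extraction_fails` case above.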
2. Ragas faithfulness implementation
See the evaluate function for the underlying Ragas faithfulness implementation.
It roughly follows the original source implementation in the ragas framework.
Relevant tests...
test_ragas_faithfulness_submits_evaluation...
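As background on what the `evaluate` function computes: in the ragas framework, faithfulness decomposes the answer into individual statements, uses an LLM to judge each statement as supported by the context or not, and scores the supported fraction. A minimal sketch of just the final scoring step (the LLM-driven statement extraction and judging are omitted; this is illustrative, not the PR's code):

```python
# Illustrative faithfulness scoring: given one boolean verdict per statement
# extracted from the answer (True = supported by the context), the score is
# the fraction of supported statements.
def faithfulness_score(verdicts):
    if not verdicts:
        return 0.0  # no statements could be extracted from the answer
    return sum(verdicts) / len(verdicts)
```

A fully faithful answer scores 1.0; an answer whose statements are all unsupported by the context scores 0.0.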
3. Tracing RAGAS
Tracing RAGAS is a requirement for the faithfulness evaluator; without dedicated tracing, a user's ML app would be polluted by a bunch of auto-instrumented langchain spans.
- The ml_app of ragas traces should be dd-ragas-{original ml app name}.
- All ragas traces are marked with a runner.integration:ragas tag. This tells us that these traces are evaluation traces from the ragas integration. We can tell a span is a ragas span by looking at its ml_app at trace processing time. We also use this to safeguard against infinite eval loops (enqueuing an LLM span generated from an evaluation back into the evaluator runner).
Relevant tests...
test_ragas_faithfulness_emits_traces
test_llmobs_with_evaluator_runner_does_not_enqueue_evaluation_spans
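The ml_app prefixing and the infinite-loop safeguard described above can be sketched as follows. All names here (the constant and both functions) are hypothetical illustrations of the described behavior, not the PR's actual identifiers:

```python
# Assumed prefix; the description specifies ragas traces get the ml_app
# "dd-ragas-{original ml app name}".
RAGAS_ML_APP_PREFIX = "dd-ragas"


def ragas_ml_app(original_ml_app):
    """ml_app assigned to traces emitted by the ragas evaluator itself."""
    return "{}-{}".format(RAGAS_ML_APP_PREFIX, original_ml_app)


def should_enqueue_for_evaluation(span_ml_app):
    """Safeguard against infinite eval loops: spans generated by a ragas
    evaluation run are recognizable by their ml_app prefix and are never
    re-enqueued to the evaluator runner."""
    return not span_ml_app.startswith(RAGAS_ML_APP_PREFIX)
```

This matches the `test_llmobs_with_evaluator_runner_does_not_enqueue_evaluation_spans` behavior above: evaluation-generated spans are filtered out before enqueueing.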
4. RAGAS Evaluator Setup
- Ragas dependencies (ragas, langchain) are only required if the ragas faithfulness evaluator is configured.
- The ragas evaluator should also always use the most up-to-date faithfulness instance from the ragas library itself, so that a user can customize the LLMs and prompts for faithfulness.
- If an LLM is not set by the user, we use the default LLM given to us by ragas's llm_factory method.
Relevant tests...
test_ragas_faithfulness_disabled_if_dependencies_not_present
test_ragas_evaluator_init
test_ragas_faithfulness_has_modified_faithfulness_instance
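The lazy-dependency setup described above can be sketched as a guarded import at evaluator construction time. This is a minimal sketch under stated assumptions: the class name is hypothetical, and while `ragas.metrics.faithfulness` and `ragas.llms.llm_factory` exist in the ragas library, the exact way the PR wires them together is not shown here.

```python
class RagasFaithfulnessEvaluator:
    """Sketch: ragas/langchain are only imported (and thus only required)
    when this evaluator is actually configured."""

    def __init__(self):
        try:
            # Imported lazily so environments without ragas/langchain
            # can still load the library; the evaluator is then disabled.
            from ragas.llms import llm_factory
            from ragas.metrics import faithfulness
        except ImportError:
            self.enabled = False
            self.faithfulness = None
            return
        self.enabled = True
        # Hold a reference to ragas's own faithfulness instance so any
        # user customization of its LLM/prompts is picked up.
        self.faithfulness = faithfulness
        # If the user has not set an LLM on the faithfulness instance,
        # llm_factory() would supply ragas's default LLM at eval time.
        self.llm_factory = llm_factory
```

In an environment without ragas installed, construction succeeds but the evaluator reports itself disabled, matching `test_ragas_faithfulness_disabled_if_dependencies_not_present` above.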
Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the library release note guidelines
- The change includes or references documentation updates if necessary
- Backport labels are set (if applicable)
Reviewer Checklist
- [ ] Reviewer has checked that all the criteria below are met
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking API changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the release branch maintenance policy
CODEOWNERS have been resolved as:
.riot/requirements/12c5529.txt @DataDog/apm-python
.riot/requirements/146f2d8.txt @DataDog/apm-python
.riot/requirements/1687eab.txt @DataDog/apm-python
.riot/requirements/4102ef5.txt @DataDog/apm-python
.riot/requirements/771848b.txt @DataDog/apm-python
ddtrace/llmobs/_evaluators/ragas/models.py @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_evaluator_runner.test_evaluator_runner_periodic_enqueues_eval_metric.yaml @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_evaluator_runner.test_evaluator_runner_timed_enqueues_eval_metric.yaml @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_ragas_faithfulness_evaluator.emits_traces_and_evaluations_on_exit.yaml @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_ragas_faithfulness_evaluator.test_ragas_faithfulness_emits_traces.yaml @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_ragas_faithfulness_evaluator.test_ragas_faithfulness_submits_evaluation.yaml @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_ragas_faithfulness_evaluator.test_ragas_faithfulness_submits_evaluation_on_span_with_question_in_messages.yaml @DataDog/ml-observability
tests/llmobs/test_llmobs_ragas_faithfulness_evaluator.py @DataDog/ml-observability
ddtrace/llmobs/_constants.py @DataDog/ml-observability
ddtrace/llmobs/_evaluators/ragas/faithfulness.py @DataDog/ml-observability
ddtrace/llmobs/_evaluators/runner.py @DataDog/ml-observability
ddtrace/llmobs/_evaluators/sampler.py @DataDog/ml-observability
ddtrace/llmobs/_llmobs.py @DataDog/ml-observability
ddtrace/llmobs/_trace_processor.py @DataDog/ml-observability
riotfile.py @DataDog/apm-python
tests/llmobs/_utils.py @DataDog/ml-observability
tests/llmobs/conftest.py @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_evaluator_runner.send_score_metric.yaml @DataDog/ml-observability
tests/llmobs/test_llmobs_evaluator_runner.py @DataDog/ml-observability
tests/llmobs/test_llmobs_service.py @DataDog/ml-observability
tests/llmobs/test_llmobs_trace_processor.py @DataDog/ml-observability
Benchmarks
Benchmark execution time: 2024-10-30 15:46:17
Comparing candidate commit 23753ecf9daf741f4281618c381e68d6070fe47c in PR branch evan.li/ragas-faithfulness with baseline commit f3b5275b8a6be8d3b7c8ab1e4e40ed96cb9aa386 in branch main.
Found 0 performance improvements and 0 performance regressions! Performance is the same for 328 metrics, 2 unstable metrics.
Datadog Report
Branch report: evan.li/ragas-faithfulness
Commit report: 23753ec
Test service: dd-trace-py
:white_check_mark: 0 Failed, 1286 Passed, 0 Skipped, 33m 34.08s Total duration (6m 45.38s time saved)