
GenAI user feedback evaluation

Open truptiparkar7 opened this issue 1 year ago • 7 comments

Changes

This PR adds details for a user feedback event which can be used for evaluation purposes.

Note: if the PR is touching an area that is not listed in the existing areas, or the area does not have sufficient domain experts coverage, the PR might be tagged as experts needed and move slowly until experts are identified.

Merge requirement checklist

truptiparkar7 avatar Aug 06 '24 18:08 truptiparkar7


I will share some thoughts on the challenges we (at Langtrace) faced while implementing this:

For some context, user feedback evaluations are generally collected as a thumbs up or thumbs down on LLM generations (typically in a chatbot) to understand model performance, so this is a critical requirement for folks building with LLMs today.

Challenges:

  1. Because the feedback can only be collected after the LLM generates the response, the span for the LLM generation has already ended by the time the feedback arrives. And today, as far as I can tell, there is no OTel-native way to attach an attribute to a span that has already ended (there is no API to do this).

  2. As a result, at Langtrace, we decided to pass the spanId of the LLM generation span to the application layer through a higher-order function/decorator, which the application developer needs to use in order to capture user feedback scores (see the sketch after this list). At the application layer, the developer has access to the spanId, which is then used to attach the user feedback score and other metadata such as a user ID that uniquely identifies the user who gave this feedback.

  3. Now, at this stage, you have two options: either generate a new span that's a child of this span (which is very tricky to establish) or store the evaluation against the spanId in a completely separate metadata store. We went with the latter approach for a few reasons:

  • Creating a new child span was very tricky to make work, especially for streaming responses or when using other implementations of the LLM SDK (like the Vercel AI SDK).
  • Attaching the feedback to the span by exposing a vendor-specific API on the database that stores the span was expensive and difficult to maintain (also, as a general rule of thumb, we weren't comfortable mutating trace data after generation).
  • For conversations happening in a single session, this approach ends up creating multiple feedback spans, and when users change their feedback for the same generated response, we end up creating more than one span linked to the same response ID or span ID; it's then impossible to know what the actual feedback is unless you sort the spans by creation time, which was not clean.
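
To make the decorator approach concrete, here is a minimal sketch using the OpenTelemetry Python API. This is not Langtrace's actual SDK code; `with_llm_span`, `send_user_feedback`, and the `llm.generation` span name are hypothetical, purely for illustration.

```python
from functools import wraps

from opentelemetry import trace

tracer = trace.get_tracer("feedback-demo")


def with_llm_span(fn):
    """Wrap an LLM call in a span and hand its span_id to the application."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span("llm.generation") as span:
            span_id = format(span.get_span_context().span_id, "016x")
            # Pass the span_id through so the caller can reference it later.
            return fn(*args, span_id=span_id, **kwargs)
    return wrapper


def send_user_feedback(span_id, score, user_id):
    # Placeholder: persist the feedback in an external metadata store keyed
    # by span_id, instead of trying to mutate the already-ended span.
    print({"span_id": span_id, "score": score, "user_id": user_id})


@with_llm_span
def generate_answer(prompt, span_id=None):
    # ... call the LLM here; span_id is available to return to the UI ...
    return f"answer to: {prompt}", span_id


answer, sid = generate_answer("hello")
send_user_feedback(sid, score=1, user_id="user-123")  # thumbs up
```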

If you are curious to learn more about how we implemented this, see the links below:

  • Docs: https://docs.langtrace.ai/tracing/trace_user_feedback#understanding-user-feedback
  • SendUserFeedback API which sends the feedback to an external data store - https://github.com/Scale3-Labs/langtrace-python-sdk/blob/c024295ccf8c2fc9ecb13714826c2b5c12deb010/src/langtrace_python_sdk/utils/with_root_span.py#L180
  • A decorator that attaches the spanId of the span created as a result of the LLM generation and allows the application to access it as a function parameter - https://github.com/Scale3-Labs/langtrace-python-sdk/blob/c024295ccf8c2fc9ecb13714826c2b5c12deb010/src/langtrace_python_sdk/utils/with_root_span.py#L67

karthikscale3 avatar Aug 14 '24 18:08 karthikscale3

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Sep 13 '24 03:09 github-actions[bot]

We at Scorecard.io use OTel for tracing, but our user feedback and model-graded (LLM-as-judge) evaluations are stored separately and linked, so we would love a natural way for this data model to be standardized.

Karthik makes great points (and we have similar requirements) about evaluations needing to be supported asynchronously from span generation time.

Rutledge avatar Oct 06 '24 08:10 Rutledge

> We at Scorecard.io use OTel for tracing, but our user feedback and model-graded (LLM-as-judge) evaluations are stored separately and linked, so we would love a natural way for this data model to be standardized.
>
> Karthik makes great points (and we have similar requirements) about evaluations needing to be supported asynchronously from span generation time.

Yeah, that's the number one challenge. We are doing it exactly the same way at the moment: evals are stored in a separate model and linked to the original trace using the span_id as a foreign key.
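
For illustration, a minimal sketch of what such a separate evaluation store could look like (a hypothetical schema, not Langtrace's or Scorecard's actual data model):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE evaluations (
    id         INTEGER PRIMARY KEY,
    span_id    TEXT NOT NULL,   -- hex span id of the LLM generation span
    trace_id   TEXT,            -- optional, for trace-level lookups
    name       TEXT NOT NULL,   -- e.g. 'user_feedback'
    score      REAL,            -- e.g. 1 = thumbs up, 0 = thumbs down
    user_id    TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

# Record a feedback event; reading the newest row per span_id gives the
# current feedback even if the user later changes their vote.
conn.execute(
    "INSERT INTO evaluations (span_id, name, score, user_id) VALUES (?, ?, ?, ?)",
    ("a1b2c3d4e5f60718", "user_feedback", 1.0, "user-123"),
)
conn.commit()
```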

karthikscale3 avatar Oct 09 '24 16:10 karthikscale3

> We at Scorecard.io use OTel for tracing, but our user feedback and model-graded (LLM-as-judge) evaluations are stored separately and linked, so we would love a natural way for this data model to be standardized. Karthik makes great points (and we have similar requirements) about evaluations needing to be supported asynchronously from span generation time.
>
> Yeah, that's the number one challenge. We are doing it exactly the same way at the moment: evals are stored in a separate model and linked to the original trace using the span_id as a foreign key.

+1, see my comment above. I think this also helps to correlate scores with non-LLM calls, which is useful.

marcklingen avatar Oct 09 '24 18:10 marcklingen

> We at Scorecard.io use OTel for tracing, but our user feedback and model-graded (LLM-as-judge) evaluations are stored separately and linked, so we would love a natural way for this data model to be standardized. Karthik makes great points (and we have similar requirements) about evaluations needing to be supported asynchronously from span generation time.
>
> Yeah, that's the number one challenge. We are doing it exactly the same way at the moment: evals are stored in a separate model and linked to the original trace using the span_id as a foreign key.
>
> +1, see my comment above. I think this also helps to correlate scores with non-LLM calls, which is useful.

We need a correlation (or correlations) that also works when span_id is not available. The trace context is not available in all situations where evaluation scores or feedback are captured. There could also be other correlations in a system (response_id, session_id, turn_id) that are meaningful to a particular application or toolset.

Is there a straightforward way to offer more than one option in the conventions (response_id, span_id, turn_id, etc.)? I'd think you would want to require that at least one be present.
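
As a rough sketch of that "at least one correlation id" rule, a feedback event could carry several optional correlation attributes and be rejected if none is set. The attribute names below are placeholders, not approved conventions:

```python
# Placeholder attribute names, not approved semantic conventions.
CORRELATION_KEYS = ("span_id", "gen_ai.response.id", "session.id", "turn_id")


def validate_feedback_event(attributes):
    """Require at least one correlation identifier on the feedback event."""
    if not any(attributes.get(key) for key in CORRELATION_KEYS):
        raise ValueError(
            "feedback event must carry at least one of: " + ", ".join(CORRELATION_KEYS)
        )


# Valid: correlated by response id even though no span context was available.
validate_feedback_event({"gen_ai.response.id": "resp-42", "score": 1})
```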

drewby avatar Oct 10 '24 05:10 drewby

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Nov 02 '24 03:11 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Nov 10 '24 03:11 github-actions[bot]

Hi all 👋 I’m a maintainer of Arize-Phoenix and OpenInference. Just want to share some experiences with evaluations as they pertain to instrumentation, in case it is helpful.

Both the Arize SaaS platform and Phoenix open-source application accept human evaluations submitted via the UI or programmatic evaluations (LLM or code). Since these evaluations come after the span has ended, OpenInference doesn’t attach evaluations to the OTel span under scrutiny itself. Rather, we maintain separate evaluation tables with foreign key relations back to the spans table (in the case of Phoenix, which uses a relational DB) or keep the evaluations as columns in the spans table (in the case of Arize, which uses an OLAP DB). LLM evaluations are traced in the same way as LLM calls in the application, and OpenInference doesn’t currently have semantic conventions specifically related to evaluations.

We think of an evaluation as comprising:

  • name (required)
  • label (optional)
  • score (optional)
  • explanation (optional)

This attempts to capture that some evaluations are categorical, some numeric, some both, and some accompanied by a human- or LLM-generated explanation. Given the generic nature of the evaluations we ingest, we don’t place max/min limits on the score.
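
For illustration only, that shape could be written down as something like the following (a hypothetical type, not part of OpenInference or the OTel conventions):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evaluation:
    """Generic evaluation record: only the name is required."""
    name: str                          # e.g. "hallucination", "user_feedback"
    label: Optional[str] = None        # categorical result, e.g. "factual"
    score: Optional[float] = None      # numeric result; no enforced min/max
    explanation: Optional[str] = None  # human- or LLM-generated rationale


thumbs_up = Evaluation(name="user_feedback", label="thumbs_up", score=1.0)
```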

Much of what we evaluate is not just LLM calls, but also chains, retrievals made via RAG, and entire traces or “sessions” (groups of traces corresponding to a back-and-forth conversation between a user and the application). So we allow evaluations to be attached not just to LLM spans, but to any span kind defined in the OpenInference spec (typically, we attach trace-level evaluations to the root span). The evaluation interface is pretty consistent no matter what we’re evaluating.

axiomofjoy avatar Nov 18 '24 01:11 axiomofjoy

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Dec 03 '24 03:12 github-actions[bot]

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions[bot] avatar Dec 10 '24 03:12 github-actions[bot]