phase 2 evaluation framework issues
issue 1: evaluation framework storage abstraction layer (SAL) integration
description:
right now, the evaluation framework's result persistence uses a basic JsonFileStorageProvider instantiated directly in the example script. this works for simple cases but doesn't align with agentdock's broader storage abstraction layer (sal) philosophy and limits us for production/scaled deployments.
my experience building systems like this is clear: what starts as a temporary file logger never stays temporary if you don't plan the abstraction early. we need to properly integrate evaluation storage with the storage abstraction layer (sal) to enable robust, configurable persistence suitable for both oss and future commercial needs. this provides the necessary foundation for storing results in databases (like postgres via vercel kv/postgres, supabase) or other backends managed by the SAL later on.
goals:
- define SAL integration: determine the right way to integrate the `EvaluationStorageProvider` concept with the existing storage abstraction layer (SAL). should evaluators use the SAL directly, or should the `EvaluationRunner` be responsible for passing results to the SAL via a configured provider mechanism? leaning towards the runner handling it to keep evaluators focused.
- refactor `EvaluationStorageProvider` interface: update the interface (if necessary) and potentially the `JsonFileStorageProvider` implementation to align with SAL patterns. this might involve making persistence configurable via environment variables or the main agentdock configuration, rather than direct path instantiation.
- update runner logic: modify the `EvaluationRunner` to interact with the SAL-integrated storage provider mechanism.
- update documentation/examples: reflect the new SAL-based storage approach in the evaluation framework docs and examples.
acceptance criteria:
- a clear mechanism exists for the `EvaluationRunner` to persist results using the storage abstraction layer (SAL) when configured to do so.
- the `EvaluationStorageProvider` interface (if changed) and the interaction pattern are documented.
- direct file path instantiation (like in the current example) is deprecated or removed in favor of SAL-based configuration.
- the `JsonFileStorageProvider` is updated to work with the new mechanism (potentially as a default SAL provider option for file-based logging if needed).
- documentation clearly explains how to configure evaluation result persistence via the storage abstraction layer (SAL).
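to make the "runner owns persistence" idea concrete, here's a minimal sketch. all names and shapes here are assumptions for illustration, not the actual agentdock-core API: the runner receives a configured provider, evaluators never touch storage, and the file-based default is configured via the environment instead of a hard-coded path.

```typescript
import { appendFile } from 'node:fs/promises';

// assumed shape for illustration only
interface AggregatedEvaluationResult {
  sessionId: string;
  overallScore: number;
}

interface EvaluationStorageProvider {
  saveResult(result: AggregatedEvaluationResult): Promise<void>;
}

// a file-based default: path comes from config/env, not the example script
class JsonFileStorageProvider implements EvaluationStorageProvider {
  constructor(
    private readonly path = process.env.EVAL_RESULTS_PATH ?? './eval-results.jsonl'
  ) {}

  async saveResult(result: AggregatedEvaluationResult): Promise<void> {
    // one JSON document per line so results can be appended cheaply
    await appendFile(this.path, JSON.stringify(result) + '\n');
  }
}

class EvaluationRunner {
  constructor(private readonly storage?: EvaluationStorageProvider) {}

  async run(sessionId: string): Promise<AggregatedEvaluationResult> {
    const result = { sessionId, overallScore: 0.8 }; // placeholder evaluation
    await this.storage?.saveResult(result); // persistence is optional and pluggable
    return result;
  }
}
```

the point of the shape: swapping `JsonFileStorageProvider` for a postgres- or kv-backed SAL provider later changes only the constructor argument, not the runner or any evaluator.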
issue 2: integrate evaluation framework into open source client via http adapter
description:
the evaluation framework is built in agentdock-core, but needs to be callable from the frontend (both oss client and potentially other commercial clients later). we need a clean api interface for this.
the pattern established in platform-integration.md using an HttpAdapter is the way to go. this keeps the core api logic separate from the specific web framework (next.js now, maybe hono/express later), which is essential. building framework-specific api logic is a recipe for duplication and maintenance headaches; my experience screams: avoid that.
this issue covers creating the api endpoint and basic ui interaction to run an evaluation for a given session and display its results, using this adapter pattern.
goals:
- api endpoint: define and implement a route, likely `post /api/evaluations/[sessionId]`, to trigger an evaluation run.
- http adapter implementation: create or reuse an `HttpAdapter` interface (similar to the one in platform-integration.md) and implement it for the current next.js setup. this adapter will handle parsing the incoming request (getting `sessionId`, maybe basic config from the body) and formatting the outgoing response.
- core logic handler: write the framework-agnostic handler function that the adapter calls. this function will:
  - take the `sessionId` and any necessary config.
  - fetch the required data to build the `EvaluationInput`.
  - call the `agentdock-core` `runEvaluation` function.
  - return the `AggregatedEvaluationResult`.
- response handling: the `HttpAdapter` implementation takes the `AggregatedEvaluationResult` from the core handler and creates the appropriate http response for the specific framework (e.g., a next.js `Response` object).
  - critical: ensure the result object is properly serialized to json (`JSON.stringify()`) before sending, as http adapters might otherwise mishandle raw objects (ref: axios http adapter issue).
- ui integration:
  - add a button/action in the session view to call the new api endpoint.
  - display the returned `AggregatedEvaluationResult` (overall score, individual results) in a clear way. basic display first, fancy later.
  - handle potential long request times gracefully in the ui (e.g., loading indicator), recognizing that llm calls within `runEvaluation` can take a while. the adapter pattern itself doesn't dictate async job queues, but the ui needs to not hang.
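a rough sketch of the split described above. the interface and type names are assumptions loosely based on the platform-integration.md pattern, not the actual code: the core handler knows nothing about the web framework, the adapter does the request parsing and the explicit `JSON.stringify()`.

```typescript
// assumed shape for illustration only
interface AggregatedEvaluationResult {
  sessionId: string;
  overallScore: number;
}

// framework-agnostic core handler: all evaluation logic lives here.
// real version would build EvaluationInput from session data and call runEvaluation.
async function handleEvaluationRequest(
  sessionId: string
): Promise<AggregatedEvaluationResult> {
  return { sessionId, overallScore: 0.9 }; // placeholder
}

// minimal adapter contract: parse framework request, call core, format response
interface HttpAdapter<Req, Res> {
  handle(req: Req): Promise<Res>;
}

// simplified request/response stand-ins for the framework types
type SimpleRequest = { params: { sessionId: string } };
type SimpleResponse = { status: number; body: string };

const evaluationAdapter: HttpAdapter<SimpleRequest, SimpleResponse> = {
  async handle(req) {
    const result = await handleEvaluationRequest(req.params.sessionId);
    // explicit serialization guards against adapters mishandling raw objects
    return { status: 200, body: JSON.stringify(result) };
  }
};
```

with this split, the next.js route file reduces to a one-liner that delegates to the adapter, and a future hono/express adapter reuses `handleEvaluationRequest` untouched.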
acceptance criteria:
- a `post /api/evaluations/[sessionId]` endpoint exists and uses an `HttpAdapter` pattern.
- the endpoint correctly triggers `runEvaluation` using data for the specified session.
- the `AggregatedEvaluationResult` is returned as a properly serialized json response via the adapter.
- the ui allows triggering the evaluation for a session.
- the ui displays the returned evaluation results.
- the solution avoids framework-specific logic outside the adapter implementation.
issue 3: convert evaluation framework test todos to issues
description:
the initial testing phase (phase 1.7) identified several areas where tests were skipped or marked with todos due to complexity or suspected library issues. specifically:
- the context variable assertion in the `LLMJudgeEvaluator` test (llm/__tests__/judge.test.ts).
- the jaro-winkler algorithm test in the `LexicalSimilarityEvaluator` test (lexical/__tests__/similarity.test.ts).
- potentially other minor todos left during testing.
we need to convert these remaining todos into trackable github issues to ensure they are addressed properly and don't get lost. leaving todos in code indefinitely is bad practice.
goals:
- scan code: systematically review `agentdock-core/src/evaluation/**/*.test.ts` files for any remaining `TODO` comments or skipped tests (`it.skip`, `describe.skip`).
- create issues: for each identified todo/skip related to test functionality or coverage, create a dedicated, specific github issue.
- link issues: reference the relevant code locations in each issue.
- remove todos: once issues are created, remove the corresponding `TODO` comments from the codebase.
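the scan step is a one-line grep. the demo below is self-contained (it creates a sample file in a temp dir); a real run would point at `agentdock-core/src/evaluation` instead.

```shell
# self-contained demo: create a sample test file, then grep for TODO
# comments and skipped tests the way the real scan would
demo=$(mktemp -d)
cat > "$demo/similarity.test.ts" <<'EOF'
// TODO: enable once the jaro-winkler edge case is fixed
it.skip('jaro-winkler similarity', () => {});
EOF
grep -rnE 'TODO|it\.skip|describe\.skip' "$demo" --include='*.test.ts'
```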
acceptance criteria:
- all test-related `TODO` comments within the `agentdock-core/src/evaluation/` directory are removed.
- specific github issues exist tracking the work needed to enable skipped tests or address test coverage gaps identified by previous todos.
issue 4: enhance evaluation score normalization
description:
the current score normalization logic in the EvaluationRunner handles the basic scales defined in phase 1 (binary, likert5, pass/fail, some numeric 0-1/0-100). however, it could be more robust and handle a wider range of numeric inputs or custom scales more gracefully.
the pr review noted the need to extend this logic in phase 2 to handle more numeric ranges consistently as part of improving aggregation. this came from coderabbit; imo it's not a non-issue, but we'll take a look.
goals:
- review existing logic: analyze the current `normalizeEvaluationScore` function in runner/index.ts.
- define extended requirements: determine what other numeric ranges (e.g., -1 to 1, arbitrary ranges) or custom scale patterns need explicit normalization rules for meaningful aggregation into `overallScore`.
- implement enhancements: update the normalization logic to handle the extended requirements reliably.
- add tests: create unit tests specifically covering the new and existing normalization cases.
acceptance criteria:
-
normalizeEvaluationScorefunction can handle a wider, clearly defined set of numeric input score ranges. - clear rules exist (and are documented) for how different scales are normalized (or explicitly excluded from numeric aggregation).
- unit tests cover the extended normalization logic.
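one possible shape for the extended logic, assuming nothing about the current `normalizeEvaluationScore` signature: model arbitrary numeric ranges as `{ min, max }` and linearly rescale into [0, 1], returning `null` for anything that shouldn't feed numeric aggregation.

```typescript
// all names here are a sketch, not the actual runner API
type ScoreScale =
  | 'binary'
  | 'pass/fail'
  | 'likert5'
  | { min: number; max: number }; // arbitrary numeric range, e.g. { min: -1, max: 1 }

function normalizeScore(
  score: number | boolean | string,
  scale: ScoreScale
): number | null {
  if (scale === 'binary' || scale === 'pass/fail') {
    if (typeof score === 'boolean') return score ? 1 : 0;
    if (score === 'pass') return 1;
    if (score === 'fail') return 0;
    return null; // unrecognized value on a pass/fail scale
  }
  if (scale === 'likert5') {
    // map 1..5 onto 0..1; out-of-range values are excluded from aggregation
    return typeof score === 'number' && score >= 1 && score <= 5
      ? (score - 1) / 4
      : null;
  }
  // arbitrary numeric range: clamp, then linearly rescale to [0, 1]
  if (typeof score === 'number' && scale.max > scale.min) {
    const clamped = Math.min(Math.max(score, scale.min), scale.max);
    return (clamped - scale.min) / (scale.max - scale.min);
  }
  return null; // non-numeric score or degenerate range
}
```

the `null` return is the important design choice: it gives the runner an explicit "exclude from `overallScore`" signal instead of silently coercing weird inputs to 0.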
issue 5: explore monitoring/observability integration for evaluation metrics
description:
currently, evaluation results are either returned directly or logged to a file. for production monitoring and deeper analysis, integrating these structured evaluation metrics (EvaluationResult, AggregatedEvaluationResult) with standard monitoring and observability systems (like opentelemetry, datadog, etc.) would be highly valuable.
this involves exploring how to best emit these metrics.
goals:
- research integration points: investigate how other frameworks emit structured metric data. opentelemetry semantic conventions might be relevant.
-
prototype exporter: potentially create a proof-of-concept
EvaluationStorageProvideror a separate mechanism within theEvaluationRunnerthat formats and exports evaluation results as structured logs or metrics compatible with a common observability backend. - define standard payload: determine a standard, useful subset of evaluation data to export as metrics/traces (e.g., scores per criterion, overall score, duration, error flags, key metadata).
acceptance criteria:
- a clear recommendation or design proposal exists for how evaluation metrics can be integrated with standard observability platforms.
- (stretch) a basic prototype demonstrates exporting key evaluation metrics in a standard format (e.g., opentelemetry logs/metrics).
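as a starting point for the "standard payload" discussion, a flattening like the one below could work. the attribute keys are assumptions loosely styled after opentelemetry's dotted-name conventions, and the result shapes are placeholders, not the real types.

```typescript
// placeholder shapes for illustration only
interface EvaluationResult {
  criterion: string;
  score: number;
  error?: boolean;
}

interface AggregatedEvaluationResult {
  sessionId: string;
  overallScore: number;
  durationMs: number;
  results: EvaluationResult[];
}

// flatten one run into the kind of attribute map an exporter would attach
// to a log record or metric data point
function toMetricAttributes(
  agg: AggregatedEvaluationResult
): Record<string, string | number | boolean> {
  const attrs: Record<string, string | number | boolean> = {
    'evaluation.session_id': agg.sessionId,
    'evaluation.overall_score': agg.overallScore,
    'evaluation.duration_ms': agg.durationMs
  };
  for (const r of agg.results) {
    attrs[`evaluation.criterion.${r.criterion}.score`] = r.score;
    attrs[`evaluation.criterion.${r.criterion}.error`] = r.error ?? false;
  }
  return attrs;
}
```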
other phase 2 items (lower priority / future issues / might create more issues from these):
- sophisticated aggregation & reporting: allow for more complex aggregation strategies beyond weighted average, configurable reporting formats, or basic statistical analysis on results.
- agent-level evaluation paradigms: develop patterns or specialized evaluators for assessing multi-turn conversation quality, complex task completion across multiple steps, or agent adherence to long-term goals.
- configuration from files: allow loading `EvaluationRunConfig` (or parts of it, like criteria sets or evaluator profiles) from static files (e.g., json, ts) for easier management of standard evaluation suites.
- implement runner validation: complete the `validateEvaluatorConfigs` logic in the `EvaluationRunner` to ensure evaluator configurations are valid before execution.
- implement advanced evaluator features: address phase 2 todos within existing evaluators, such as advanced llm judge configurations (e.g., reference-free `LLMJudgeEvaluator`), sequence checking in `ToolUsageEvaluator`, and potentially adding more algorithms or features to the lexical suite.
this chatgpt deep research comparison is useful context as we look at phase 2: https://chatgpt.com/s/dr_681c5b68ae58819195d981266ea11d6c.
basically confirms our phase 1 focus on a solid core was right, and highlights why things like the sal integration (#158 issue 1) and observability (#158 issue 5) are important next steps to align better with broader industry tooling.