phase 2 evaluation framework issues
issue 1: evaluation framework storage abstraction layer (SAL) integration
description:
right now, the evaluation framework's result persistence uses a basic JsonFileStorageProvider instantiated directly in the example script. this works for simple cases but doesn't align with agentdock's broader storage abstraction layer (sal) philosophy and limits us for production/scaled deployments.
my experience building systems like this is clear: what starts as a temporary file logger never stays temporary if you don't plan the abstraction early. we need to properly integrate evaluation storage with the storage abstraction layer (sal) to enable robust, configurable persistence suitable for both oss and future commercial needs. this provides the necessary foundation for storing results in databases (like postgres via vercel kv/postgres, supabase) or other backends managed by the SAL later on.
goals:
- define SAL integration: determine the right way to integrate the `EvaluationStorageProvider` concept with the existing storage abstraction layer (SAL). should evaluators use the SAL directly, or should the `EvaluationRunner` be responsible for passing results to the SAL via a configured provider mechanism? leaning towards the runner handling it to keep evaluators focused.
- refactor `EvaluationStorageProvider` interface: update the interface (if necessary) and potentially the `JsonFileStorageProvider` implementation to align with SAL patterns. this might involve making persistence configurable via environment variables or the main agentdock configuration, rather than direct path instantiation.
- update runner logic: modify the `EvaluationRunner` to interact with the SAL-integrated storage provider mechanism.
- update documentation/examples: reflect the new SAL-based storage approach in the evaluation framework docs and examples.
acceptance criteria:
- a clear mechanism exists for the `EvaluationRunner` to persist results using the storage abstraction layer (SAL) when configured to do so.
- the `EvaluationStorageProvider` interface (if changed) and the interaction pattern are documented.
- direct file path instantiation (like in the current example) is deprecated or removed in favor of SAL-based configuration.
- the `JsonFileStorageProvider` is updated to work with the new mechanism (potentially as a default SAL provider option for file-based logging if needed).
- documentation clearly explains how to configure evaluation result persistence via the storage abstraction layer (SAL).
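to make the "runner owns persistence" idea concrete, here's a minimal sketch. all names and shapes here are assumptions for illustration, not the actual agentdock-core API: the runner receives a configured provider, evaluators never touch storage, and the file-based default is configured via the environment instead of a hard-coded path.

```typescript
import { appendFile } from 'node:fs/promises';

// assumed shape for illustration only
interface AggregatedEvaluationResult {
  sessionId: string;
  overallScore: number;
}

interface EvaluationStorageProvider {
  saveResult(result: AggregatedEvaluationResult): Promise<void>;
}

// a file-based default: path comes from config/env, not the example script
class JsonFileStorageProvider implements EvaluationStorageProvider {
  constructor(
    private readonly path = process.env.EVAL_RESULTS_PATH ?? './eval-results.jsonl'
  ) {}

  async saveResult(result: AggregatedEvaluationResult): Promise<void> {
    // one JSON document per line so results can be appended cheaply
    await appendFile(this.path, JSON.stringify(result) + '\n');
  }
}

class EvaluationRunner {
  constructor(private readonly storage?: EvaluationStorageProvider) {}

  async run(sessionId: string): Promise<AggregatedEvaluationResult> {
    const result = { sessionId, overallScore: 0.8 }; // placeholder evaluation
    await this.storage?.saveResult(result); // persistence is optional and pluggable
    return result;
  }
}
```

the point of the shape: swapping `JsonFileStorageProvider` for a postgres- or kv-backed SAL provider later changes only the constructor argument, not the runner or any evaluator.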
issue 2: integrate evaluation framework into open source client via http adapter
description:
the evaluation framework is built in agentdock-core, but needs to be callable from the frontend (both oss client and potentially other commercial clients later). we need a clean api interface for this.
the pattern established in platform-integration.md using an HttpAdapter is the way to go. this keeps the core api logic separate from the specific web framework (next.js now, maybe hono/express later), which is essential. building framework-specific api logic is a recipe for duplication and maintenance headaches; my experience screams: avoid that.
this issue covers creating the api endpoint and basic ui interaction to run an evaluation for a given session and display its results, using this adapter pattern.
goals:
- api endpoint: define and implement a route, likely `post /api/evaluations/[sessionId]`, to trigger an evaluation run.
- http adapter implementation: create or reuse an `HttpAdapter` interface (similar to the one in platform-integration.md) and implement it for the current next.js setup. this adapter will handle parsing the incoming request (getting `sessionId`, maybe basic config from the body) and formatting the outgoing response.
- core logic handler: write the framework-agnostic handler function that the adapter calls. this function will:
  - take the `sessionId` and any necessary config.
  - fetch the required data to build the `EvaluationInput`.
  - call the `agentdock-core` `runEvaluation` function.
  - return the `AggregatedEvaluationResult`.
- response handling: the `HttpAdapter` implementation takes the `AggregatedEvaluationResult` from the core handler and creates the appropriate http response for the specific framework (e.g., a next.js `Response` object).
  - critical: ensure the result object is properly serialized to json (`JSON.stringify()`) before sending, as http adapters might otherwise mishandle raw objects (ref: axios http adapter issue).
- ui integration:
  - add a button/action in the session view to call the new api endpoint.
  - display the returned `AggregatedEvaluationResult` (overall score, individual results) in a clear way. basic display first, fancy later.
  - handle potential long request times gracefully in the ui (e.g., loading indicator), recognizing that llm calls within `runEvaluation` can take a while. the adapter pattern itself doesn't dictate async job queues, but the ui needs to not hang.
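a rough sketch of the split described above. the interface and type names are assumptions loosely based on the platform-integration.md pattern, not the actual code: the core handler knows nothing about the web framework, the adapter does the request parsing and the explicit `JSON.stringify()`.

```typescript
// assumed shape for illustration only
interface AggregatedEvaluationResult {
  sessionId: string;
  overallScore: number;
}

// framework-agnostic core handler: all evaluation logic lives here.
// real version would build EvaluationInput from session data and call runEvaluation.
async function handleEvaluationRequest(
  sessionId: string
): Promise<AggregatedEvaluationResult> {
  return { sessionId, overallScore: 0.9 }; // placeholder
}

// minimal adapter contract: parse framework request, call core, format response
interface HttpAdapter<Req, Res> {
  handle(req: Req): Promise<Res>;
}

// simplified request/response stand-ins for the framework types
type SimpleRequest = { params: { sessionId: string } };
type SimpleResponse = { status: number; body: string };

const evaluationAdapter: HttpAdapter<SimpleRequest, SimpleResponse> = {
  async handle(req) {
    const result = await handleEvaluationRequest(req.params.sessionId);
    // explicit serialization guards against adapters mishandling raw objects
    return { status: 200, body: JSON.stringify(result) };
  }
};
```

with this split, the next.js route file reduces to a one-liner that delegates to the adapter, and a future hono/express adapter reuses `handleEvaluationRequest` untouched.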
acceptance criteria:
- a `post /api/evaluations/[sessionId]` endpoint exists and uses an `HttpAdapter` pattern.
- the endpoint correctly triggers `runEvaluation` using data for the specified session.
- the `AggregatedEvaluationResult` is returned as a properly serialized json response via the adapter.
- the ui allows triggering the evaluation for a session.
- the ui displays the returned evaluation results.
- the solution avoids framework-specific logic outside the adapter implementation.
issue 3: convert evaluation framework test todos to issues
description:
the initial testing phase (phase 1.7) identified several areas where tests were skipped or marked with todos due to complexity or suspected library issues. specifically:
- the context variable assertion in the `LLMJudgeEvaluator` test (llm/__tests__/judge.test.ts).
- the jaro-winkler algorithm test in the `LexicalSimilarityEvaluator` test (lexical/__tests__/similarity.test.ts).
- potentially other minor todos left during testing.
we need to convert these remaining todos into trackable github issues to ensure they are addressed properly and don't get lost. leaving todos in code indefinitely is bad practice.
goals:
- scan code: systematically review `agentdock-core/src/evaluation/**/*.test.ts` files for any remaining `TODO` comments or skipped tests (`it.skip`, `describe.skip`).
- create issues: for each identified todo/skip related to test functionality or coverage, create a dedicated, specific github issue.
- link issues: reference the relevant code locations in each issue.
- remove todos: once issues are created, remove the corresponding `TODO` comments from the codebase.
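the scan step is a one-line grep. the demo below is self-contained (it creates a sample file in a temp dir); a real run would point at `agentdock-core/src/evaluation` instead.

```shell
# self-contained demo: create a sample test file, then grep for TODO
# comments and skipped tests the way the real scan would
demo=$(mktemp -d)
cat > "$demo/similarity.test.ts" <<'EOF'
// TODO: enable once the jaro-winkler edge case is fixed
it.skip('jaro-winkler similarity', () => {});
EOF
grep -rnE 'TODO|it\.skip|describe\.skip' "$demo" --include='*.test.ts'
```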
acceptance criteria:
- all test-related `TODO` comments within the `agentdock-core/src/evaluation/` directory are removed.
- specific github issues exist tracking the work needed to enable skipped tests or address test coverage gaps identified by previous todos.
issue 4: enhance evaluation score normalization
description:
the current score normalization logic in the EvaluationRunner handles the basic scales defined in phase 1 (binary, likert5, pass/fail, some numeric 0-1/0-100). however, it could be more robust and handle a wider range of numeric inputs or custom scales more gracefully.
the pr review noted the need to extend this logic in phase 2 to handle more numeric ranges consistently as part of improving aggregation. this came from coderabbit; imo it's not a non-issue, but we'll take a look.
goals:
- review existing logic: analyze the current `normalizeEvaluationScore` function in runner/index.ts.
- define extended requirements: determine what other numeric ranges (e.g., -1 to 1, arbitrary ranges) or custom scale patterns need explicit normalization rules for meaningful aggregation into `overallScore`.
- implement enhancements: update the normalization logic to handle the extended requirements reliably.
- add tests: create unit tests specifically covering the new and existing normalization cases.
acceptance criteria:
-
normalizeEvaluationScorefunction can handle a wider, clearly defined set of numeric input score ranges. - clear rules exist (and are documented) for how different scales are normalized (or explicitly excluded from numeric aggregation).
- unit tests cover the extended normalization logic.
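one possible shape for the extended logic, assuming nothing about the current `normalizeEvaluationScore` signature: model arbitrary numeric ranges as `{ min, max }` and linearly rescale into [0, 1], returning `null` for anything that shouldn't feed numeric aggregation.

```typescript
// all names here are a sketch, not the actual runner API
type ScoreScale =
  | 'binary'
  | 'pass/fail'
  | 'likert5'
  | { min: number; max: number }; // arbitrary numeric range, e.g. { min: -1, max: 1 }

function normalizeScore(
  score: number | boolean | string,
  scale: ScoreScale
): number | null {
  if (scale === 'binary' || scale === 'pass/fail') {
    if (typeof score === 'boolean') return score ? 1 : 0;
    if (score === 'pass') return 1;
    if (score === 'fail') return 0;
    return null; // unrecognized value on a pass/fail scale
  }
  if (scale === 'likert5') {
    // map 1..5 onto 0..1; out-of-range values are excluded from aggregation
    return typeof score === 'number' && score >= 1 && score <= 5
      ? (score - 1) / 4
      : null;
  }
  // arbitrary numeric range: clamp, then linearly rescale to [0, 1]
  if (typeof score === 'number' && scale.max > scale.min) {
    const clamped = Math.min(Math.max(score, scale.min), scale.max);
    return (clamped - scale.min) / (scale.max - scale.min);
  }
  return null; // non-numeric score or degenerate range
}
```

the `null` return is the important design choice: it gives the runner an explicit "exclude from `overallScore`" signal instead of silently coercing weird inputs to 0.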
issue 5: explore monitoring/observability integration for evaluation metrics
description:
currently, evaluation results are either returned directly or logged to a file. for production monitoring and deeper analysis, integrating these structured evaluation metrics (EvaluationResult, AggregatedEvaluationResult) with standard monitoring and observability systems (like opentelemetry, datadog, etc.) would be highly valuable.
this involves exploring how to best emit these metrics.
goals:
- research integration points: investigate how other frameworks emit structured metric data. opentelemetry semantic conventions might be relevant.
-
prototype exporter: potentially create a proof-of-concept
EvaluationStorageProvideror a separate mechanism within theEvaluationRunnerthat formats and exports evaluation results as structured logs or metrics compatible with a common observability backend. - define standard payload: determine a standard, useful subset of evaluation data to export as metrics/traces (e.g., scores per criterion, overall score, duration, error flags, key metadata).
acceptance criteria:
- a clear recommendation or design proposal exists for how evaluation metrics can be integrated with standard observability platforms.
- (stretch) a basic prototype demonstrates exporting key evaluation metrics in a standard format (e.g., opentelemetry logs/metrics).
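as a starting point for the "standard payload" discussion, a flattening like the one below could work. the attribute keys are assumptions loosely styled after opentelemetry's dotted-name conventions, and the result shapes are placeholders, not the real types.

```typescript
// placeholder shapes for illustration only
interface EvaluationResult {
  criterion: string;
  score: number;
  error?: boolean;
}

interface AggregatedEvaluationResult {
  sessionId: string;
  overallScore: number;
  durationMs: number;
  results: EvaluationResult[];
}

// flatten one run into the kind of attribute map an exporter would attach
// to a log record or metric data point
function toMetricAttributes(
  agg: AggregatedEvaluationResult
): Record<string, string | number | boolean> {
  const attrs: Record<string, string | number | boolean> = {
    'evaluation.session_id': agg.sessionId,
    'evaluation.overall_score': agg.overallScore,
    'evaluation.duration_ms': agg.durationMs
  };
  for (const r of agg.results) {
    attrs[`evaluation.criterion.${r.criterion}.score`] = r.score;
    attrs[`evaluation.criterion.${r.criterion}.error`] = r.error ?? false;
  }
  return attrs;
}
```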
other phase 2 items (lower priority / future issues / might create more issues from these):
- sophisticated aggregation & reporting: allow for more complex aggregation strategies beyond weighted average, configurable reporting formats, or basic statistical analysis on results.
- agent-level evaluation paradigms: develop patterns or specialized evaluators for assessing multi-turn conversation quality, complex task completion across multiple steps, or agent adherence to long-term goals.
- configuration from files: allow loading `EvaluationRunConfig` (or parts of it, like criteria sets or evaluator profiles) from static files (e.g., json, ts) for easier management of standard evaluation suites.
- implement runner validation: complete the `validateEvaluatorConfigs` logic in the `EvaluationRunner` to ensure evaluator configurations are valid before execution.
- implement advanced evaluator features: address phase 2 todos within existing evaluators, such as advanced llm judge configurations (e.g., reference-free `LLMJudgeEvaluator`), sequence checking in `ToolUsageEvaluator`, and potentially adding more algorithms or features to the lexical suite.
this chatgpt deep research comparison is useful context as we look at phase 2: https://chatgpt.com/s/dr_681c5b68ae58819195d981266ea11d6c.
basically confirms our phase 1 focus on a solid core was right, and highlights why things like the sal integration (#158 issue 1) and observability (#158 issue 5) are important next steps to align better with broader industry tooling.