agents Schema eval

Oct 28 '25 18:10 shagun-singh-inkeep

⚠️ No Changeset found

Latest commit: e9cdb2ece88b4b83810bc6d674e62b92612a2fbd

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Oct 28 '25 18:10 changeset-bot[bot]

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
agents-docs	Ready	Preview	Comment	Nov 6, 2025 5:44pm
agents-manage-api	Error			Nov 6, 2025 5:44pm
agents-manage-ui	Error			Nov 6, 2025 5:44pm
agents-run-api	Error			Nov 6, 2025 5:44pm

Oct 28 '25 18:10 vercel[bot]

Claude finished @shagun-singh-inkeep's task —— View job

TypeScript Staff Engineer Review

I've completed my comprehensive analysis of this schema evaluation PR. This introduces a significant evaluation framework with several concerning issues.

Todo List:

[x] Read and analyze the schema changes in packages/agents-core/src/db/schema.ts
[x] Research existing codebase patterns and related files for context
[x] Evaluate technical implementation and code quality
[x] Assess system architecture and design decisions
[x] Review customer impact and API considerations
[x] Provide structured feedback and recommendations

🚨 CRITICAL Issues

1. BREAKING CHANGE: Inconsistent Scoping Model

The PR fundamentally changes scoping patterns without justification
datasetItem and evalTestSuiteRun removed tenant scoping (tenantScoped → simple id primary key)
dataset, evaluator, evalTestSuiteConfig retain tenant scoping
This creates data isolation vulnerabilities where dataset items could theoretically be accessed across tenants

Reference: packages/agents-core/src/db/schema.ts:696-730, packages/agents-core/src/db/schema.ts:761-785

2. Broken Foreign Key Relationships

Relations updated to use simple field references instead of composite tenant+project keys
Example: evalResultRelations conversation lookup changed from [evalResult.tenantId, evalResult.projectId, evalResult.conversationId] to just [evalResult.conversationId]
Risk: Cross-tenant data access if conversation IDs collide
Impact: Security vulnerability in multi-tenant system

🔧 System Architecture Issues

3. Inconsistent Design Patterns

Existing codebase follows consistent tenant scoping pattern across all tables
This PR breaks the pattern without establishing clear architectural guidelines
Risk: Future developers will be confused about when to use which pattern
Maintenance burden: Two different scoping models to maintain

4. Missing Validation Schema Integration

New tables reference ModelSettings type but no corresponding validation schemas created
No integration with existing validation patterns in /packages/agents-core/src/validation/schemas.ts
Risk: Runtime type errors, inconsistent API validation

5. No Migration Strategy

No evidence of database migration files or strategy for existing data
Breaking change requires careful data migration planning

🔍 Technical Implementation Concerns

6. Relationship Complexity

// Overly complex evaluation configuration structure
export const conversationEvaluationConfigEvaluator = sqliteTable(
  'conversation_evaluation_config_evaluator', // Long table name
  {
    id: text('id').notNull(),
    conversationEvaluationConfigId: text('conversation_evaluation_config_id').notNull(),
    evaluatorId: text('evaluator_id').notNull(),
    // ...
  }

Junction table naming could be simplified
Consider conv_eval_config_evaluators for brevity while maintaining clarity

7. Inconsistent ID Strategy

Most tables use composite primary keys following tenant scoping
New tables mix simple id primary keys with foreign key references
Recommendation: Stick to one pattern consistently

8. JSON Schema Storage Without Validation

schema: blob('schema', { mode: 'json' }).$type<Record<string, unknown>>().notNull(),

Storing evaluation schemas as untyped JSON
No runtime validation ensures schema integrity
Risk: Invalid schemas could break evaluation pipelines

📊 Customer Impact Analysis

9. API Consistency Breaking

If this affects customer-facing APIs, the scoping changes will break existing integrations
Dataset and evaluation APIs will have different authentication/authorization patterns than other resources
Critical: Need explicit strategy for backward compatibility

10. Missing Feature Documentation

No clear understanding of the evaluation workflow from schema alone
Complex many-to-many relationships without usage examples
Risk: Difficult for API consumers to understand proper usage patterns

✅ Recommendations

Immediate Actions Required:

Revert scoping changes - Use consistent tenantScoped/projectScoped patterns for ALL new tables
Fix foreign key relationships - Maintain composite key relationships for proper tenant isolation
Add validation schemas - Create Zod schemas for all new types in /packages/agents-core/src/validation/schemas.ts
Create migration strategy - Document how existing data will be handled

Architectural Improvements:

Simplify table names - Consider shorter, clearer names (e.g., eval_config_evaluators)
Add proper typing - Replace Record<string, unknown> with proper typed interfaces
Document evaluation workflow - Add comprehensive documentation for the evaluation system

Security & Best Practices:

Tenant isolation audit - Ensure all new tables properly implement tenant boundaries
Add database constraints - Consider additional constraints on status fields and relationships
Performance indexing - Plan indexes for common query patterns

This evaluation framework addition is substantial but needs significant architectural consistency improvements before it can safely merge. The tenant scoping changes represent a critical security concern that must be addressed.

Verdict: ❌ Major revisions required - Security and consistency issues must be resolved.

Oct 28 '25 18:10 claude[bot]

@claude can you review this again?

want to make sure its clear the schemas and the relationships are clear

ignore the composite keys and the tenant scoping, ignore any indexing / perf concerns

Oct 28 '25 23:10 robert-inkeep

Claude finished @robert-inkeep's task —— View job

Schema Review: Focus on Clarity and Relationships

I've reviewed the evaluation framework schema additions, focusing specifically on clarity and relationships as requested (ignoring composite keys, tenant scoping, and performance).

Todo List:

[x] Read the schema changes in packages/agents-core/src/db/schema.ts
[x] Map out the evaluation framework structure and relationships
[x] Assess clarity of table names and field definitions
[x] Review relationship definitions for clarity
[x] Provide feedback focused on schema and relationship clarity

✅ Clear Aspects

Well-Structured Core Concepts

The evaluation framework introduces a clear hierarchy:

Dataset → DatasetItem (1:N) - Clear parent-child for test data
Evaluator - Self-contained evaluation definition with prompt and schema
EvalTestSuiteRun - Execution record linking dataset, agent, and config
EvalResult - Individual evaluation outcomes

Good Field Definitions

Clear status enums: 'pending'|'done'|'failed' (lines 769, 822)
Well-typed configuration objects with clear structure:
- simulationConfig with detailed agent simulation settings (lines 706-718)
- conversationFilter with multiple filtering options (lines 853-860)
Explicit model configuration fields using ModelSettings type

🔍 Areas Needing Clarity

1. Relationship Ambiguity: Two Evaluation Paths

The schema appears to support two different evaluation workflows, but their relationship is unclear:

Path A: Test Suite Evaluations

evalTestSuiteRun → evalTestSuiteRunEvaluator → evaluator
For running predefined test suites against agents

Path B: Conversation Evaluations

conversationEvaluationConfig → conversationEvaluationConfigEvaluator → evaluator
For evaluating live conversations

Issue: evalResult serves both paths (has both suiteRunId and conversationId), but this dual-purpose isn't clear from the schema structure.

2. Optional vs Required Relationships

In evalResult (lines 815-846):

suiteRunId is optional (text('suite_run_id'))
datasetItemId is optional
conversationId is required

This suggests evalResult can be:

A test suite result (with suiteRunId + datasetItemId)
A conversation evaluation result (with just conversationId)
Both? Neither? The schema doesn't make the valid combinations clear.

Recommendation: Consider adding a discriminator field like evaluationType: 'suite' | 'conversation' or splitting into two tables.

3. Unclear Table Naming

Long Junction Table Name:

conversationEvaluationConfigEvaluator (line 872) is verbose
Consider: conv_eval_config_evaluator or conversation_eval_evaluators

Inconsistent Pluralization:

eval_test_suite_run_evaluators (plural) vs other junction tables
Consider consistent naming convention

4. Missing Relationship Context

evalTestSuiteRun.agent relationship (lines 767, 780-783):

References agents.id directly without tenant/project context
While you asked to ignore scoping, this makes the relationship intent unclear
Does this mean a test suite can run against agents from different projects?

5. Schema Storage Without Clear Purpose

evaluator.schema (line 737):

Stores as Record<string, unknown>
Purpose unclear: Is this for validating evaluation results? Defining expected output format?
Consider adding a comment or more specific type

datasetItem.expectedOutput (line 705):

Array of messages but relationship to evaluator.schema unclear
How do these work together during evaluation?

📋 Recommendations for Clarity

1. Add Discriminator for Dual-Purpose Tables

evalResult: {
  evaluationType: text('evaluation_type').$type<'suite' | 'conversation'>().notNull(),
  // This clarifies which fields should be populated
}

2. Document Complex Relationships

Add comments explaining the evaluation flow:

// Test suite evaluation flow:
// evalTestSuiteRun -> evalTestSuiteRunEvaluator -> evaluator -> evalResult
// Conversation evaluation flow:  
// conversationEvaluationConfig -> conversationEvaluationConfigEvaluator -> evaluator -> evalResult

3. Clarify Optional Field Combinations

Either:

Make fields required based on use case, or
Add check constraints to ensure valid combinations, or
Document valid field combinations in comments

4. Simplify Table Names

conversationEvaluationConfigEvaluator → conv_eval_evaluators
Keep names under 30 characters for better readability

5. Type the Schema Fields

Replace Record<string, unknown> with a more specific type or interface that clarifies the expected structure of evaluation schemas.

Summary

The evaluation framework's core structure is sound and the relationships are technically correct. However, the dual-purpose nature of evalResult and the two parallel evaluation paths could benefit from clearer distinction. The schema would be more maintainable with:

Clearer separation or documentation of the two evaluation workflows
More explicit handling of optional field combinations
Simplified naming for better readability
Better typing for JSON schema fields

The relationships themselves are well-defined through proper foreign keys and the Drizzle relations are correctly configured.

Oct 28 '25 23:10 claude[bot]