schema for evals

Open shagun-singh-inkeep opened this issue 2 months ago • 3 comments

shagun-singh-inkeep avatar Nov 10 '25 22:11 shagun-singh-inkeep

🦋 Changeset detected

Latest commit: b3ceb343b7230e169c6ac0e0c686295739962cce

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 10 packages
Name Type
@inkeep/agents-manage-ui Patch
@inkeep/agents-manage-api Patch
@inkeep/agents-run-api Patch
@inkeep/agents-eval-api Patch
@inkeep/agents-cli Patch
@inkeep/agents-core Patch
@inkeep/agents-mcp Patch
@inkeep/agents-sdk Patch
@inkeep/ai-sdk-provider Patch
@inkeep/create-agents Patch

changeset-bot[bot] avatar Nov 10 '25 22:11 changeset-bot[bot]

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| agents-manage-api | Ready | Preview | Comment | Nov 18, 2025 3:23pm |
| agents-manage-ui | Error | | | Nov 18, 2025 3:23pm |
| agents-run-api | Ready | Preview | Comment | Nov 18, 2025 3:23pm |

1 Skipped Deployment

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| agents-docs | Skipped | | | Nov 18, 2025 3:23pm |

vercel[bot] avatar Nov 10 '25 22:11 vercel[bot]

Claude finished @shagun-singh-inkeep's task


Code Review: Schema for Evals

Review Tasks:

  • [x] Read repository documentation and understand codebase patterns
  • [x] Analyze the schema changes in detail
  • [x] Review existing schema patterns and related code
  • [x] Evaluate engineering design and architecture considerations
  • [x] Check for security, performance, and scalability concerns
  • [x] Provide comprehensive feedback and recommendations

Summary

This PR introduces a comprehensive evaluation system schema for the agents platform. The changes add 8 new tables and their corresponding relations to support dataset management, evaluator configuration, suite-based evaluations, and result tracking.

Key Architecture Review

Strengths

  1. Comprehensive Design: The schema provides a complete evaluation workflow from datasets → runs → evaluations → results
  2. Proper Scoping: Follows existing tenant/project scoping patterns consistently
  3. Flexible Evaluation: Supports both dataset-based and historical conversation evaluation
  4. Rich Metadata: Good use of JSON columns for flexible configuration and structured output
  5. Foreign Key Integrity: Proper cascade deletes and referential integrity throughout

⚠️ Areas for Consideration

1. Schema Design & Normalization

Missing Indexes: Several high-query columns lack indexes:

// Recommended additions (each in its own table's index definitions):
index('eval_result_suite_run_idx').on(table.suiteRunId)
index('eval_result_evaluator_idx').on(table.evaluatorId)
index('dataset_run_agent_idx').on(table.agentId)
index('eval_suite_config_dataset_run_idx').on(table.datasetRunId)

Primary Key Inconsistency: Most tables use composite PKs [tenantId, projectId, id] but some eval tables use simple id PKs. This breaks the established pattern:

  • datasetRun, datasetItem, evalSuiteRun, evalResult should use composite PKs for consistency
  • If intentional, document the reasoning

2. Data Model Concerns

Status Field Type Safety (packages/agents-core/src/db/schema.ts:971,1040):

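// Current: the allowed values exist only as a TypeScript annotation
status: text('status').$type<'done'|'failed'>().notNull()

// Better: declare the full value set on the column. Note that the `enum` option on a
// text column only narrows the inferred type; add a check constraint or a database
// enum if the values must be enforced at the database level.
status: text('status', { enum: ['pending', 'running', 'done', 'failed'] }).notNull()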

Missing Status: There are no 'pending' or 'running' states, so a long-running operation that is still in progress cannot be represented.
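The fuller status set can be pictured as a small state machine. The following is a minimal, dependency-free sketch rather than the PR's code: `RunStatus`, `isRunStatus`, and `canTransition` are hypothetical names, and the allowed transitions are an assumption about how a long-running run would progress.

```typescript
// Hypothetical status set for long-running eval runs (assumed, not taken from the PR).
const RUN_STATUSES = ['pending', 'running', 'done', 'failed'] as const;
type RunStatus = (typeof RUN_STATUSES)[number];

// Runtime guard: narrows an untrusted string (e.g. a value read from a row) to RunStatus.
function isRunStatus(value: string): value is RunStatus {
  return (RUN_STATUSES as readonly string[]).indexOf(value) !== -1;
}

// Assumed lifecycle: pending -> running -> done | failed. Terminal states never change.
const ALLOWED_TRANSITIONS: Record<RunStatus, readonly RunStatus[]> = {
  pending: ['running', 'failed'],
  running: ['done', 'failed'],
  done: [],
  failed: [],
};

// True only if moving from `from` to `to` is permitted by the assumed lifecycle.
function canTransition(from: RunStatus, to: RunStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}
```

A guard like this complements a database-level constraint: the constraint rejects bad writes, while the guard keeps application code honest about values it reads back.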

Dataset Item Foreign Key Issue (packages/agents-core/src/db/schema.ts:1060-1063):

foreignKey({
  columns: [table.datasetItemId], 
  foreignColumns: [datasetItem.id],
  name: 'eval_result_dataset_item_fk',
}).onDelete('cascade')

The datasetItemId column should be nullable, since an evalResult can evaluate a historical conversation that has no dataset item.
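The same requirement can be expressed at the type level, so callers must handle the historical case explicitly. This is a sketch under assumptions: `EvalResultRow`, `EvalSource`, and `sourceOf` are hypothetical names, and only `datasetItemId` comes from the review.

```typescript
// Hypothetical row shape; only `datasetItemId` is named in the review.
interface EvalResultRow {
  id: string;
  datasetItemId: string | null; // null when the result evaluates a historical conversation
}

// Discriminated union that forces callers to branch on where the result came from.
type EvalSource =
  | { kind: 'dataset'; datasetItemId: string }
  | { kind: 'historical' };

// Classify a row by whether it references a dataset item.
function sourceOf(row: EvalResultRow): EvalSource {
  return row.datasetItemId === null
    ? { kind: 'historical' }
    : { kind: 'dataset', datasetItemId: row.datasetItemId };
}
```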

3. Performance & Scalability

Large JSON Storage: simulationConfig and the evaluation schemas are stored as JSON blobs, which may hurt query performance when filtering on their contents. Consider extracting frequently queried fields into dedicated columns.

Sampling Configuration: Storing sampleRate as a real (floating-point) column may lead to precision issues. Consider storing it as an integer percentage (0-100).
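To illustrate the integer-percentage idea, here is a minimal sketch. `toSamplePercent` and `shouldSample` are hypothetical helpers, and the notion of a 0..99 bucket (for example, derived from a hash of a conversation id) is an assumption, not something the PR defines.

```typescript
// Convert a fractional rate (e.g. 0.1) to a whole-number percentage,
// rejecting values outside [0, 1]. Rounding absorbs float noise such as 0.29 * 100.
function toSamplePercent(rate: number): number {
  if (!isFinite(rate) || rate < 0 || rate > 1) {
    throw new RangeError('sample rate must be in [0, 1], got ' + rate);
  }
  return Math.round(rate * 100);
}

// Decide whether to sample, given a bucket in 0..99. Integer comparison only,
// so the decision is exact and reproducible for a given bucket.
function shouldSample(bucket: number, samplePercent: number): boolean {
  return bucket < samplePercent;
}
```

The trade-off is granularity: whole percentages cannot express rates such as 0.5%. If finer control matters, the same approach works with a larger integer scale.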

Missing Pagination Indexes: Tables like evalResult will grow large but lack indexes on the columns used to order and filter pages, so paginated reads will get slower as the tables grow.
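Pagination over a growing table is commonly done with a keyset (cursor) approach, which an index on the sort columns can serve directly. The in-memory sketch below only mirrors the shape of such a query: `Row`, `Cursor`, `createdAt`, and `pageAfter` are hypothetical, and the review does not say how evalResult pages are actually ordered.

```typescript
interface Cursor { createdAt: number; id: string }
interface Row extends Cursor { /* other columns omitted */ }

// Total order on (createdAt, id); id breaks ties so a cursor is unambiguous.
function compareKeys(a: Cursor, b: Cursor): number {
  if (a.createdAt !== b.createdAt) return a.createdAt - b.createdAt;
  return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
}

// Return up to `limit` rows strictly after `cursor`, in (createdAt, id) order.
// A database would answer this with an index on the same two columns
// instead of sorting the whole table.
function pageAfter(rows: Row[], cursor: Cursor | null, limit: number): Row[] {
  return rows
    .slice() // copy so the caller's array is not reordered
    .sort(compareKeys)
    .filter(function (row) { return cursor === null || compareKeys(row, cursor) > 0; })
    .slice(0, limit);
}
```

The next page is requested by passing the last row of the previous page as the cursor, which avoids the growing cost of large offsets.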

4. API & Type Safety

Schema Evolution: JSON schemas in evaluator.schema have no versioning strategy. Consider:

  • Adding schemaVersion field
  • Migration strategy for schema changes
  • Validation of stored vs expected schemas
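A versioning strategy along these lines could look like the sketch below. Everything in it is an assumption made for illustration: the `StoredEvaluatorSchema` shape, the version numbers, and the single example migration are hypothetical and do not describe the PR's actual schema.

```typescript
// Hypothetical stored shape: the JSON schema plus an explicit version number.
interface StoredEvaluatorSchema {
  schemaVersion: number;
  schema: Record<string, unknown>;
}

const CURRENT_SCHEMA_VERSION = 2;

type Migration = (schema: Record<string, unknown>) => Record<string, unknown>;

// One migration per version step: key N upgrades a payload from version N to N + 1.
const MIGRATIONS: Record<number, Migration> = {
  // Assumed example change: version 2 expects an explicit root `type`.
  1: function (schema) { return { type: 'object', ...schema }; },
};

// Upgrade a stored payload step by step until it reaches the current version.
function upgrade(stored: StoredEvaluatorSchema): StoredEvaluatorSchema {
  let schemaVersion = stored.schemaVersion;
  let schema = stored.schema;
  if (schemaVersion > CURRENT_SCHEMA_VERSION) {
    throw new Error('stored schema version ' + schemaVersion + ' is newer than supported');
  }
  while (schemaVersion < CURRENT_SCHEMA_VERSION) {
    const migrate = MIGRATIONS[schemaVersion];
    if (!migrate) throw new Error('no migration from schema version ' + schemaVersion);
    schema = migrate(schema);
    schemaVersion += 1;
  }
  return { schemaVersion: schemaVersion, schema: schema };
}
```

Running `upgrade` on read keeps old rows usable without a bulk rewrite, at the cost of carrying every migration step in code.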

Type Imports: The schema imports ModelSettings from validation schemas, creating a circular dependency risk if validation schemas import from DB schemas.

🔧 Specific Recommendations

  1. Add Missing Indexes for query performance
  2. Standardize Primary Keys across all tables
  3. Enhance Status Enums with proper type constraints
  4. Consider Schema Versioning for JSON configurations
  5. Add Documentation explaining the evaluation workflow and table relationships
  6. Fix Foreign Key Nullability for datasetItemId in evalResult

📋 Changeset Bump Level

The changeset bot reports a changeset that releases all 10 packages as Patch. Since this PR adds new database schema tables, the bump should be at least minor.

🏁 Conclusion

This is a solid foundation for an evaluation system with good architectural patterns. The main concerns are around consistency with existing patterns, performance optimization, and type safety. The suggested improvements would make this production-ready and maintainable long-term.

The evaluation workflow design is well thought out and the schema supports flexible use cases while maintaining data integrity.


claude[bot] avatar Nov 10 '25 22:11 claude[bot]