weave icon indicating copy to clipboard operation
weave copied to clipboard

feat(weave): Add Database Migration for Queue-Based Call Annotation System

Open chance-wnb opened this issue 1 month ago • 4 comments

Summary

This PR introduces the database schema for a queue-based call annotation system in Weave Trace. The system enables teams to create annotation queues, manually curate calls for review, and collaborate on systematic call annotation with multi-annotator support.

https://wandb.atlassian.net/browse/WB-29392

Database Schema Overview

The migration creates three core tables that work together to support annotation workflows:

1. annotation_queues - Queue Definitions

Stores queue metadata and configuration:

  • Queue identity: UUID-based unique identifier
  • Scorers: Array of weave refs (e.g., weave:///entity/project/scorer/name:version) for multi-dimensional annotation
  • Claim timeout: Configurable duration before claimed items expire (default: 30 minutes)
  • Soft delete support: deleted_at for non-destructive queue archival

Key Features:

  • Native Array(String) for scorer refs (efficient access without JSON parsing)
  • Lightweight UPDATE support for modifying queue properties

2. annotation_queue_items - Queue Membership

Tracks which calls belong to each queue using a proxy pattern:

  • Call membership: Links queues to calls without modifying the immutable calls_merged table
  • Cached call fields: Snapshots of started_at, ended_at, op_name, trace_id at add-time for fast sorting/filtering without joins
  • Display fields: Per-batch configuration of which JSON paths to show annotators (e.g., ['input.prompt', 'output.text'])
  • Multi-queue support: Same call can appear in multiple queues with independent state

Why the Proxy Pattern?

  • ✅ Keeps annotation workflow metadata separate from core call data
  • ✅ Enables multi-queue support (same call in different queues)
  • ✅ Fast queries without expensive joins to the massive calls_merged table
  • ✅ Snapshot semantics: queue items capture call state at add-time

3. annotator_queue_items_progress - Workflow State Tracking

Manages per-annotator progress through annotation workflows:

  • Workflow states: pendingclaimedcompleted or skipped
  • Claim management: Prevents duplicate work via temporary claims with expiration
  • Denormalized fields: queue_id and call_id for efficient querying without joins
  • Multi-annotator ready: Schema supports both shared pool (MVP) and future N-annotations-per-call workflows

Claim Expiration Mechanism:

  • When annotator claims an item: claim_expires_at = now() + claim_timeout_seconds
  • Active claims prevent other annotators from working on the same item
  • Expired claims can be reclaimed by any annotator (no background jobs needed)

Index Strategy (ORDER BY Clause Rationale)

ClickHouse's ORDER BY clause determines both physical sort order and the primary key index. Our indexing strategy prioritizes direct lookups by UUID and efficient JOINs over uniqueness enforcement (which is handled at the application layer).

annotation_queues: ORDER BY (project_id, id)

Optimizes:

  • ✅ Direct queue lookups: SELECT * FROM annotation_queues WHERE id = ?
  • ✅ List queues in a project: SELECT * FROM annotation_queues WHERE project_id = ?

Rationale: Queues are accessed primarily by their UUID identifier. Project-level scoping ensures tenant isolation.

annotation_queue_items: ORDER BY (project_id, queue_id, id)

Optimizes:

  • ✅ Fast JOIN lookups: JOIN annotation_queue_items qi ON qi.id = p.queue_item_id (primary use case)
  • ✅ Direct item access: SELECT * FROM annotation_queue_items WHERE id = ?
  • ✅ List all items in queue: SELECT * FROM annotation_queue_items WHERE queue_id = ?

Why NOT ORDER BY (project_id, queue_id, call_id)?

I considered including call_id for duplicate checking optimization:

-- Duplicate check query (runs during batch add)
SELECT call_id FROM annotation_queue_items
WHERE project_id = ? AND queue_id = ? AND call_id IN (...)

However, we chose to prioritize id because:

  1. call_id doesn't provide uniqueness: The same call can appear in multiple queues (multi-queue support is a core feature)
  2. JOIN frequency >> duplicate check frequency:
    • JOINs happen on every annotation action (read display fields, check state)
    • Duplicate checks only happen during batch add operations (less frequent)
  3. Small dataset per queue: With typical queue sizes (100s-1000s of items), scanning queue granules to filter by call_id is fast enough (microseconds)
  4. Direct item API access: Many endpoints use GET /annotation_queue_items/{item_id}, which requires fast UUID lookup

Trade-off Accepted: Duplicate checking requires scanning all granules in a queue and filtering by call_id, but this is negligible for typical queue sizes.

annotator_queue_items_progress: ORDER BY (project_id, queue_id, id)

Optimizes:

  • ✅ Direct progress record access: PATCH /annotation_queue_progress/{progress_id}
  • ✅ List all progress in queue: SELECT * FROM annotator_queue_items_progress WHERE queue_id = ?

Why NOT include annotation_state in ORDER BY?

I considered ORDER BY (project_id, queue_id, annotation_state, id) to optimize:

-- Query for available work
SELECT * FROM annotator_queue_items_progress
WHERE queue_id = ? AND annotation_state = 'pending'

However, I chose NOT to include annotation_state because:

  1. Causes data reorganization on UPDATE: Including annotation_state in the sort key means every state transition (pendingclaimedcompleted) physically moves the row on disk
  2. High update frequency: Items transition through states constantly during annotation workflows
  3. Defeats lightweight UPDATE optimization: The migration specifically enables ClickHouse's lightweight UPDATE feature (enable_block_number_column), which works best when updated columns are NOT in the sort key
  4. Minimal query performance impact: With typical queue sizes (100s-1000s), scanning all queue granules and filtering by state is fast enough

ClickHouse Granule Scanning:

  • Without annotation_state in ORDER BY: Scan ~2 granules (8192 rows each) for a 10K-item queue
  • With annotation_state in ORDER BY: Scan ~0.25 granules (only 'pending' items)
  • At this scale, the 4x difference is negligible (microseconds) compared to data reorganization overhead (milliseconds + merge overhead)

Trade-off Accepted: State filtering requires scanning all queue granules, but this optimizes for the more critical update path.

Technical Details

Lightweight UPDATE Support

All three tables enable ClickHouse's lightweight UPDATE feature:

SETTINGS
    enable_block_number_column = 1,
    enable_block_offset_column = 1;

This allows efficient in-place updates for:

  • display_fields in annotation_queue_items (if needed)
  • annotation_state, claimed_at, completed_at in annotator_queue_items_progress

Note: This adds minimal overhead to SELECT queries (patch application) but dramatically improves UPDATE performance.

Soft Delete Pattern

All tables include deleted_at Nullable(DateTime64(3)) for non-destructive deletion:

  • NULL = active record
  • Non-NULL = soft deleted (timestamp of deletion)

Related Documentation

Full design document: Queue-Based Call Annotation System.md


This PR is database-only and introduces no breaking changes. The tables are new and do not modify existing schema.

chance-wnb avatar Nov 20 '25 19:11 chance-wnb

This stack of pull requests is managed by Graphite. Learn more about stacking.

chance-wnb avatar Nov 20 '25 19:11 chance-wnb

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests.

:loudspeaker: Thoughts on this report? Let us know!

codecov[bot] avatar Nov 20 '25 19:11 codecov[bot]

❌ Documentation Reference Check Failed

No documentation reference found in the PR description. Please add either:

  • A link to a docs PR (format: wandb/docs#XXX or https://github.com/wandb/docs/pull/XXX)
  • A Jira ticket reference (format: DOCS-XXX or https://wandb.atlassian.net/browse/DOCS-XXX)

This check is required for all PRs except those that start with "chore(weave)" or explicitly state "docs are not required". Please update your PR description and this check will run again automatically.

github-actions[bot] avatar Nov 22 '25 02:11 github-actions[bot]