
Simplify Progress Tracking System - Remove Over-Engineering

Wirasm opened this issue 3 months ago · 0 comments

Progress Tracking System Simplification

Problem Statement

The current progress tracking system is overly complex and difficult to debug. A recent investigation into a simple off-by-one error required tracing through 6+ abstraction layers and multiple counter variables. This complexity makes maintenance and debugging extremely difficult.

Current System Issues

1. Too Many Abstraction Layers

Data flows through at least 6 layers:

Crawling Strategy → Progress Callback → ProgressMapper → ProgressTracker → API → Frontend
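
To make the hand-off problem concrete, here is a minimal TypeScript sketch of that chain. Only the ProgressMapper and ProgressTracker names come from this issue; every other name, signature, and number is a hypothetical illustration, not actual Archon code.

// Hypothetical sketch of the layered hand-off described above.
type ProgressCallback = (stagePercent: number, pagesCrawled: number) => void;

class RecursiveStrategySketch {
  private totalProcessed = 0;
  constructor(private onProgress: ProgressCallback) {}
  crawlPage(_url: string): void {
    this.totalProcessed += 1;
    // Report a stage-local percentage (assume 10 expected pages for this stage).
    this.onProgress((this.totalProcessed / 10) * 100, this.totalProcessed);
  }
}

class ProgressMapperSketch {
  constructor(private rangeStart: number, private rangeEnd: number) {}
  // Rescale a stage-local 0-100% value into this stage's slice of overall progress.
  map(stagePercent: number): number {
    return this.rangeStart + (stagePercent / 100) * (this.rangeEnd - this.rangeStart);
  }
}

class ProgressTrackerSketch {
  overallPercent = 0;
  processedPages = 0; // a second copy of the strategy's counter; it can silently drift
  update(overallPercent: number, pagesCrawled: number): void {
    this.overallPercent = overallPercent;
    this.processedPages = pagesCrawled;
  }
}

// Wiring the chain: strategy -> callback -> mapper -> tracker -> (API -> frontend).
const mapper = new ProgressMapperSketch(20, 60); // e.g. crawling owns the 20-60% band
const tracker = new ProgressTrackerSketch();
const strategy = new RecursiveStrategySketch((stagePercent, pages) =>
  tracker.update(mapper.map(stagePercent), pages)
);
strategy.crawlPage("https://example.com");
console.log(tracker); // the value the frontend sees has already crossed four hand-offs

Each hand-off is a point where the strategy's counter and the value the frontend displays can drift apart, which is exactly what an off-by-one investigation has to trace.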

2. Multiple Sources of Truth

The recursive strategy alone uses at least four different counter variables:

  • total_processed
  • current_idx
  • depth_successful
  • total_discovered

3. Complex Progress Mapping

The ProgressMapper class maps stage-specific progress (0-100%) to overall progress ranges, adding complexity without clear user benefit.

4. Distributed Counter Management

Counters are incremented in some places, progress is calculated in others, and displayed values come from yet another location.

5. Inconsistent Naming

pages_crawled vs processed_pages vs current_idx - same concept, different names.

Why It's Hard to Debug

  1. No Single Source of Truth - Values can diverge at any layer
  2. Complex State Flow - Progress state travels through multiple transformations
  3. Hidden Dependencies - Frontend depends on specific field names from backend
  4. Mixed Concerns - Overall progress, stage progress, and counter values all mixed together

Recommended Solution: Radical Simplification

Single Progress Object Pattern

Create one unified progress object that contains all necessary tracking information:

interface CrawlProgress {
  // Core counters
  pagesProcessed: number;
  totalPages: number;
  documentsCreated: number;
  codeExamplesFound: number;
  errors: number;
  
  // Strategy tracking (as requested)
  crawlType: 'normal' | 'sitemap' | 'llms-txt' | 'text_file';
  strategy: 'single' | 'batch' | 'recursive' | 'sitemap';
  
  // Current state
  status: string;
  currentUrl?: string;
  currentBatch?: number;
  totalBatches?: number;
  
  // Timestamps
  startedAt: string;
  lastUpdated: string;
}

Implementation Approach

  1. Single Progress Object - Created at crawl start, passed everywhere
  2. One Place for Updates - All counter increments happen in one location
  3. Direct API Serialization - No transformations or mappings
  4. Consistent Naming - One name per concept throughout the system
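
As a rough sketch of this approach, the class below wraps the CrawlProgress interface defined above. The CrawlProgressState name and its methods are assumptions for illustration, not existing Archon code.

// Minimal sketch: one object, one update path, direct serialization.
class CrawlProgressState {
  private progress: CrawlProgress;

  constructor(crawlType: CrawlProgress['crawlType'], strategy: CrawlProgress['strategy']) {
    const now = new Date().toISOString();
    this.progress = {
      pagesProcessed: 0,
      totalPages: 0,
      documentsCreated: 0,
      codeExamplesFound: 0,
      errors: 0,
      crawlType,
      strategy,
      status: 'starting',
      startedAt: now,
      lastUpdated: now,
    };
  }

  // The ONLY place state changes; every caller goes through here.
  update(patch: Partial<CrawlProgress>): void {
    this.progress = { ...this.progress, ...patch, lastUpdated: new Date().toISOString() };
  }

  // Counter increments funnel through the same single update path.
  incrementPages(): void {
    this.update({ pagesProcessed: this.progress.pagesProcessed + 1 });
  }

  // Direct serialization: the API and frontend see exactly this object.
  toJSON(): CrawlProgress {
    return this.progress;
  }
}

// Usage: created at crawl start, passed everywhere, serialized as-is.
const state = new CrawlProgressState('sitemap', 'batch');
state.update({ status: 'crawling', currentUrl: 'https://example.com/docs' });
state.incrementPages();
console.log(JSON.stringify(state)); // what the API returns, with no mapping layer

Because there is only one object and one update path, the number shown to the user is always the number the crawler incremented.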

Benefits

  • 10x easier to debug - Single source of truth
  • No counter discrepancies - All values come from same object
  • Consistent naming - Same field names everywhere
  • Still tracks granular info - Crawl type, strategy, batches, etc.
  • Simpler maintenance - Less code, fewer bugs
  • Better performance - No complex mapping calculations

What We Keep (Granular Tracking)

  • Crawl type detection (sitemap, llms.txt, normal)
  • Strategy identification (batch, recursive, single-page)
  • Detailed counters (pages, documents, code examples)
  • Batch processing metrics when applicable
  • Error tracking and current status

What We Remove

  • ProgressMapper class and stage mapping complexity
  • Multiple counter variables for same concept
  • Complex callback chains and transformations
  • Inconsistent field naming across layers

Implementation Priority

High Priority - This complexity is actively hindering development and debugging. Every progress-related investigation becomes a multi-hour archaeology expedition through abstraction layers.

Acceptance Criteria

  • [ ] Single progress object used throughout system
  • [ ] All counters incremented in one location only
  • [ ] Consistent field naming (no more pages_crawled vs processed_pages)
  • [ ] Direct JSON serialization to API (no ProgressMapper)
  • [ ] Granular tracking maintained (crawl type, strategy, batches)
  • [ ] Existing functionality preserved for end users
  • [ ] Debug time for progress issues reduced from hours to minutes

Related Issues

This addresses the root cause of issues like the recent off-by-one error where pages_crawled was always one higher than the progress message count. With a single progress object, such discrepancies become impossible.


Assessment: The current system suffers from premature optimization and over-abstraction. It tries to handle every possible progress scenario instead of focusing on the 80% use case. A simple, unified progress object would be vastly easier to maintain while preserving all necessary functionality.

Wirasm · Sep 11 '25, 09:09