Simplify Progress Tracking System - Remove Over-Engineering
Progress Tracking System Simplification
Problem Statement
The current progress tracking system is overly complex. A recent investigation into a simple off-by-one error required tracing through 6+ abstraction layers and multiple counter variables, which makes maintenance and debugging far harder than the underlying problems warrant.
Current System Issues
1. Too Many Abstraction Layers
Data flows through at least 6 layers:
Crawling Strategy → Progress Callback → ProgressMapper → ProgressTracker → API → Frontend
2. Multiple Sources of Truth
Found 4+ different counter variables in the recursive strategy alone:
- total_processed
- current_idx
- depth_successful
- total_discovered
3. Complex Progress Mapping
The ProgressMapper class maps stage-specific progress (0-100%) to overall progress ranges, adding complexity without clear user benefit.
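For context, a range-mapping scheme of this kind typically looks like the sketch below. This is an illustrative reconstruction, not the actual ProgressMapper code: the stage names, the ranges, and the `mapToOverall` function are all assumptions made for the example.

```typescript
// Hypothetical stages and ranges; the real ProgressMapper values may differ.
type Stage = 'crawling' | 'processing' | 'storing';

const STAGE_RANGES: Record<Stage, [number, number]> = {
  crawling: [0, 50],
  processing: [50, 80],
  storing: [80, 100],
};

// Map a stage-local percentage (0-100) into that stage's slice of overall progress.
function mapToOverall(stage: Stage, stagePercent: number): number {
  const [start, end] = STAGE_RANGES[stage];
  const clamped = Math.min(100, Math.max(0, stagePercent));
  return Math.round(start + ((end - start) * clamped) / 100);
}
```

With a mapping like this, every displayed percentage is one extra transformation away from the counter that produced it, which adds one more place to check when a number looks wrong.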
4. Distributed Counter Management
Counters are incremented in different places, progress calculated in others, and displayed values come from yet another location.
5. Inconsistent Naming
pages_crawled vs processed_pages vs current_idx - same concept, different names.
Why It's Hard to Debug
- No Single Source of Truth - Values can diverge at any layer
- Complex State Flow - Progress state travels through multiple transformations
- Hidden Dependencies - Frontend depends on specific field names from backend
- Mixed Concerns - Overall progress, stage progress, and counter values all mixed together
Recommended Solution: Radical Simplification
Single Progress Object Pattern
Create one unified progress object that contains all necessary tracking information:
interface CrawlProgress {
  // Core counters
  pagesProcessed: number;
  totalPages: number;
  documentsCreated: number;
  codeExamplesFound: number;
  errors: number;

  // Strategy tracking (as requested)
  crawlType: 'normal' | 'sitemap' | 'llms-txt' | 'text_file';
  strategy: 'single' | 'batch' | 'recursive' | 'sitemap';

  // Current state
  status: string;
  currentUrl?: string;
  currentBatch?: number;
  totalBatches?: number;

  // Timestamps
  startedAt: string;
  lastUpdated: string;
}
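A minimal sketch of how such an object might be created at crawl start. The `createCrawlProgress` helper, the `'starting'` status value, and the use of ISO 8601 strings for the timestamps are assumptions for illustration; the interface is repeated only so the example is self-contained.

```typescript
// Repeated from the proposal so this sketch compiles on its own.
interface CrawlProgress {
  pagesProcessed: number;
  totalPages: number;
  documentsCreated: number;
  codeExamplesFound: number;
  errors: number;
  crawlType: 'normal' | 'sitemap' | 'llms-txt' | 'text_file';
  strategy: 'single' | 'batch' | 'recursive' | 'sitemap';
  status: string;
  currentUrl?: string;
  currentBatch?: number;
  totalBatches?: number;
  startedAt: string;
  lastUpdated: string;
}

// Hypothetical factory: all counters start at zero, both timestamps share one clock reading.
function createCrawlProgress(
  crawlType: CrawlProgress['crawlType'],
  strategy: CrawlProgress['strategy'],
  totalPages: number,
  now: Date = new Date(),
): CrawlProgress {
  const timestamp = now.toISOString();
  return {
    pagesProcessed: 0,
    totalPages,
    documentsCreated: 0,
    codeExamplesFound: 0,
    errors: 0,
    crawlType,
    strategy,
    status: 'starting', // assumed initial status; the proposal leaves status as a free string
    startedAt: timestamp,
    lastUpdated: timestamp,
  };
}
```

Passing `now` as a parameter is a design choice made here to keep the factory deterministic in tests.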
Implementation Approach
- Single Progress Object - Created at crawl start, passed everywhere
- One Place for Updates - All counter increments happen in one location
- Direct API Serialization - No transformations or mappings
- Consistent Naming - One name per concept throughout the system
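The four points above could be sketched roughly as follows. The `ProgressEvent` type and the `applyEvent` and `toApiPayload` functions are hypothetical names introduced for this example, and the choice not to count a failed page as processed is an assumption, not something the current system specifies.

```typescript
// Trimmed copy of CrawlProgress: only the fields this sketch touches.
interface CrawlProgress {
  pagesProcessed: number;
  totalPages: number;
  documentsCreated: number;
  errors: number;
  status: string;
  currentUrl?: string;
  lastUpdated: string;
}

// Hypothetical event type: the only way callers can change the counters.
type ProgressEvent =
  | { kind: 'pageProcessed'; url: string; documents: number }
  | { kind: 'pageFailed'; url: string }
  | { kind: 'statusChanged'; status: string };

// The single location where counters are incremented. Returns a new object
// and leaves the input untouched.
function applyEvent(
  progress: CrawlProgress,
  event: ProgressEvent,
  now: Date = new Date(),
): CrawlProgress {
  const next: CrawlProgress = { ...progress, lastUpdated: now.toISOString() };
  switch (event.kind) {
    case 'pageProcessed':
      next.pagesProcessed += 1;
      next.documentsCreated += event.documents;
      next.currentUrl = event.url;
      break;
    case 'pageFailed':
      // Assumption: failures count as errors only, not as processed pages.
      next.errors += 1;
      next.currentUrl = event.url;
      break;
    case 'statusChanged':
      next.status = event.status;
      break;
  }
  return next;
}

// Direct serialization: the API payload is the object itself, with no field renaming.
function toApiPayload(progress: CrawlProgress): string {
  return JSON.stringify(progress);
}
```

Because strategies only emit events, a wrong counter value can be traced to one function rather than to several increment sites.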
Benefits
- ✅ Much easier to debug - One object to inspect instead of 6+ layers
- ✅ No counter discrepancies - All values come from same object
- ✅ Consistent naming - Same field names everywhere
- ✅ Still tracks granular info - Crawl type, strategy, batches, etc.
- ✅ Simpler maintenance - Less code, fewer bugs
- ✅ Better performance - No complex mapping calculations
What We Keep (Granular Tracking)
- Crawl type detection (sitemap, llms.txt, normal)
- Strategy identification (batch, recursive, single-page)
- Detailed counters (pages, documents, code examples)
- Batch processing metrics when applicable
- Error tracking and current status
What We Remove
- ProgressMapper class and stage mapping complexity
- Multiple counter variables for the same concept
- Complex callback chains and transformations
- Inconsistent field naming across layers
Implementation Priority
High Priority - This complexity is actively hindering development and debugging. Every progress-related investigation becomes a multi-hour archaeology expedition through abstraction layers.
Acceptance Criteria
- [ ] Single progress object used throughout system
- [ ] All counters incremented in one location only
- [ ] Consistent field naming (no more pages_crawled vs processed_pages)
- [ ] Direct JSON serialization to API (no ProgressMapper)
- [ ] Granular tracking maintained (crawl type, strategy, batches)
- [ ] Existing functionality preserved for end users
- [ ] Debug time for progress issues reduced from hours to minutes
Related Issues
This addresses the root cause of issues like the recent off-by-one error where pages_crawled was always one higher than the progress message count. With a single progress object, the counter and the message both read the same value, so this class of discrepancy can no longer occur.
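As a rough illustration of why the discrepancy goes away, the sketch below derives both the message and the percentage from the same field. The `progressMessage` and `progressPercent` helpers and the message wording are hypothetical, not existing code.

```typescript
// Trimmed to the two counters involved in the off-by-one report.
interface CrawlProgress {
  pagesProcessed: number;
  totalPages: number;
}

// The message reads the same field the API exposes, so the two cannot drift apart.
function progressMessage(progress: CrawlProgress): string {
  return `Processed ${progress.pagesProcessed}/${progress.totalPages} pages`;
}

// Overall percentage computed directly from the counters, with no stage mapping.
function progressPercent(progress: CrawlProgress): number {
  if (progress.totalPages <= 0) return 0; // total not known yet
  return Math.min(100, Math.round((progress.pagesProcessed / progress.totalPages) * 100));
}
```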
Assessment: The current system suffers from premature generalization and over-abstraction. It tries to handle every possible progress scenario instead of focusing on the 80% use case. A simple, unified progress object would be far easier to maintain while preserving all necessary functionality.