🐛 [Bug]: Archon Crawling System Issues
Archon Version
Stable: Latest from September 29
Bug Severity
🔴 Critical - App unusable
Bug Description
MAIN BUG: MASSIVE DUPLICATE DATA CREATION AND REACT FRONTEND CRASHES
PROBLEM SUMMARY:
The Archon system has a critical bug: interrupted crawling operations create massive numbers of duplicate entries in the database (more than 35,000 duplicates found), which in turn cause React "dispatcher is null" errors that completely break the frontend UI.
WHEN THIS BUG OCCURS:
WiFi/Network Disconnection During Crawling:
- When crawling runs overnight and the WiFi drops
- When the internet connection is interrupted mid-crawl
- When the user manually stops the crawling process
Crawl Restart/Retry Scenarios:
- When the user hits the "STOP" button during active crawling
- When crawling is interrupted for any reason (power, network, manual stop)
- When the user attempts to crawl the same URLs again
- When the system tries to resume interrupted operations
Database Corruption Points:
- Each interrupted crawl creates hundreds to thousands of duplicate entries
- Multiple entries share identical source_id and created_at timestamps
- The frontend tries to render all the duplicates, causing React hook errors
CURRENT BEHAVIOR:
- Before fix: 88,303 total entries → 40.7% were duplicates! (A measurement query follows this list.)
- React frontend crashes with "Invalid hook call" and "dispatcher is null" errors
- Frontend becomes completely unusable
- Massive database bloat with duplicate content
- System unable to start properly due to port conflicts and React errors
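To quantify the duplication, a query like the following (reusing the table and container names from the cleanup command further below) compares total rows to distinct URLs:
# Count total rows vs. distinct URLs to measure duplication
docker exec -it supabase_db_supabase-local psql -U postgres -d postgres -c "
SELECT COUNT(*) AS total_entries,
       COUNT(*) - COUNT(DISTINCT url) AS duplicate_entries
FROM archon_crawled_pages;"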
REQUIRED FEATURES TO FIX:
1. RESUMABLE CRAWLING SYSTEM (a sketch follows this list):
- A "Continue from where it stopped" button
- Progress tracking: save crawl progress to the database
- Checkpoint system: save state every X pages/chunks
- Smart resume: detect incomplete crawls and offer to continue
2. DUPLICATE PREVENTION (a second sketch follows this list):
- Atomic transactions: roll back incomplete crawls
- Unique constraints: prevent duplicate URL/source_id combinations
- Smart deduplication: check for existing content before inserting
3. CRITICAL DATABASE CLEANUP COMMAND:
This command should run automatically on startup/shutdown to maintain database health:
# Add this to startup scripts for automatic optimization
docker exec -it supabase_db_supabase-local psql -U postgres -d postgres -c "
DELETE FROM archon_crawled_pages
WHERE id NOT IN (
SELECT MIN(id)
FROM archon_crawled_pages
GROUP BY url
);"
2ND PROBLEM: DOCKER CONTAINER PORT CONFLICTS
ISSUE:
- Old Docker containers don't shut down properly
- Ports (3737, 8051, 8181) remain allocated to zombie processes
- New containers can't start due to "port already allocated" errors
- Creates cascading failures across the entire system
- System becomes completely unable to start until manual cleanup
SOLUTION:
- Graceful shutdown handling for all Docker services
- Port conflict detection and cleanup on startup (see the sketch after this list)
- Health checks to ensure containers are properly terminated
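A possible cleanup sketch along those lines; these are standard ss/docker commands, though whether Archon's compose file is the right teardown entry point is an assumption:
# See which processes still hold the Archon ports
sudo ss -ltnp | grep -E ':3737|:8051|:8181'
# Tear the stack down, including orphaned containers, then remove stopped leftovers
docker compose down --remove-orphans
docker ps -a --filter status=exited -q | xargs -r docker rm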
3RD PROBLEM: MULTIPLE REACT INSTANCES
ISSUE:
- Docker frontend image contains multiple React copies
- Causes "Invalid hook call" and "dispatcher is null" errors
- Frontend becomes completely unusable
- System unable to start properly due to React dependency conflicts
SOLUTION:
- Proper dependency management in the Docker build (see the check after this list)
- Single React instance enforcement
- Clean npm install during Docker build process
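A quick way to confirm the diagnosis inside the built image (the container name is a placeholder):
# More than one resolved copy of react in the tree confirms the problem
docker exec -it <frontend-container> npm ls react react-dom
# In the Docker build, installing strictly from the lockfile helps avoid drifting duplicates
npm ci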
IMMEDIATE ACTION REQUIRED:
- Implement resumable crawling to prevent user frustration
- Add automatic deduplication on startup/shutdown
- Fix React error handling to gracefully handle duplicate data
- Add progress tracking to prevent complete restarts
- Implement proper transaction handling to prevent partial data writes (see the sketch after this list)
- Fix Docker build process to prevent multiple React instances
- Add automatic port conflict resolution to ensure system can always start
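On the transaction-handling item above: Postgres discards any transaction left open when the client connection drops, so wrapping each crawl batch in BEGIN/COMMIT would mean an overnight WiFi drop leaves no partial rows behind. A minimal sketch, with a column list that is an assumption about the schema:
# If the connection dies before COMMIT, the whole batch rolls back automatically
docker exec -it supabase_db_supabase-local psql -U postgres -d postgres -c "
BEGIN;
INSERT INTO archon_crawled_pages (source_id, url)
VALUES ('example-source', 'https://example.com/page-1')
ON CONFLICT DO NOTHING;
COMMIT;"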
This is a CRITICAL bug that makes the entire system unusable after any crawling interruption. Users lose hours of crawling progress and face complete UI crashes. The system can become completely unable to start due to port conflicts and React dependency issues.
Steps to Reproduce
Explained above.
Expected Behavior
Explained above.
Actual Behavior
Explained above.
Error Details (if any)
Explained above.
Affected Component
🔍 Knowledge Base / RAG
Browser & OS
Linux Mint; all browsers
Additional Context
Explained above.
Service Status (check all that are working)
- [x] 🖥️ Frontend UI (http://localhost:3737)
- [x] ⚙️ Main Server (http://localhost:8181)
- [x] 🔗 MCP Service (localhost:8051)
- [ ] 🤖 Agents Service (http://localhost:8052)
- [ ] 💾 Supabase Database (connected)
+1 from my side: crawling starts and never completes for me. I have re-installed Archon several times, with both local and remote Supabase instances... very buggy.