
🐛 [Bug]: Archon Crawling System Issues

Open · elektrikmoonkey opened this issue 3 months ago · 1 comment

Archon Version

Stable: Latest from September 29

Bug Severity

🔴 Critical - App unusable

Bug Description

BUG REPORT: CRITICAL ARCHON CRAWLING SYSTEM ISSUES

BIG BUG: MASSIVE DUPLICATE DATA CREATION AND REACT FRONTEND CRASHES

PROBLEM SUMMARY:

The Archon system has a critical bug: interrupted crawling operations create massive numbers of duplicate entries in the database (over 35,000 duplicates found), which in turn trigger React "dispatcher is null" errors that completely break the frontend UI.


WHEN THIS BUG OCCURS:

  1. WiFi/Network Disconnection During Crawling:

    • When crawling runs overnight and WiFi drops
    • When internet connection is interrupted mid-crawl
    • When user manually stops the crawling process
  2. Crawl Restart/Retry Scenarios:

    • When user hits "STOP" button during active crawling
    • When crawling is interrupted for any reason (power, network, manual stop)
    • When user attempts to crawl the same URLs again
    • When system tries to resume interrupted operations
  3. Database Corruption Points:

    • Each interrupted crawl creates hundreds to thousands of duplicate entries
    • Multiple entries with identical source_id and created_at timestamps
    • Frontend tries to render all duplicates causing React hook errors
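Assuming the schema hinted at in this report (an `archon_crawled_pages` table with `url`, `source_id`, and `created_at` columns), a query along these lines can quantify the damage after an interrupted crawl; table and column names are taken from the cleanup command later in this report:

```sql
-- Count how many copies exist per URL (names assumed from this report)
SELECT url, COUNT(*) AS copies
FROM archon_crawled_pages
GROUP BY url
HAVING COUNT(*) > 1
ORDER BY copies DESC
LIMIT 20;
```

Running this via `docker exec ... psql`, as the cleanup command below does, makes it easy to check whether a crawl interruption produced duplicates before the frontend ever tries to render them.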

CURRENT BEHAVIOR:

  • Before fix: 88,303 total entries → 40.7% were duplicates!
  • React frontend crashes with "Invalid hook call" and "dispatcher is null" errors
  • Frontend becomes completely unusable
  • Massive database bloat with duplicate content
  • System unable to start properly due to port conflicts and React errors

REQUIRED FEATURES TO FIX:

1. RESUMABLE CRAWLING SYSTEM:

  • "Continue from where it stopped" button
  • Progress tracking - save crawl progress to database
  • Checkpoint system - save state every X pages/chunks
  • Smart resume - detect incomplete crawls and offer to continue
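One way to back the checkpoint and smart-resume features is a small progress table written to on every checkpoint. This is only a sketch; the table and column names here are hypothetical, not part of Archon's current schema:

```sql
-- Hypothetical checkpoint table: one row per crawl run
CREATE TABLE IF NOT EXISTS archon_crawl_progress (
    source_id   text PRIMARY KEY,
    last_url    text,                               -- last page fully persisted
    pages_done  integer NOT NULL DEFAULT 0,
    status      text NOT NULL DEFAULT 'running',    -- 'running' | 'interrupted' | 'complete'
    updated_at  timestamptz NOT NULL DEFAULT now()
);
```

On startup, any row still marked `'running'` with a stale `updated_at` indicates an interrupted crawl, which the UI could surface as a "Continue from where it stopped" offer.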

2. DUPLICATE PREVENTION:

  • Atomic transactions - rollback incomplete crawls
  • Unique constraints - prevent duplicate URL/source_id combinations
  • Smart deduplication - check for existing content before inserting
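The unique-constraint and smart-deduplication points can be combined at the database level. A sketch, again assuming the `archon_crawled_pages` schema from this report; note the constraint can only be added after existing duplicates are removed (e.g. with the cleanup command below):

```sql
-- Enforce uniqueness per URL/source combination (assumed schema)
ALTER TABLE archon_crawled_pages
    ADD CONSTRAINT archon_crawled_pages_url_source_key UNIQUE (url, source_id);

-- Re-crawls and resumed crawls then become idempotent:
INSERT INTO archon_crawled_pages (url, source_id, content)
VALUES ('https://example.com/page', 'example-source', '...')
ON CONFLICT (url, source_id) DO NOTHING;
```

With the constraint in place, an interrupted-then-restarted crawl can simply re-insert everything and let Postgres silently skip rows that already exist, instead of creating the duplicates described above.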

3. CRITICAL DATABASE CLEANUP COMMAND:

This command should run automatically on startup/shutdown to maintain database health:

# Add this to startup scripts for automatic optimization
docker exec -it supabase_db_supabase-local psql -U postgres -d postgres -c "
DELETE FROM archon_crawled_pages 
WHERE id NOT IN (
    SELECT MIN(id)
    FROM archon_crawled_pages
    GROUP BY url
);"
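On a table of this size (88,000+ rows) the `NOT IN (subquery)` form can be slow. An equivalent self-join delete over the same assumed table, keeping the lowest `id` per `url`, is usually faster on Postgres:

```sql
-- Same effect as the command above: keep the earliest row per URL
DELETE FROM archon_crawled_pages a
USING archon_crawled_pages b
WHERE a.url = b.url
  AND a.id > b.id;
```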

2ND PROBLEM: DOCKER CONTAINER PORT CONFLICTS

ISSUE:

  • Old Docker containers don't shut down properly
  • Ports (3737, 8051, 8181) remain allocated to zombie processes
  • New containers can't start due to "port already allocated" errors
  • Creates cascading failures across the entire system
  • System becomes completely unable to start until manual cleanup

SOLUTION:

  • Graceful shutdown handling for all Docker services
  • Port conflict detection and cleanup on startup
  • Health checks to ensure containers are properly terminated
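The port-conflict check can be sketched in bash. The port list matches the ones named above; the docker cleanup line is left commented out because the right filter depends on your compose setup, so treat it as an assumption to adapt:

```shell
#!/usr/bin/env bash
# Sketch of a pre-start port check (bash-only: uses /dev/tcp).

port_in_use() {
  # exit status 0 if something accepts connections on 127.0.0.1:$1
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

for port in 3737 8051 8181; do
  if port_in_use "$port"; then
    echo "port $port busy - clean up the stale container"
    # docker ps -q --filter "publish=$port" | xargs -r docker rm -f
  else
    echo "port $port free"
  fi
done
```

Run before `docker compose up`, this turns the "port already allocated" cascade into a single explicit cleanup step instead of a failed start.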

3RD PROBLEM: MULTIPLE REACT INSTANCES

ISSUE:

  • Docker frontend image contains multiple React copies
  • Causes "Invalid hook call" and "dispatcher is null" errors
  • Frontend becomes completely unusable
  • System unable to start properly due to React dependency conflicts

SOLUTION:

  • Proper dependency management in Docker build
  • Single React instance enforcement
  • Clean npm install during Docker build process
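A sketch of what the build-stage fix could look like; the base image, paths, and script names are assumptions. The key points are installing strictly from the lockfile with `npm ci` (never copying a host `node_modules` into the image — add it to `.dockerignore`) and logging the resolved React tree so a second copy is visible in the build output:

```dockerfile
# Sketch: clean, lockfile-only install (paths assumed)
FROM node:20-alpine AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci                  # clean install from the lockfile only
RUN npm ls react || true    # log every resolved copy of React for inspection
COPY . .
RUN npm run build
```

A host-installed `node_modules` leaking into the image alongside the image's own install is a classic cause of the "Invalid hook call" / "dispatcher is null" errors described above.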

IMMEDIATE ACTION REQUIRED:

  1. Implement resumable crawling to prevent user frustration
  2. Add automatic deduplication on startup/shutdown
  3. Fix React error handling to gracefully handle duplicate data
  4. Add progress tracking to prevent complete restarts
  5. Implement proper transaction handling to prevent partial data writes
  6. Fix Docker build process to prevent multiple React instances
  7. Add automatic port conflict resolution to ensure system can always start

This is a CRITICAL bug that makes the entire system unusable after any crawling interruption. Users lose hours of crawling progress and face complete UI crashes. The system can become completely unable to start due to port conflicts and React dependency issues.

Steps to Reproduce

Explained above.

Expected Behavior

Explained above.

Actual Behavior

Explained above.

Error Details (if any)

Explained above.

Affected Component

🔍 Knowledge Base / RAG

Browser & OS

Linux Mint, all

Additional Context

Explained above.

Service Status (check all that are working)

  • [x] 🖥️ Frontend UI (http://localhost:3737)
  • [x] ⚙️ Main Server (http://localhost:8181)
  • [x] 🔗 MCP Service (localhost:8051)
  • [ ] 🤖 Agents Service (http://localhost:8052)
  • [ ] 💾 Supabase Database (connected)

elektrikmoonkey, Oct 02 '25 18:10

+1 from my side: crawling starts and never completes for me. I have re-installed Archon several times, with both local and remote Supabase instances... very buggy.

dgtise25, Oct 24 '25 12:10