
🐛 [Bug]: Archon Crawling System Issues

Open · elektrikmoonkey opened this issue 3 months ago · 1 comment

Archon Version

Stable: Latest from September 29

Bug Severity

🔴 Critical - App unusable

Bug Description

BUG REPORT: CRITICAL ARCHON CRAWLING SYSTEM ISSUES

BIG BUG: MASSIVE DUPLICATE DATA CREATION AND REACT FRONTEND CRASHES

PROBLEM SUMMARY:

The Archon system has a critical bug: interrupted crawling operations create massive numbers of duplicate entries in the database (over 35,000 duplicates found), which in turn trigger React "dispatcher is null" errors that completely break the frontend UI.


WHEN THIS BUG OCCURS:

  1. WiFi/Network Disconnection During Crawling:

    • When crawling runs overnight and WiFi drops
    • When internet connection is interrupted mid-crawl
    • When user manually stops the crawling process
  2. Crawl Restart/Retry Scenarios:

    • When user hits "STOP" button during active crawling
    • When crawling is interrupted for any reason (power, network, manual stop)
    • When user attempts to crawl the same URLs again
    • When system tries to resume interrupted operations
  3. Database Corruption Points:

    • Each interrupted crawl creates hundreds to thousands of duplicate entries
    • Multiple entries with identical source_id and created_at timestamps
    • Frontend tries to render all duplicates causing React hook errors
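Assuming the schema hinted at in this report (an `archon_crawled_pages` table with `url`, `source_id`, and `created_at` columns), a query along these lines can quantify the damage after an interrupted crawl; table and column names are taken from the cleanup command later in this report:

```sql
-- Count how many copies exist per URL (names assumed from this report)
SELECT url, COUNT(*) AS copies
FROM archon_crawled_pages
GROUP BY url
HAVING COUNT(*) > 1
ORDER BY copies DESC
LIMIT 20;
```

Running this via `docker exec ... psql`, as the cleanup command below does, makes it easy to check whether a crawl interruption produced duplicates before the frontend ever tries to render them.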

CURRENT BEHAVIOR:

  • Before fix: 88,303 total entries → 40.7% were duplicates!
  • React frontend crashes with "Invalid hook call" and "dispatcher is null" errors
  • Frontend becomes completely unusable
  • Massive database bloat with duplicate content
  • System unable to start properly due to port conflicts and React errors

REQUIRED FEATURES TO FIX:

1. RESUMABLE CRAWLING SYSTEM:

  • "Continue from where it stopped" button
  • Progress tracking - save crawl progress to database
  • Checkpoint system - save state every X pages/chunks
  • Smart resume - detect incomplete crawls and offer to continue
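One way to back the checkpoint and smart-resume features is a small progress table written to on every checkpoint. This is only a sketch; the table and column names here are hypothetical, not part of Archon's current schema:

```sql
-- Hypothetical checkpoint table: one row per crawl run
CREATE TABLE IF NOT EXISTS archon_crawl_progress (
    source_id   text PRIMARY KEY,
    last_url    text,                               -- last page fully persisted
    pages_done  integer NOT NULL DEFAULT 0,
    status      text NOT NULL DEFAULT 'running',    -- 'running' | 'interrupted' | 'complete'
    updated_at  timestamptz NOT NULL DEFAULT now()
);
```

On startup, any row still marked `'running'` with a stale `updated_at` indicates an interrupted crawl, which the UI could surface as a "Continue from where it stopped" offer.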

2. DUPLICATE PREVENTION:

  • Atomic transactions - rollback incomplete crawls
  • Unique constraints - prevent duplicate URL/source_id combinations
  • Smart deduplication - check for existing content before inserting
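The unique-constraint and smart-deduplication points can be combined at the database level. A sketch, again assuming the `archon_crawled_pages` schema from this report; note the constraint can only be added after existing duplicates are removed (e.g. with the cleanup command below):

```sql
-- Enforce uniqueness per URL/source combination (assumed schema)
ALTER TABLE archon_crawled_pages
    ADD CONSTRAINT archon_crawled_pages_url_source_key UNIQUE (url, source_id);

-- Re-crawls and resumed crawls then become idempotent:
INSERT INTO archon_crawled_pages (url, source_id, content)
VALUES ('https://example.com/page', 'example-source', '...')
ON CONFLICT (url, source_id) DO NOTHING;
```

With the constraint in place, an interrupted-then-restarted crawl can simply re-insert everything and let Postgres silently skip rows that already exist, instead of creating the duplicates described above.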

3. CRITICAL DATABASE CLEANUP COMMAND:

This command should run automatically on startup/shutdown to maintain database health:

# Add this to startup scripts for automatic optimization
docker exec -it supabase_db_supabase-local psql -U postgres -d postgres -c "
DELETE FROM archon_crawled_pages 
WHERE id NOT IN (
    SELECT MIN(id)
    FROM archon_crawled_pages
    GROUP BY url
);"
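On a table of this size (88,000+ rows) the `NOT IN (subquery)` form can be slow. An equivalent self-join delete over the same assumed table, keeping the lowest `id` per `url`, is usually faster on Postgres:

```sql
-- Same effect as the command above: keep the earliest row per URL
DELETE FROM archon_crawled_pages a
USING archon_crawled_pages b
WHERE a.url = b.url
  AND a.id > b.id;
```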

2ND PROBLEM: DOCKER CONTAINER PORT CONFLICTS

ISSUE:

  • Old Docker containers don't shut down properly
  • Ports (3737, 8051, 8181) remain allocated to zombie processes
  • New containers can't start due to "port already allocated" errors
  • Creates cascading failures across the entire system
  • System becomes completely unable to start until manual cleanup

SOLUTION:

  • Graceful shutdown handling for all Docker services
  • Port conflict detection and cleanup on startup
  • Health checks to ensure containers are properly terminated
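The port-conflict check can be sketched in bash. The port list matches the ones named above; the docker cleanup line is left commented out because the right filter depends on your compose setup, so treat it as an assumption to adapt:

```shell
#!/usr/bin/env bash
# Sketch of a pre-start port check (bash-only: uses /dev/tcp).

port_in_use() {
  # exit status 0 if something accepts connections on 127.0.0.1:$1
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

for port in 3737 8051 8181; do
  if port_in_use "$port"; then
    echo "port $port busy - clean up the stale container"
    # docker ps -q --filter "publish=$port" | xargs -r docker rm -f
  else
    echo "port $port free"
  fi
done
```

Run before `docker compose up`, this turns the "port already allocated" cascade into a single explicit cleanup step instead of a failed start.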

3RD PROBLEM: MULTIPLE REACT INSTANCES

ISSUE:

  • Docker frontend image contains multiple React copies
  • Causes "Invalid hook call" and "dispatcher is null" errors
  • Frontend becomes completely unusable
  • System unable to start properly due to React dependency conflicts

SOLUTION:

  • Proper dependency management in Docker build
  • Single React instance enforcement
  • Clean npm install during Docker build process
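A sketch of what the build-stage fix could look like; the base image, paths, and script names are assumptions. The key points are installing strictly from the lockfile with `npm ci` (never copying a host `node_modules` into the image — add it to `.dockerignore`) and logging the resolved React tree so a second copy is visible in the build output:

```dockerfile
# Sketch: clean, lockfile-only install (paths assumed)
FROM node:20-alpine AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci                  # clean install from the lockfile only
RUN npm ls react || true    # log every resolved copy of React for inspection
COPY . .
RUN npm run build
```

A host-installed `node_modules` leaking into the image alongside the image's own install is a classic cause of the "Invalid hook call" / "dispatcher is null" errors described above.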

IMMEDIATE ACTION REQUIRED:

  1. Implement resumable crawling to prevent user frustration
  2. Add automatic deduplication on startup/shutdown
  3. Fix React error handling to gracefully handle duplicate data
  4. Add progress tracking to prevent complete restarts
  5. Implement proper transaction handling to prevent partial data writes
  6. Fix Docker build process to prevent multiple React instances
  7. Add automatic port conflict resolution to ensure system can always start

This is a CRITICAL bug that makes the entire system unusable after any crawling interruption. Users lose hours of crawling progress and face complete UI crashes. The system can become completely unable to start due to port conflicts and React dependency issues.

Steps to Reproduce

Explained above.

Expected Behavior

Explained above.

Actual Behavior

Explained above.

Error Details (if any)

Explained above.

Affected Component

🔍 Knowledge Base / RAG

Browser & OS

Linux Mint, all

Additional Context

Explained above.

Service Status (check all that are working)

  • [x] 🖥️ Frontend UI (http://localhost:3737)
  • [x] ⚙️ Main Server (http://localhost:8181)
  • [x] 🔗 MCP Service (localhost:8051)
  • [ ] 🤖 Agents Service (http://localhost:8052)
  • [ ] 💾 Supabase Database (connected)

elektrikmoonkey, Oct 02 '25 18:10

+1 from my side: crawling starts and never completes for me. I have re-installed Archon several times, with both local and remote Supabase instances... very buggy.

dgtise25, Oct 24 '25 12:10