
Memory optimization needed for large crawls - System becomes unresponsive

Open · tazmon95 opened this issue 3 months ago · 1 comment

Memory Optimization Issue for Large Crawls

Problem Description

During large crawls, Archon fills up the host memory and causes the server to become unresponsive. This has required increasing RAM to 32GB on VMs, but the issue persists. The root cause appears to be that crawled data accumulates in memory without proper cleanup.

Investigation Summary

Key Memory Issues Identified

1. Accumulation of Data in Memory

  • The url_to_full_document dictionary stores the full contents of every crawled page in memory
  • Batch crawling accumulates all results in the successful_results list before processing
  • All chunks, metadata, and contents are accumulated in lists before batch processing (a simplified sketch of this pattern follows the list of issues below)

2. No Memory Cleanup

  • No explicit garbage collection or memory clearing between batches
  • Results from processed batches are kept in memory throughout the entire crawl
  • The url_to_full_document dictionary is never cleared during the crawl

3. Large Batch Sizes

  • Default batch sizes (50 URLs, 50 documents) can consume significant memory
  • When crawling large documentation sites, each page can be several MB
  • With 50 pages × 2-3 MB each = 100-150 MB just for raw content

4. Memory Monitoring But No Action

  • The system has a MemoryAdaptiveDispatcher that monitors memory usage
  • It reduces concurrent workers when memory is high but doesn't clear existing data
  • The memory threshold is set to 80%, but it only affects new crawls; it does not release data already held in memory
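
To make the problem concrete, here is a simplified sketch of the accumulation pattern described above. fetch_page, extract_chunks, and store_all are hypothetical stand-ins, not Archon functions; only the url_to_full_document and successful_results names come from the issue itself, and the real code differs in detail.

```python
# Hypothetical illustration of the accumulation problem; fetch_page,
# extract_chunks and store_all are stand-ins, not Archon functions.

def fetch_page(url: str) -> str:
    return "x" * 2_000_000                      # pretend each page is ~2 MB

def extract_chunks(html: str) -> list[str]:
    return [html[i:i + 4000] for i in range(0, len(html), 4000)]

def store_all(results: list[dict], docs: dict[str, str]) -> None:
    pass                                        # batch upsert into the database

def crawl_site(urls: list[str]) -> None:
    url_to_full_document: dict[str, str] = {}   # grows for the entire crawl
    successful_results: list[dict] = []         # never cleared between batches

    for start in range(0, len(urls), 50):       # default batch size of 50
        for url in urls[start:start + 50]:
            html = fetch_page(url)
            url_to_full_document[url] = html    # full document kept in memory
            successful_results.append(
                {"url": url, "chunks": extract_chunks(html)}
            )

    # Nothing is released until the whole crawl finishes, so peak memory
    # scales with the size of the site rather than with the batch size.
    store_all(successful_results, url_to_full_document)
```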

Proposed Solution

Phase 1: Immediate Memory Management (Quick Fixes)

  1. Add garbage collection between batches

    • Import the gc module and call gc.collect() after processing each batch
    • Clear large data structures after they're no longer needed
    • Add memory cleanup in strategic locations
  2. Implement streaming processing

    • Process and store documents immediately instead of accumulating
    • Clear url_to_full_document after code extraction for each batch
    • Use generators where possible to avoid loading all data at once
  3. Add memory limits and circuit breakers

    • Stop crawling when memory usage exceeds a threshold
    • Implement automatic batch size reduction based on memory usage
    • Add configuration for maximum memory per crawl (a combined sketch of these quick fixes follows this list)
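
A minimal sketch of the Phase 1 quick fixes, assuming a psutil-based check. MAX_MEMORY_PERCENT, crawl_batch, and store_batch are illustrative names, not existing Archon settings or functions.

```python
import gc

import psutil  # assumed available; used here for the circuit-breaker check

MAX_MEMORY_PERCENT = 85.0  # illustrative threshold, would come from config


class MemoryLimitExceeded(RuntimeError):
    """Raised to abort a crawl when host memory crosses the threshold."""


def crawl_batch(batch: list[str]) -> dict[str, str]:
    return {url: "<html>...</html>" for url in batch}   # placeholder crawl


def store_batch(docs: dict[str, str]) -> None:
    pass  # placeholder: chunk, embed and upsert this batch only


def crawl_in_batches(urls: list[str], batch_size: int = 50) -> None:
    for start in range(0, len(urls), batch_size):
        used = psutil.virtual_memory().percent
        if used > MAX_MEMORY_PERCENT:
            raise MemoryLimitExceeded(f"memory at {used:.0f}%, aborting crawl")

        url_to_full_document = crawl_batch(urls[start:start + batch_size])
        store_batch(url_to_full_document)       # persist immediately, per batch

        # Drop the per-batch structures and force a collection so the large
        # document strings are returned to the allocator between batches.
        url_to_full_document.clear()
        del url_to_full_document
        gc.collect()
```

In batch.py this cleanup would run after each batch of successful_results is stored; the same pattern applies to clearing url_to_full_document in document_storage_operations.py.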

Phase 2: Architectural Improvements

  1. Implement chunked processing pipeline

    • Process URLs in smaller micro-batches (5-10 URLs)
    • Store to database after each micro-batch
    • Clear memory between micro-batches
  2. Add memory-aware batch sizing

    • Dynamically adjust batch size based on available memory
    • Track memory usage per document and predict batch memory needs
    • Implement backpressure mechanism
  3. Implement result streaming

    • Stream results directly to the database without accumulation
    • Use async generators for processing pipelines (sketched after this list)
    • Implement a write-through cache with size limits
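
A sketch of the Phase 2 streaming micro-batch pipeline built on async generators. crawl_one and store_chunks are hypothetical stand-ins for the real crawl and storage calls.

```python
import asyncio
import gc
from typing import AsyncIterator

MICRO_BATCH_SIZE = 5  # process and persist 5-10 URLs at a time


async def crawl_one(url: str) -> tuple[str, str]:
    await asyncio.sleep(0)               # placeholder for the real fetch
    return url, "<html>...</html>"


async def store_chunks(docs: dict[str, str]) -> None:
    await asyncio.sleep(0)               # placeholder for chunk + embed + upsert


async def crawl_stream(urls: list[str]) -> AsyncIterator[dict[str, str]]:
    """Yield one micro-batch of documents at a time instead of accumulating."""
    for start in range(0, len(urls), MICRO_BATCH_SIZE):
        micro = urls[start:start + MICRO_BATCH_SIZE]
        results = await asyncio.gather(*(crawl_one(u) for u in micro))
        yield dict(results)


async def run_crawl(urls: list[str]) -> None:
    async for docs in crawl_stream(urls):
        await store_chunks(docs)         # persist before fetching the next batch
        docs.clear()                     # nothing from this batch outlives the loop
        gc.collect()


if __name__ == "__main__":
    asyncio.run(run_crawl([f"https://example.com/docs/{i}" for i in range(25)]))
```

Because each micro-batch is persisted before the next one is fetched, peak memory stays roughly proportional to MICRO_BATCH_SIZE instead of to the size of the whole site.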

Phase 3: Configuration and Monitoring

  1. Add memory management settings

    • MAX_MEMORY_PER_CRAWL setting (in GB)
    • AGGRESSIVE_GC_MODE for memory-constrained environments
    • BATCH_MEMORY_LIMIT for per-batch limits
  2. Add memory metrics and logging

    • Log memory usage before/after each batch
    • Track peak memory usage during crawls
    • Add memory alerts and notifications (a sketch of these settings and logging hooks follows this list)
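
A sketch of the proposed settings and per-batch memory logging. The environment variable names mirror the settings proposed above but do not exist in Archon today, and psutil is assumed for the RSS measurements.

```python
import logging
import os
from contextlib import contextmanager

import psutil

logger = logging.getLogger("crawl.memory")

# Proposed settings, read from the environment; the names follow the issue
# text and are not existing Archon configuration keys.
MAX_MEMORY_PER_CRAWL_GB = float(os.getenv("MAX_MEMORY_PER_CRAWL", "4"))
AGGRESSIVE_GC_MODE = os.getenv("AGGRESSIVE_GC_MODE", "false").lower() == "true"

_peak_rss_mb = 0.0


@contextmanager
def track_batch_memory(batch_index: int):
    """Log RSS before/after a batch and keep a running peak for the crawl."""
    global _peak_rss_mb
    process = psutil.Process()
    before_mb = process.memory_info().rss / 1_048_576
    logger.info("batch %d start: rss=%.1f MB", batch_index, before_mb)
    try:
        yield
    finally:
        after_mb = process.memory_info().rss / 1_048_576
        _peak_rss_mb = max(_peak_rss_mb, after_mb)
        logger.info(
            "batch %d end: rss=%.1f MB (delta %+.1f MB, peak %.1f MB)",
            batch_index, after_mb, after_mb - before_mb, _peak_rss_mb,
        )
        if after_mb > MAX_MEMORY_PER_CRAWL_GB * 1024:
            logger.warning("crawl exceeded MAX_MEMORY_PER_CRAWL (%.1f GB)",
                           MAX_MEMORY_PER_CRAWL_GB)
```

Each batch in batch.py could then be wrapped in `with track_batch_memory(batch_index): ...` to produce the before/after log lines and peak tracking described above.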

Files Requiring Modification

  1. python/src/server/services/crawling/strategies/batch.py

    • Add gc.collect() after each batch
    • Clear successful_results periodically
    • Implement streaming result processing
  2. python/src/server/services/crawling/document_storage_operations.py

    • Clear url_to_full_document after processing
    • Process documents in smaller chunks
    • Add memory cleanup between document batches
  3. python/src/server/services/storage/document_storage_service.py

    • Implement streaming storage without accumulation
    • Add batch size limits based on memory
    • Clear embedding cache after each batch
  4. python/src/server/services/crawling/crawling_service.py

    • Add memory monitoring and limits
    • Implement graceful degradation when memory is high
    • Add cleanup methods for memory management

Expected Improvements

  • 50-70% reduction in peak memory usage
  • Ability to crawl larger sites without OOM
  • Better performance on memory-constrained systems
  • Graceful handling of memory pressure

Reproduction Steps

  1. Start a crawl of a large documentation site (e.g., React docs, MDN)
  2. Monitor system memory usage with htop or similar
  3. Observe memory growing continuously without cleanup
  4. The system becomes unresponsive once memory fills up

Environment

  • Affects all deployment configurations
  • Most severe on systems with <32GB RAM
  • Problem scales with crawl size

Priority

High - This is a critical issue affecting production deployments and limiting the ability to crawl large knowledge bases.

Related Code Locations

  • Memory monitoring: python/src/server/services/threading_service.py
  • Batch processing: python/src/server/services/crawling/strategies/batch.py
  • Document storage: python/src/server/services/crawling/document_storage_operations.py
  • Storage service: python/src/server/services/storage/document_storage_service.py

Suggested Labels

  • bug
  • performance
  • enhancement
  • memory

tazmon95 · Sep 20 '25

Hi @tazmon95, @coleam00:

Is this issue still being worked on? I have a Mac running Archon with local AI (Ollama outside of Docker) and cannot process mem0 for web crawls or PDF files bigger than 17 MB.

Never mind: RAG processing also depends on how much memory is allocated to Docker. I had configured it with 1.5 GB free beyond what the containers need; after increasing that to 5 GB it can process any file.

Tete-Cohete · Oct 13 '25