Memory optimization needed for large crawls - System becomes unresponsive
# Memory Optimization Issue for Large Crawls

## Problem Description
During large crawls, Archon fills up the host memory and causes the server to become unresponsive. This has required increasing RAM to 32GB on VMs, but the issue persists. The root cause appears to be that crawled data accumulates in memory without proper cleanup.
## Investigation Summary

### Key Memory Issues Identified

#### 1. Accumulation of Data in Memory
- The `url_to_full_document` dictionary stores entire document contents in memory for all crawled pages
- Batch crawling accumulates all results in a `successful_results` list before processing
- All chunks, metadata, and contents are accumulated in lists before batch processing
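
For context, the accumulation pattern described above looks roughly like the following self-contained sketch; `fetch_page` and the URL list are stand-ins for illustration, not Archon's actual code:

```python
# Simplified illustration of the accumulation pattern (stand-in code, not Archon's).

def fetch_page(url: str) -> str:
    # Placeholder for the real crawler call; pretend each page is ~2 MB of text.
    return "x" * (2 * 1024 * 1024)

urls = [f"https://docs.example.com/page/{i}" for i in range(50)]

url_to_full_document: dict[str, str] = {}
successful_results: list[dict] = []

for url in urls:
    content = fetch_page(url)
    url_to_full_document[url] = content                           # full page kept for the whole crawl
    successful_results.append({"url": url, "markdown": content})  # results list only ever grows

# Nothing above is released until the crawl ends, so peak memory scales with
# (number of pages) x (page size): here roughly 50 x 2 MB = ~100 MB of raw content.
```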
#### 2. No Memory Cleanup
- No explicit garbage collection or memory clearing between batches
- Results from processed batches are kept in memory throughout the entire crawl
- The `url_to_full_document` dictionary is never cleared during the crawl
#### 3. Large Batch Sizes
- Default batch sizes (50 URLs, 50 documents) can consume significant memory
- When crawling large documentation sites, each page can be several MB
- With 50 pages × 2-3 MB each = 100-150 MB just for raw content
#### 4. Memory Monitoring But No Action
- The system has a `MemoryAdaptiveDispatcher` that monitors memory
- It reduces concurrent workers when memory is high but doesn't clear existing data
- The memory threshold is set to 80%, but it only affects new crawls, not data already in memory
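
For reference, the kind of check the dispatcher performs can be approximated with `psutil`; this is a hedged sketch, not `MemoryAdaptiveDispatcher`'s actual implementation:

```python
import psutil

MEMORY_THRESHOLD_PERCENT = 80.0  # mirrors the 80% threshold mentioned above

def memory_pressure() -> bool:
    """Return True when overall system memory usage exceeds the threshold."""
    return psutil.virtual_memory().percent >= MEMORY_THRESHOLD_PERCENT

# Today, a check like this only throttles *new* work; results already held in
# memory are never freed. The fixes proposed below add cleanup at this point.
if memory_pressure():
    print("Memory pressure detected - concurrency would be reduced, but no data is freed")
```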
## Proposed Solution

### Phase 1: Immediate Memory Management (Quick Fixes)

- Add garbage collection between batches (see the sketch after this list)
  - Import the `gc` module and call `gc.collect()` after processing each batch
  - Clear large data structures once they're no longer needed
  - Add memory cleanup in strategic locations
- Implement streaming processing
  - Process and store documents immediately instead of accumulating them
  - Clear `url_to_full_document` after code extraction for each batch
  - Use generators where possible to avoid loading all data at once
- Add memory limits and circuit breakers
  - Stop crawling when memory usage exceeds a threshold
  - Implement automatic batch-size reduction based on memory usage
  - Add configuration for maximum memory per crawl
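
A minimal sketch of these quick fixes combined, assuming an 80% circuit-breaker threshold; `crawl_page` and `store_documents` are placeholders for the real crawler and storage calls, not Archon's actual functions:

```python
import gc

import psutil

MEMORY_LIMIT_PERCENT = 80.0  # assumed configurable circuit-breaker threshold

def crawl_page(url: str) -> str:
    """Placeholder for the real crawler call."""
    return f"<html>content for {url}</html>"

def store_documents(docs: dict[str, str]) -> None:
    """Placeholder for the real database write."""
    print(f"stored {len(docs)} documents")

def process_in_batches(urls: list[str], batch_size: int = 10) -> None:
    """Crawl and store URLs in batches, releasing memory between batches."""
    for start in range(0, len(urls), batch_size):
        # Circuit breaker: refuse to start a new batch while under memory pressure.
        if psutil.virtual_memory().percent >= MEMORY_LIMIT_PERCENT:
            raise MemoryError("Memory limit exceeded; aborting crawl")

        batch = urls[start:start + batch_size]
        url_to_full_document = {url: crawl_page(url) for url in batch}
        store_documents(url_to_full_document)

        # Release the batch's data instead of keeping it for the whole crawl.
        url_to_full_document.clear()
        gc.collect()

process_in_batches([f"https://docs.example.com/page/{i}" for i in range(30)])
```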
### Phase 2: Architectural Improvements

- Implement a chunked processing pipeline (see the sketch after this list)
  - Process URLs in smaller micro-batches (5-10 URLs)
  - Store to the database after each micro-batch
  - Clear memory between micro-batches
- Add memory-aware batch sizing
  - Dynamically adjust batch size based on available memory
  - Track memory usage per document and predict batch memory needs
  - Implement a backpressure mechanism
- Implement result streaming
  - Stream results directly to the database without accumulation
  - Use async generators for processing pipelines
  - Implement a write-through cache with size limits
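
One possible shape for the streaming micro-batch pipeline, sketched with async generators; every name here is illustrative rather than Archon's actual API:

```python
import asyncio
from collections.abc import AsyncIterator

async def crawl_stream(urls: list[str]) -> AsyncIterator[tuple[str, str]]:
    """Yield (url, content) pairs one at a time instead of building a results list."""
    for url in urls:
        await asyncio.sleep(0)  # stand-in for the real async crawl call
        yield url, f"content for {url}"

async def store_batch(batch: list[tuple[str, str]]) -> None:
    """Stand-in for the real database write."""
    await asyncio.sleep(0)

async def run_pipeline(urls: list[str], micro_batch_size: int = 5) -> None:
    """Consume the crawl stream in micro-batches; nothing outlives its micro-batch."""
    batch: list[tuple[str, str]] = []
    async for url, content in crawl_stream(urls):
        batch.append((url, content))
        if len(batch) >= micro_batch_size:
            await store_batch(batch)
            batch.clear()  # memory stays bounded by the micro-batch size
    if batch:
        await store_batch(batch)  # flush the final partial batch

asyncio.run(run_pipeline([f"https://docs.example.com/page/{i}" for i in range(12)]))
```

Because storage happens inside the loop, backpressure falls out naturally: the crawler can never get more than one micro-batch ahead of the database.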
### Phase 3: Configuration and Monitoring

- Add memory management settings (see the sketch after this list)
  - `MAX_MEMORY_PER_CRAWL` setting (in GB)
  - `AGGRESSIVE_GC_MODE` for memory-constrained environments
  - `BATCH_MEMORY_LIMIT` for per-batch limits
- Add memory metrics and logging
  - Log memory usage before/after each batch
  - Track peak memory usage during crawls
  - Add memory alerts and notifications
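
A sketch of how the settings and per-batch logging could fit together; the environment variable names come from the list above, while `log_memory` itself is a hypothetical helper:

```python
import logging
import os

import psutil

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawl.memory")

# Hypothetical environment-driven settings, named after the list above; the
# crawl loop would consult these when sizing batches and triggering cleanup.
MAX_MEMORY_PER_CRAWL_GB = float(os.getenv("MAX_MEMORY_PER_CRAWL", "4"))
AGGRESSIVE_GC_MODE = os.getenv("AGGRESSIVE_GC_MODE", "false").lower() == "true"
BATCH_MEMORY_LIMIT_MB = float(os.getenv("BATCH_MEMORY_LIMIT", "512"))

def log_memory(stage: str) -> float:
    """Log and return the current process RSS in MB."""
    rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
    logger.info("memory usage at %s: %.1f MB", stage, rss_mb)
    return rss_mb

# Example: wrap a batch with before/after measurements to track growth and peaks.
before = log_memory("batch start")
# ... process one batch here ...
after = log_memory("batch end")
if after - before > BATCH_MEMORY_LIMIT_MB:
    logger.warning("batch exceeded its memory budget by %.1f MB",
                   after - before - BATCH_MEMORY_LIMIT_MB)
```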
## Files Requiring Modification

- `python/src/server/services/crawling/strategies/batch.py`
  - Add `gc.collect()` after each batch
  - Clear `successful_results` periodically
  - Implement streaming result processing
- `python/src/server/services/crawling/document_storage_operations.py`
  - Clear `url_to_full_document` after processing
  - Process documents in smaller chunks
  - Add memory cleanup between document batches
- `python/src/server/services/storage/document_storage_service.py`
  - Implement streaming storage without accumulation
  - Add batch size limits based on memory
  - Clear the embedding cache after each batch
- `python/src/server/services/crawling/crawling_service.py`
  - Add memory monitoring and limits
  - Implement graceful degradation when memory is high
  - Add cleanup methods for memory management
## Expected Improvements
- 50-70% reduction in peak memory usage
- Ability to crawl larger sites without OOM
- Better performance on memory-constrained systems
- Graceful handling of memory pressure
## Reproduction Steps

- Start a crawl of a large documentation site (e.g., React docs, MDN)
- Monitor system memory usage with `htop` or similar
- Observe memory growing continuously without cleanup
- System becomes unresponsive when memory fills up
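
If you prefer numbers you can paste into the issue over watching `htop`, a minimal `psutil` watcher (a sketch, not part of Archon) works as well:

```python
import time

import psutil

# Poll overall memory usage once per second while the crawl runs; Ctrl+C to stop.
while True:
    vm = psutil.virtual_memory()
    print(f"used: {vm.used / 1024**3:.2f} GiB ({vm.percent:.0f}%)")
    time.sleep(1)
```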
## Environment
- Affects all deployment configurations
- Most severe on systems with <32GB RAM
- Problem scales with crawl size
## Priority
High - This is a critical issue affecting production deployments and limiting the ability to crawl large knowledge bases.
## Related Code Locations

- Memory monitoring: `python/src/server/services/threading_service.py`
- Batch processing: `python/src/server/services/crawling/strategies/batch.py`
- Document storage: `python/src/server/services/crawling/document_storage_operations.py`
- Storage service: `python/src/server/services/storage/document_storage_service.py`
## Suggested Labels
- bug
- performance
- enhancement
- memory
Hi @tazmon95 @coleam00:
Is this issue still being worked on? I have a Mac running Archon with local AI (Ollama outside of Docker) and cannot process mem0 for web crawls or PDF files bigger than 17 MB.

Never mind: RAG processing also depends on how much memory is allocated to Docker. I had configured it to have 1.5 GB free in excess of what the containers need. I changed it so it has 5 GB, and now it can process any file.