
Memory optimization needed for large crawls - System becomes unresponsive

Open · tazmon95 opened this issue 3 months ago · 1 comment

Memory Optimization Issue for Large Crawls

Problem Description

During large crawls, Archon fills up the host memory and causes the server to become unresponsive. This has required increasing RAM to 32GB on VMs, but the issue persists. The root cause appears to be that crawled data accumulates in memory without proper cleanup.

Investigation Summary

Key Memory Issues Identified

1. Accumulation of Data in Memory

  • The url_to_full_document dictionary stores the full contents of every crawled page in memory
  • Batch crawling accumulates all results in the successful_results list before processing
  • All chunks, metadata, and contents are accumulated in lists before batch processing (a simplified sketch of this pattern follows the list of issues below)

2. No Memory Cleanup

  • No explicit garbage collection or memory clearing between batches
  • Results from processed batches are kept in memory throughout the entire crawl
  • The url_to_full_document dictionary is never cleared during the crawl

3. Large Batch Sizes

  • Default batch sizes (50 URLs, 50 documents) can consume significant memory
  • When crawling large documentation sites, each page can be several MB
  • With 50 pages × 2-3 MB each = 100-150 MB just for raw content

4. Memory Monitoring But No Action

  • The system has a MemoryAdaptiveDispatcher that monitors memory usage
  • It reduces concurrent workers when memory is high but doesn't clear existing data
  • The memory threshold is set to 80%, but it only affects new crawls; it does not release data already held in memory
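
To make the problem concrete, here is a simplified sketch of the accumulation pattern described above. fetch_page, extract_chunks, and store_all are hypothetical stand-ins, not Archon functions; only the url_to_full_document and successful_results names come from the issue itself, and the real code differs in detail.

```python
# Hypothetical illustration of the accumulation problem; fetch_page,
# extract_chunks and store_all are stand-ins, not Archon functions.

def fetch_page(url: str) -> str:
    return "x" * 2_000_000                      # pretend each page is ~2 MB

def extract_chunks(html: str) -> list[str]:
    return [html[i:i + 4000] for i in range(0, len(html), 4000)]

def store_all(results: list[dict], docs: dict[str, str]) -> None:
    pass                                        # batch upsert into the database

def crawl_site(urls: list[str]) -> None:
    url_to_full_document: dict[str, str] = {}   # grows for the entire crawl
    successful_results: list[dict] = []         # never cleared between batches

    for start in range(0, len(urls), 50):       # default batch size of 50
        for url in urls[start:start + 50]:
            html = fetch_page(url)
            url_to_full_document[url] = html    # full document kept in memory
            successful_results.append(
                {"url": url, "chunks": extract_chunks(html)}
            )

    # Nothing is released until the whole crawl finishes, so peak memory
    # scales with the size of the site rather than with the batch size.
    store_all(successful_results, url_to_full_document)
```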

Proposed Solution

Phase 1: Immediate Memory Management (Quick Fixes)

  1. Add garbage collection between batches

    • Import the gc module and call gc.collect() after processing each batch
    • Clear large data structures after they're no longer needed
    • Add memory cleanup in strategic locations
  2. Implement streaming processing

    • Process and store documents immediately instead of accumulating
    • Clear url_to_full_document after code extraction for each batch
    • Use generators where possible to avoid loading all data at once
  3. Add memory limits and circuit breakers

    • Stop crawling when memory usage exceeds a threshold
    • Implement automatic batch size reduction based on memory usage
    • Add configuration for maximum memory per crawl (a combined sketch of these quick fixes follows this list)
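
A minimal sketch of the Phase 1 quick fixes, assuming a psutil-based check. MAX_MEMORY_PERCENT, crawl_batch, and store_batch are illustrative names, not existing Archon settings or functions.

```python
import gc

import psutil  # assumed available; used here for the circuit-breaker check

MAX_MEMORY_PERCENT = 85.0  # illustrative threshold, would come from config


class MemoryLimitExceeded(RuntimeError):
    """Raised to abort a crawl when host memory crosses the threshold."""


def crawl_batch(batch: list[str]) -> dict[str, str]:
    return {url: "<html>...</html>" for url in batch}   # placeholder crawl


def store_batch(docs: dict[str, str]) -> None:
    pass  # placeholder: chunk, embed and upsert this batch only


def crawl_in_batches(urls: list[str], batch_size: int = 50) -> None:
    for start in range(0, len(urls), batch_size):
        used = psutil.virtual_memory().percent
        if used > MAX_MEMORY_PERCENT:
            raise MemoryLimitExceeded(f"memory at {used:.0f}%, aborting crawl")

        url_to_full_document = crawl_batch(urls[start:start + batch_size])
        store_batch(url_to_full_document)       # persist immediately, per batch

        # Drop the per-batch structures and force a collection so the large
        # document strings are returned to the allocator between batches.
        url_to_full_document.clear()
        del url_to_full_document
        gc.collect()
```

In batch.py this cleanup would run after each batch of successful_results is stored; the same pattern applies to clearing url_to_full_document in document_storage_operations.py.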

Phase 2: Architectural Improvements

  1. Implement chunked processing pipeline

    • Process URLs in smaller micro-batches (5-10 URLs)
    • Store to database after each micro-batch
    • Clear memory between micro-batches
  2. Add memory-aware batch sizing

    • Dynamically adjust batch size based on available memory
    • Track memory usage per document and predict batch memory needs
    • Implement backpressure mechanism
  3. Implement result streaming

    • Stream results directly to the database without accumulation
    • Use async generators for processing pipelines (sketched after this list)
    • Implement a write-through cache with size limits
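
A sketch of the Phase 2 streaming micro-batch pipeline built on async generators. crawl_one and store_chunks are hypothetical stand-ins for the real crawl and storage calls.

```python
import asyncio
import gc
from typing import AsyncIterator

MICRO_BATCH_SIZE = 5  # process and persist 5-10 URLs at a time


async def crawl_one(url: str) -> tuple[str, str]:
    await asyncio.sleep(0)               # placeholder for the real fetch
    return url, "<html>...</html>"


async def store_chunks(docs: dict[str, str]) -> None:
    await asyncio.sleep(0)               # placeholder for chunk + embed + upsert


async def crawl_stream(urls: list[str]) -> AsyncIterator[dict[str, str]]:
    """Yield one micro-batch of documents at a time instead of accumulating."""
    for start in range(0, len(urls), MICRO_BATCH_SIZE):
        micro = urls[start:start + MICRO_BATCH_SIZE]
        results = await asyncio.gather(*(crawl_one(u) for u in micro))
        yield dict(results)


async def run_crawl(urls: list[str]) -> None:
    async for docs in crawl_stream(urls):
        await store_chunks(docs)         # persist before fetching the next batch
        docs.clear()                     # nothing from this batch outlives the loop
        gc.collect()


if __name__ == "__main__":
    asyncio.run(run_crawl([f"https://example.com/docs/{i}" for i in range(25)]))
```

Because each micro-batch is persisted before the next one is fetched, peak memory stays roughly proportional to MICRO_BATCH_SIZE instead of to the size of the whole site.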

Phase 3: Configuration and Monitoring

  1. Add memory management settings

    • MAX_MEMORY_PER_CRAWL setting (in GB)
    • AGGRESSIVE_GC_MODE for memory-constrained environments
    • BATCH_MEMORY_LIMIT for per-batch limits
  2. Add memory metrics and logging

    • Log memory usage before/after each batch
    • Track peak memory usage during crawls
    • Add memory alerts and notifications (a sketch of these settings and logging hooks follows this list)
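
A sketch of the proposed settings and per-batch memory logging. The environment variable names mirror the settings proposed above but do not exist in Archon today, and psutil is assumed for the RSS measurements.

```python
import logging
import os
from contextlib import contextmanager

import psutil

logger = logging.getLogger("crawl.memory")

# Proposed settings, read from the environment; the names follow the issue
# text and are not existing Archon configuration keys.
MAX_MEMORY_PER_CRAWL_GB = float(os.getenv("MAX_MEMORY_PER_CRAWL", "4"))
AGGRESSIVE_GC_MODE = os.getenv("AGGRESSIVE_GC_MODE", "false").lower() == "true"

_peak_rss_mb = 0.0


@contextmanager
def track_batch_memory(batch_index: int):
    """Log RSS before/after a batch and keep a running peak for the crawl."""
    global _peak_rss_mb
    process = psutil.Process()
    before_mb = process.memory_info().rss / 1_048_576
    logger.info("batch %d start: rss=%.1f MB", batch_index, before_mb)
    try:
        yield
    finally:
        after_mb = process.memory_info().rss / 1_048_576
        _peak_rss_mb = max(_peak_rss_mb, after_mb)
        logger.info(
            "batch %d end: rss=%.1f MB (delta %+.1f MB, peak %.1f MB)",
            batch_index, after_mb, after_mb - before_mb, _peak_rss_mb,
        )
        if after_mb > MAX_MEMORY_PER_CRAWL_GB * 1024:
            logger.warning("crawl exceeded MAX_MEMORY_PER_CRAWL (%.1f GB)",
                           MAX_MEMORY_PER_CRAWL_GB)
```

Each batch in batch.py could then be wrapped in `with track_batch_memory(batch_index): ...` to produce the before/after log lines and peak tracking described above.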

Files Requiring Modification

  1. python/src/server/services/crawling/strategies/batch.py

    • Add gc.collect() after each batch
    • Clear successful_results periodically
    • Implement streaming result processing
  2. python/src/server/services/crawling/document_storage_operations.py

    • Clear url_to_full_document after processing
    • Process documents in smaller chunks
    • Add memory cleanup between document batches
  3. python/src/server/services/storage/document_storage_service.py

    • Implement streaming storage without accumulation
    • Add batch size limits based on memory
    • Clear embedding cache after each batch
  4. python/src/server/services/crawling/crawling_service.py

    • Add memory monitoring and limits
    • Implement graceful degradation when memory is high
    • Add cleanup methods for memory management

Expected Improvements

  • 50-70% reduction in peak memory usage
  • Ability to crawl larger sites without OOM
  • Better performance on memory-constrained systems
  • Graceful handling of memory pressure

Reproduction Steps

  1. Start a crawl of a large documentation site (e.g., React docs, MDN)
  2. Monitor system memory usage with htop or similar
  3. Observe memory growing continuously without cleanup
  4. The system becomes unresponsive once memory fills up

Environment

  • Affects all deployment configurations
  • Most severe on systems with <32GB RAM
  • Problem scales with crawl size

Priority

High - This is a critical issue affecting production deployments and limiting the ability to crawl large knowledge bases.

Related Code Locations

  • Memory monitoring: python/src/server/services/threading_service.py
  • Batch processing: python/src/server/services/crawling/strategies/batch.py
  • Document storage: python/src/server/services/crawling/document_storage_operations.py
  • Storage service: python/src/server/services/storage/document_storage_service.py

Suggested Labels

  • bug
  • performance
  • enhancement
  • memory

tazmon95 · Sep 20 '25

Hi @tazmon95, @coleam00:

Is this issue still being worked on? I have a Mac running Archon with local AI (Ollama outside of Docker) and cannot process mem0 for web crawls or PDF files bigger than 17 MB.

Never mind: RAG processing also depends on how much memory is allocated to Docker. I had configured it with 1.5 GB free beyond what the containers need; after increasing that to 5 GB it can process any file.

Tete-Cohete · Oct 13 '25