Discussion: Handling partial storage failures in document crawling
Context
During code review, we identified a potential silent data loss scenario in the document storage operations: chunks can be created but fail to be stored in the database, yet the crawl is still reported as "successful".
Current Behavior
In document_storage_operations.py, when process_and_store_documents completes:
- Chunks are created from crawled content
- Storage to Supabase is attempted
- If storage fails (e.g., foreign key violations, database errors), the function still returns with chunks_stored: 0
- The crawl appears successful to the user, but no data was actually saved
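Concretely, the caller receives a result like the following and no exception is raised. This is an illustration only; the exact return shape is an assumption based on the fields named in this discussion:

# Illustrative result shape (assumed from the chunk_count / chunks_stored
# fields discussed here, not copied from the actual code):
result = {
    "chunk_count": 42,    # chunks were created from the crawled content
    "chunks_stored": 0,   # nothing was persisted, but no error surfaces
}
# Because no exception is raised, the crawl is reported as successful.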
Potential Causes of Storage Failure
- Database connectivity issues - Supabase down, network timeouts
- Foreign key violations - Source record doesn't exist in the archon_sources table
- Database constraints - Unique violations, row size limits, quota exceeded
- Data validation - Invalid JSON, encoding issues
Proposed Solutions
Option 1: Fail Fast (Recommended for production)
class DocumentStorageError(RuntimeError):
    pass

# In process_and_store_documents, before returning:
if chunk_count > 0 and storage_stats.get("chunks_stored", 0) == 0:
    raise DocumentStorageError(
        f"Failed to store any chunks for {original_source_id}: "
        f"created {chunk_count} chunks but stored 0"
    )
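The crawl orchestrator would then surface the failure instead of reporting success. A hedged sketch of a call site; crawl_and_store and its argument names are assumptions, not the actual Archon API:

# Hedged sketch of a call site (function and argument names are assumptions):
async def crawl_and_store(documents, original_source_id):
    try:
        return await process_and_store_documents(documents, original_source_id)
    except DocumentStorageError:
        logger.exception(
            "Storage failed for %s; marking crawl as failed", original_source_id
        )
        raise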
Option 2: Enhanced Logging (Current approach for beta)
- Keep current behavior but add prominent warning logs (sketched after this list)
- Allows partial data collection during development
- Users can still get some data even if storage partially fails
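A minimal sketch of what the prominent warning could look like, reusing the chunk_count / storage_stats / original_source_id names from the Option 1 snippet:

import logging

logger = logging.getLogger(__name__)

# In process_and_store_documents, before returning (sketch; variable names
# reuse those from the Option 1 snippet above):
stored = storage_stats.get("chunks_stored", 0)
if chunk_count > 0 and stored < chunk_count:
    logger.warning(
        "Partial storage failure for %s: created %d chunks, stored only %d",
        original_source_id,
        chunk_count,
        stored,
    )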
Trade-offs
Fail Fast:
- ✅ Prevents silent data loss
- ✅ Forces immediate issue resolution
- ❌ Stops entire crawl on storage errors
- ❌ May be too strict during beta when we want to gather as much data as possible
Current Logging:
- ✅ Allows crawls to complete even with partial failures
- ✅ Better for beta testing and development
- ❌ Risk of unnoticed data loss
- ❌ Harder to debug storage issues
Decision for Beta
For now, we're keeping the logging approach to allow crawls to complete and gather as much data as possible during beta. This should be revisited before production release.
Questions for Discussion
- Should we add a user-visible warning when chunks_stored < chunk_count?
- Should we set a threshold (e.g., fail if fewer than 50% of chunks are stored)?
- Should this be a configurable behavior (strict vs. permissive mode)? A sketch combining this with the threshold idea follows this list.
- How should we handle partial batch failures vs complete failures?
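To make the threshold and configurability questions concrete, here is one possible shape as a hedged sketch. The environment variable names and defaults are placeholders, not existing settings, and it reuses DocumentStorageError and logger from the snippets above:

import os

# Placeholder settings for discussion; these env vars do not exist yet.
STRICT_STORAGE = os.getenv("STORAGE_STRICT_MODE", "false").lower() == "true"
MIN_STORED_RATIO = float(os.getenv("STORAGE_MIN_STORED_RATIO", "0.5"))

def check_storage_result(chunk_count: int, chunks_stored: int, source_id: str) -> None:
    """Warn or raise when too few chunks were stored, per the configured mode."""
    if chunk_count == 0:
        return
    ratio = chunks_stored / chunk_count
    if ratio >= MIN_STORED_RATIO:
        return
    message = (
        f"Stored only {chunks_stored}/{chunk_count} chunks "
        f"({ratio:.0%}) for {source_id}"
    )
    if STRICT_STORAGE:
        raise DocumentStorageError(message)  # fail fast (strict mode)
    logger.warning(message)  # permissive mode: log and continue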
Related PR: #514 (refactor-remove-sockets branch)