
Discussion: Handling partial storage failures in document crawling

Wirasm opened this issue 3 months ago · 0 comments

Context

During code review, we identified a potential silent data loss scenario in the document storage operations: chunks can be created but then fail to be stored in the database, while the crawl is still reported as "successful".

Current Behavior

In document_storage_operations.py, when process_and_store_documents completes:

  • Chunks are created from crawled content
  • Storage to Supabase is attempted
  • If storage fails (e.g., foreign key violations, database errors), the function still returns with chunks_stored: 0
  • The crawl appears successful to the user, even though no data was actually saved (see the sketch below)
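
A minimal sketch of that failure path, assuming names for the storage helper and the return shape (these are illustrative, not the actual code):

import logging

logger = logging.getLogger(__name__)

async def store_chunks_in_supabase(chunks: list[dict], source_id: str) -> dict:
    ...  # hypothetical stand-in for the real Supabase insert helper

async def process_and_store_documents(chunks: list[dict], original_source_id: str) -> dict:
    # Chunk creation has already succeeded by this point.
    storage_stats = {"chunks_stored": 0}
    try:
        storage_stats = await store_chunks_in_supabase(chunks, original_source_id)
    except Exception as exc:
        # The broad catch swallows the failure; the caller never sees it.
        logger.error(f"Storage failed for {original_source_id}: {exc}")

    # Reported as a normal completion even when chunks_stored is 0.
    return {
        "chunk_count": len(chunks),
        "chunks_stored": storage_stats.get("chunks_stored", 0),
    }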

Potential Causes of Storage Failure

  1. Database connectivity issues - Supabase down, network timeouts
  2. Foreign key violations - Source record doesn't exist in archon_sources table
  3. Database constraints - Unique violations, row size limits, quota exceeded
  4. Data validation - Invalid JSON, encoding issues
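
If the storage layer surfaces raw exceptions, they could be mapped onto these categories for clearer logs. A rough sketch: the SQLSTATE codes are standard Postgres values, but the APIError attribute access is an assumption about supabase-py's PostgREST client:

from postgrest.exceptions import APIError  # raised by supabase-py's PostgREST layer

# Standard Postgres SQLSTATE codes for the constraint failures listed above.
FK_VIOLATION = "23503"      # e.g., source row missing from archon_sources
UNIQUE_VIOLATION = "23505"  # duplicate chunk

def classify_storage_error(exc: Exception) -> str:
    """Map a raw storage exception onto the failure categories above."""
    if isinstance(exc, APIError):
        code = getattr(exc, "code", None)  # attribute name assumed
        if code == FK_VIOLATION:
            return "foreign_key_violation"
        if code == UNIQUE_VIOLATION:
            return "database_constraint"
        return "database_error"
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return "connectivity"
    return "data_validation_or_unknown"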

Proposed Solutions

Option 1: Fail Fast (Recommended for production)

class DocumentStorageError(RuntimeError):
    pass

# In process_and_store_documents, before return:
if chunk_count > 0 and storage_stats.get("chunks_stored", 0) == 0:
    raise DocumentStorageError(
        f"Failed to store any chunks for {original_source_id}: "
        f"Created {chunk_count} chunks but stored 0"
    )
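
If we adopt this, whatever orchestrates the crawl would need to catch the error and mark the crawl failed. A rough caller-side sketch; report_crawl_status is a hypothetical progress-reporting call, not an existing function:

# Names besides DocumentStorageError and process_and_store_documents are illustrative.
try:
    result = await process_and_store_documents(crawl_results, original_source_id)
except DocumentStorageError as exc:
    logger.error(f"Crawl failed during storage: {exc}")
    await report_crawl_status(original_source_id, status="failed", error=str(exc))
    raise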

Option 2: Enhanced Logging (Current approach for beta)

  • Keep current behavior but add prominent warning logs (sketched below)
  • Allows partial data collection during development
  • Users can still get some data even if storage partially fails
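
A minimal sketch of what that warning could look like, reusing the variable names from the Option 1 snippet (the log wording is illustrative):

import logging

logger = logging.getLogger(__name__)

# After storage completes, compare what was created against what actually landed.
stored = storage_stats.get("chunks_stored", 0)
if stored < chunk_count:
    logger.warning(
        "PARTIAL STORAGE FAILURE for %s: created %d chunks but stored only %d. "
        "Data may be missing from this crawl.",
        original_source_id, chunk_count, stored,
    )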

Trade-offs

Fail Fast:

  • ✅ Prevents silent data loss
  • ✅ Forces immediate issue resolution
  • ❌ Stops entire crawl on storage errors
  • ❌ May be too strict during beta when we want to gather as much data as possible

Current Logging:

  • ✅ Allows crawls to complete even with partial failures
  • ✅ Better for beta testing and development
  • ❌ Risk of unnoticed data loss
  • ❌ Harder to debug storage issues

Decision for Beta

For now, we're keeping the logging approach to allow crawls to complete and gather as much data as possible during beta. This should be revisited before production release.

Questions for Discussion

  1. Should we add a user-visible warning when chunks_stored < chunk_count?
  2. Should we set a threshold (e.g., fail if < 50% chunks stored)?
  3. Should this be a configurable behavior (strict vs permissive mode)? A possible shape for this and question 2 is sketched below.
  4. How should we handle partial batch failures vs complete failures?
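
One possible shape for questions 2 and 3 combined, assuming a settings mechanism like environment variables (the names and defaults here are illustrative, and DocumentStorageError comes from Option 1):

import logging
import os

logger = logging.getLogger(__name__)

# Hypothetical settings; variable names and defaults are illustrative only.
STORAGE_MODE = os.getenv("ARCHON_STORAGE_MODE", "permissive")  # or "strict"
MIN_STORED_FRACTION = float(os.getenv("ARCHON_MIN_STORED_FRACTION", "0.5"))

def check_storage_result(chunk_count: int, chunks_stored: int, source_id: str) -> None:
    """Raise in strict mode when too few chunks were stored; warn otherwise."""
    if chunk_count == 0:
        return
    fraction = chunks_stored / chunk_count
    if fraction >= MIN_STORED_FRACTION:
        return
    message = (
        f"Stored only {chunks_stored}/{chunk_count} chunks "
        f"({fraction:.0%}) for {source_id}"
    )
    if STORAGE_MODE == "strict":
        raise DocumentStorageError(message)
    logger.warning("%s; continuing in permissive mode", message)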

Related PR: #514 (refactor-remove-sockets branch)

Wirasm · Aug 29, 2025