[Improvement] Improve Collection Deletion: Handle Document Source Files Cleanup and Resource Hierarchy Management
Current Behavior
When a collection is deleted:
- The collection is marked as DELETED in the database
- A `collection_delete_task` is triggered asynchronously
- The collection's quota is released
- However, the source files of documents within the collection remain in storage (either the local `.objects` directory or object storage like S3)
- This leads to accumulating storage usage over time, as these orphaned files are never cleaned up
Problem Impact
- Wasted Storage Space: Orphaned document files accumulate in storage
- Cost Implications: For cloud storage (S3), this means unnecessary ongoing storage costs
- Resource Management: Lack of clear cleanup strategy for hierarchical resources
- Potential Compliance Issues: Retaining user data that should have been deleted
Proposed Solutions
Approach 1: Synchronous Cascade Delete
Delete all child resources (documents, indexes, files) within the same transaction as the collection deletion (see the sketch after the pros and cons).
Pros:
- Immediate consistency
- Simpler to reason about
- ACID compliance
Cons:
- Long-running transactions
- Potential timeout issues
- Higher failure risk
- Blocks the user request
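To make Approach 1 concrete, here is a minimal sketch of a single-transaction cascade delete. It is illustrative only: it uses `sqlite3` as a stand-in driver, and the table names, columns, and `delete_collection_sync` helper are assumptions, not the actual ApeRAG schema.

```python
# Illustrative only: table layout, column names and file handling are assumptions.
import sqlite3


def delete_collection_sync(conn: sqlite3.Connection, collection_id: str) -> None:
    """Delete a collection and all of its child rows in a single transaction."""
    with conn:  # commits on success, rolls back on any exception
        docs = conn.execute(
            "SELECT id, object_path FROM documents WHERE collection_id = ?",
            (collection_id,),
        ).fetchall()
        for doc_id, object_path in docs:
            conn.execute(
                "DELETE FROM document_indexes WHERE document_id = ?", (doc_id,)
            )
            conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
            # The source file (local .objects path or S3 object) would also have
            # to be removed here, which is what makes the request long-running.
        conn.execute("DELETE FROM collections WHERE id = ?", (collection_id,))
```

Everything, including source-file removal, has to finish before the transaction commits, which is exactly where the long-running-transaction and timeout risks above come from.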
Approach 2: Asynchronous Staged Deletion (Recommended)
Enhance the existing `collection_delete_task` to handle document cleanup in stages (a stage-tracking sketch follows the pros and cons below):
- Stage 1: Mark collection as DELETED (current behavior)
- Stage 2: Mark all documents as DELETED
- Stage 3: Delete document indexes
- Stage 4: Clean up source files
- Stage 5: Final cleanup of database records
Pros:
- Non-blocking
- Better failure handling
- Progress tracking
- Resource cleanup can be retried
- Scalable to large collections
Cons:
- Temporary inconsistency
- More complex implementation
- Need careful status tracking
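For the status tracking, deletion progress could be persisted per collection so that a crashed or retried task knows where to resume. The enum below is a hypothetical illustration, not an existing ApeRAG model field:

```python
# Illustrative only: one possible way to persist deletion progress.
import enum
from typing import Optional


class CollectionDeletionStage(str, enum.Enum):
    MARKED_DELETED = "marked_deleted"        # Stage 1: collection flagged DELETED
    DOCUMENTS_DELETED = "documents_deleted"  # Stage 2: documents flagged DELETED
    INDEXES_DELETED = "indexes_deleted"      # Stage 3: document indexes removed
    FILES_CLEANED = "files_cleaned"          # Stage 4: source files removed
    RECORDS_CLEANED = "records_cleaned"      # Stage 5: database records purged


def next_stage(stage: CollectionDeletionStage) -> Optional[CollectionDeletionStage]:
    """Return the stage following `stage`, or None once deletion is complete."""
    stages = list(CollectionDeletionStage)
    index = stages.index(stage)
    return stages[index + 1] if index + 1 < len(stages) else None
```

Storing this value on the collection row (or on a dedicated deletion-job table) would provide the careful status tracking noted in the cons.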
Approach 3: Lazy Cleanup with Background Job
Keep current deletion behavior but add a periodic cleanup job:
- Mark resources as DELETED (current behavior)
- Run a periodic job to find and clean up orphaned files (see the sketch after the pros and cons)
- Use timestamp-based cleanup strategy
Pros:
- Simple implementation
- Low impact on main operations
- Can batch cleanup operations
Cons:
- Delayed cleanup
- Resources held longer than necessary
- More complex monitoring needed
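As an illustration of Approach 3, the periodic job could be a Celery task (scheduled via Celery beat) that sweeps documents marked DELETED for longer than a grace period. This is a sketch only; `document_service.list_documents`, `mark_files_cleaned`, and `object_store` are hypothetical helpers, not existing ApeRAG APIs.

```python
# Illustrative only: document_service, object_store and the DELETED status are
# hypothetical stand-ins for the real ApeRAG components.
from datetime import datetime, timedelta, timezone

from celery import shared_task

CLEANUP_GRACE_PERIOD = timedelta(days=1)


@shared_task
def cleanup_orphaned_files() -> int:
    """Periodic job: remove source files of documents deleted before the cutoff."""
    cutoff = datetime.now(timezone.utc) - CLEANUP_GRACE_PERIOD
    orphaned = document_service.list_documents(status="DELETED", deleted_before=cutoff)
    removed = 0
    for doc in orphaned:
        object_store.delete(doc.object_path)  # local .objects dir or S3 object
        document_service.mark_files_cleaned(doc.id)
        removed += 1
    return removed
```

The grace period is the timestamp-based strategy mentioned above: it bounds how long orphaned files linger while leaving room for a possible undelete window.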
Implementation Details (for Approach 2)
- Enhance `collection_delete_task`:

```python
from typing import Any


@app.task(bind=True)
def collection_delete_task(self, collection_id: str) -> Any:
    # Stage 1: current collection deletion logic (mark collection as DELETED)
    # Stage 2: fetch all documents and mark them as deleted
    documents = document_service.get_collection_documents(collection_id)
    for doc in documents:
        # Stages 3 & 4: delete_document removes indexes and the source file
        document_service.delete_document(doc.user, collection_id, doc.id)
    # Stage 5: final cleanup of database records
    cleanup_collection_records(collection_id)
```
- Add status tracking for deletion stages
- Implement retry mechanisms for each stage (see the sketch below)
- Add monitoring and logging for cleanup progress
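For the retry mechanism, Celery's built-in task retry could wrap each run. The `run_pending_deletion_stages` helper and the backoff policy below are assumptions for illustration; completed stages would be skipped on re-execution thanks to the status tracking above.

```python
# Illustrative only: run_pending_deletion_stages is a hypothetical helper that
# executes whichever deletion stages are still pending for the collection.
@app.task(bind=True, max_retries=5)
def collection_delete_task(self, collection_id: str) -> Any:
    try:
        run_pending_deletion_stages(collection_id)
    except Exception as exc:
        # Re-enqueue with exponential backoff; stages already recorded as done
        # are not repeated on the next attempt.
        raise self.retry(exc=exc, countdown=60 * 2 ** self.request.retries)
```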
Questions to Consider
- Consistency Requirements:
  - How strict should the consistency between collection and document status be?
  - Should we allow querying deleted collections/documents during deletion?
- Recovery Strategy:
  - How to handle partial failures during deletion?
  - Should we implement an "undelete" feature within a time window?
- Performance Impact:
  - How to handle deletion of large collections?
  - Should we implement batching for large deletions?
- Compliance:
  - Are there regulatory requirements for data deletion timing?
  - Do we need to maintain deletion audit logs?
Success Criteria
- No orphaned files remain after collection deletion
- Deletion process is reliable and recoverable
- System remains responsive during deletion
- Clear status tracking of deletion progress
- Proper error handling and retry mechanisms
- Comprehensive logging for audit purposes
Related Issues
- #XXX Storage optimization
- #YYY Resource management improvements
Next Steps
- [ ] Review and select preferred approach
- [ ] Design detailed implementation plan
- [ ] Implement monitoring and metrics
- [ ] Add deletion progress tracking
- [ ] Update documentation