[Improvement] Improve Collection Deletion: Handle Document Source Files Cleanup and Resource Hierarchy Management
Current Behavior
When a collection is deleted:
- The collection is marked as DELETED in the database
- A `collection_delete_task` is triggered asynchronously
- The collection's quota is released
- However, the source files of documents within the collection remain in storage (either the local `.objects` directory or object storage like S3)
- This leads to accumulating storage usage over time, as these orphaned files are never cleaned up
Problem Impact
- Wasted Storage Space: Orphaned document files accumulate in storage
- Cost Implications: For cloud storage (S3), this means unnecessary ongoing storage costs
- Resource Management: Lack of clear cleanup strategy for hierarchical resources
- Potential Compliance Issues: Retaining user data that should have been deleted
Proposed Solutions
Approach 1: Synchronous Cascade Delete
Delete all child resources (documents, indexes, files) within the same transaction as the collection deletion (see the sketch after the pros and cons).
Pros:
- Immediate consistency
- Simpler to reason about
- ACID compliance
Cons:
- Long-running transactions
- Potential timeout issues
- Higher failure risk
- Blocks the user request
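To make Approach 1 concrete, here is a minimal sketch of a single-transaction cascade delete. It is illustrative only: it uses `sqlite3` as a stand-in driver, and the table names, columns, and `delete_collection_sync` helper are assumptions, not the actual ApeRAG schema.

```python
# Illustrative only: table layout, column names and file handling are assumptions.
import sqlite3


def delete_collection_sync(conn: sqlite3.Connection, collection_id: str) -> None:
    """Delete a collection and all of its child rows in a single transaction."""
    with conn:  # commits on success, rolls back on any exception
        docs = conn.execute(
            "SELECT id, object_path FROM documents WHERE collection_id = ?",
            (collection_id,),
        ).fetchall()
        for doc_id, object_path in docs:
            conn.execute(
                "DELETE FROM document_indexes WHERE document_id = ?", (doc_id,)
            )
            conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
            # The source file (local .objects path or S3 object) would also have
            # to be removed here, which is what makes the request long-running.
        conn.execute("DELETE FROM collections WHERE id = ?", (collection_id,))
```

Everything, including source-file removal, has to finish before the transaction commits, which is exactly where the long-running-transaction and timeout risks above come from.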
Approach 2: Asynchronous Staged Deletion (Recommended)
Enhance the existing `collection_delete_task` to handle document cleanup in stages (a stage-tracking sketch follows the pros and cons below):
- Stage 1: Mark collection as DELETED (current behavior)
- Stage 2: Mark all documents as DELETED
- Stage 3: Delete document indexes
- Stage 4: Clean up source files
- Stage 5: Final cleanup of database records
Pros:
- Non-blocking
- Better failure handling
- Progress tracking
- Resource cleanup can be retried
- Scalable to large collections
Cons:
- Temporary inconsistency
- More complex implementation
- Need careful status tracking
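For the status tracking, deletion progress could be persisted per collection so that a crashed or retried task knows where to resume. The enum below is a hypothetical illustration, not an existing ApeRAG model field:

```python
# Illustrative only: one possible way to persist deletion progress.
import enum
from typing import Optional


class CollectionDeletionStage(str, enum.Enum):
    MARKED_DELETED = "marked_deleted"        # Stage 1: collection flagged DELETED
    DOCUMENTS_DELETED = "documents_deleted"  # Stage 2: documents flagged DELETED
    INDEXES_DELETED = "indexes_deleted"      # Stage 3: document indexes removed
    FILES_CLEANED = "files_cleaned"          # Stage 4: source files removed
    RECORDS_CLEANED = "records_cleaned"      # Stage 5: database records purged


def next_stage(stage: CollectionDeletionStage) -> Optional[CollectionDeletionStage]:
    """Return the stage following `stage`, or None once deletion is complete."""
    stages = list(CollectionDeletionStage)
    index = stages.index(stage)
    return stages[index + 1] if index + 1 < len(stages) else None
```

Storing this value on the collection row (or on a dedicated deletion-job table) would provide the careful status tracking noted in the cons.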
Approach 3: Lazy Cleanup with Background Job
Keep current deletion behavior but add a periodic cleanup job:
- Mark resources as DELETED (current behavior)
- Run a periodic job to find and clean up orphaned files (see the sketch after the pros and cons)
- Use timestamp-based cleanup strategy
Pros:
- Simple implementation
- Low impact on main operations
- Can batch cleanup operations
Cons:
- Delayed cleanup
- Resources held longer than necessary
- More complex monitoring needed
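As an illustration of Approach 3, the periodic job could be a Celery task (scheduled via Celery beat) that sweeps documents marked DELETED for longer than a grace period. This is a sketch only; `document_service.list_documents`, `mark_files_cleaned`, and `object_store` are hypothetical helpers, not existing ApeRAG APIs.

```python
# Illustrative only: document_service, object_store and the DELETED status are
# hypothetical stand-ins for the real ApeRAG components.
from datetime import datetime, timedelta, timezone

from celery import shared_task

CLEANUP_GRACE_PERIOD = timedelta(days=1)


@shared_task
def cleanup_orphaned_files() -> int:
    """Periodic job: remove source files of documents deleted before the cutoff."""
    cutoff = datetime.now(timezone.utc) - CLEANUP_GRACE_PERIOD
    orphaned = document_service.list_documents(status="DELETED", deleted_before=cutoff)
    removed = 0
    for doc in orphaned:
        object_store.delete(doc.object_path)  # local .objects dir or S3 object
        document_service.mark_files_cleaned(doc.id)
        removed += 1
    return removed
```

The grace period is the timestamp-based strategy mentioned above: it bounds how long orphaned files linger while leaving room for a possible undelete window.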
Implementation Details (for Approach 2)
- Enhance `collection_delete_task`:

```python
from typing import Any


@app.task(bind=True)
def collection_delete_task(self, collection_id: str) -> Any:
    # Stage 1: current collection deletion logic (mark collection as DELETED)
    # Stage 2: fetch all documents and mark them as deleted
    documents = document_service.get_collection_documents(collection_id)
    for doc in documents:
        # Stages 3 & 4: delete_document removes indexes and the source file
        document_service.delete_document(doc.user, collection_id, doc.id)
    # Stage 5: final cleanup of database records
    cleanup_collection_records(collection_id)
```
- Add status tracking for deletion stages
- Implement retry mechanisms for each stage (see the sketch below)
- Add monitoring and logging for cleanup progress
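For the retry mechanism, Celery's built-in task retry could wrap each run. The `run_pending_deletion_stages` helper and the backoff policy below are assumptions for illustration; completed stages would be skipped on re-execution thanks to the status tracking above.

```python
# Illustrative only: run_pending_deletion_stages is a hypothetical helper that
# executes whichever deletion stages are still pending for the collection.
@app.task(bind=True, max_retries=5)
def collection_delete_task(self, collection_id: str) -> Any:
    try:
        run_pending_deletion_stages(collection_id)
    except Exception as exc:
        # Re-enqueue with exponential backoff; stages already recorded as done
        # are not repeated on the next attempt.
        raise self.retry(exc=exc, countdown=60 * 2 ** self.request.retries)
```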
Questions to Consider
- Consistency Requirements:
  - How strict should the consistency between collection and document status be?
  - Should we allow querying deleted collections/documents during deletion?
- Recovery Strategy:
  - How to handle partial failures during deletion?
  - Should we implement an "undelete" feature within a time window?
- Performance Impact:
  - How to handle deletion of large collections?
  - Should we implement batching for large deletions?
- Compliance:
  - Are there regulatory requirements for data deletion timing?
  - Do we need to maintain deletion audit logs?
Success Criteria
- No orphaned files remain after collection deletion
- Deletion process is reliable and recoverable
- System remains responsive during deletion
- Clear status tracking of deletion progress
- Proper error handling and retry mechanisms
- Comprehensive logging for audit purposes
Related Issues
- #XXX Storage optimization
- #YYY Resource management improvements
Next Steps
- [ ] Review and select preferred approach
- [ ] Design detailed implementation plan
- [ ] Implement monitoring and metrics
- [ ] Add deletion progress tracking
- [ ] Update documentation