
[Improvement] Improve Collection Deletion: Handle Document Source Files Cleanup and Resource Hierarchy Management

Open • iziang opened this issue 4 months ago • 1 comment

Current Behavior

When a collection is deleted:

  1. The collection is marked as DELETED in the database
  2. A collection_delete_task is triggered asynchronously
  3. The collection's quota is released
  4. However, the source files of the collection's documents remain in storage (either the local .objects directory or object storage such as S3)
  5. Because these orphaned files are never cleaned up, storage usage grows over time

Problem Impact

  1. Wasted Storage Space: Orphaned document files accumulate in storage
  2. Cost Implications: For cloud storage (S3), this means unnecessary ongoing storage costs
  3. Resource Management: Lack of clear cleanup strategy for hierarchical resources
  4. Potential Compliance Issues: Retaining user data that should have been deleted

Proposed Solutions

Approach 1: Synchronous Cascade Delete

Delete all child resources (documents, indexes, files) within the same transaction as the collection deletion.

Pros:

  • Immediate consistency
  • Simpler to reason about
  • ACID compliance

Cons:

  • Long-running transactions
  • Potential timeout issues
  • Higher failure risk
  • Blocks the user request
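
As a rough illustration of Approach 1, the sketch below deletes a collection and its documents in a single database transaction using SQLite's ON DELETE CASCADE. The table names, column names, and both functions are hypothetical and are not taken from the ApeRAG schema. The sketch returns the source paths because files in a local directory or in S3 sit outside the database transaction and would still need to be removed separately after the commit.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys (and cascades) when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE collections (id TEXT PRIMARY KEY)")
    conn.execute(
        """
        CREATE TABLE documents (
            id TEXT PRIMARY KEY,
            collection_id TEXT NOT NULL
                REFERENCES collections(id) ON DELETE CASCADE,
            source_path TEXT NOT NULL
        )
        """
    )
    conn.execute("INSERT INTO collections VALUES ('c1')")
    conn.executemany(
        "INSERT INTO documents VALUES (?, 'c1', ?)",
        [("d1", "objects/d1.pdf"), ("d2", "objects/d2.pdf")],
    )
    conn.commit()
    return conn


def delete_collection_cascade(conn: sqlite3.Connection, collection_id: str) -> list[str]:
    """Delete a collection and its document rows in one transaction.

    Returns the source paths of the deleted documents so the caller can
    remove the files from storage once the transaction has committed.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        paths = [
            row[0]
            for row in conn.execute(
                "SELECT source_path FROM documents WHERE collection_id = ?",
                (collection_id,),
            )
        ]
        # The cascade removes the matching document rows as well.
        conn.execute("DELETE FROM collections WHERE id = ?", (collection_id,))
    return paths
```

If removing a file fails after the commit, the database no longer knows about it, which is one way orphaned files can still appear under this approach.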

Approach 2: Asynchronous Staged Deletion (Recommended)

Enhance the existing collection_delete_task to handle document cleanup in stages:

  1. Stage 1: Mark collection as DELETED (current behavior)
  2. Stage 2: Mark all documents as DELETED
  3. Stage 3: Delete document indexes
  4. Stage 4: Clean up source files
  5. Stage 5: Final cleanup of database records
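
The five stages above could be driven by a small resumable runner. This is a minimal sketch, assuming that progress is persisted somewhere (a plain dict stands in for a database column here) and that each stage has an idempotent handler. `DeletionStage`, `run_staged_deletion`, and the handler mapping are hypothetical names, not existing ApeRAG code.

```python
from enum import IntEnum
from typing import Callable, Dict, MutableMapping


class DeletionStage(IntEnum):
    """Hypothetical stage markers mirroring the five stages listed above."""
    COLLECTION_MARKED = 1
    DOCUMENTS_MARKED = 2
    INDEXES_DELETED = 3
    FILES_DELETED = 4
    RECORDS_PURGED = 5


def run_staged_deletion(
    collection_id: str,
    handlers: Dict[DeletionStage, Callable[[str], None]],
    progress: MutableMapping[str, int],
) -> DeletionStage:
    """Run the remaining stages in order, recording each completed stage.

    If a handler raises, the exception propagates and `progress` keeps the
    last completed stage, so a later call resumes from the failed stage.
    """
    done = progress.get(collection_id, 0)
    for stage in DeletionStage:
        if stage <= done:
            continue  # already completed in an earlier run
        handlers[stage](collection_id)
        progress[collection_id] = int(stage)
    return DeletionStage(progress[collection_id])
```

Recording progress only after a handler returns means a stage may run more than once after a crash, which is why the handlers need to be safe to repeat.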

Pros:

  • Non-blocking
  • Better failure handling
  • Progress tracking
  • Resource cleanup can be retried
  • Scalable to large collections

Cons:

  • Temporary inconsistency
  • More complex implementation
  • Need careful status tracking

Approach 3: Lazy Cleanup with Background Job

Keep current deletion behavior but add a periodic cleanup job:

  1. Mark resources as DELETED (current behavior)
  2. Run periodic job to find and clean up orphaned files
  3. Use timestamp-based cleanup strategy
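
The core of such a periodic job might look like the sketch below. It assumes the job can list stored objects with their last-modified times and can query the set of paths still referenced by live documents. `find_orphaned_files` and the 24-hour grace period are illustrative assumptions, not part of the existing codebase.

```python
from datetime import datetime, timedelta
from typing import List, Mapping, Set


def find_orphaned_files(
    stored: Mapping[str, datetime],
    referenced: Set[str],
    now: datetime,
    grace: timedelta = timedelta(hours=24),
) -> List[str]:
    """Return stored paths that no live document references.

    Files modified within the grace period are skipped, since they may
    belong to an upload whose database record has not been written yet.
    """
    cutoff = now - grace
    return sorted(
        path
        for path, modified_at in stored.items()
        if path not in referenced and modified_at <= cutoff
    )
```

The grace period is what makes the strategy timestamp-based: it trades cleanup delay for protection against deleting files that are still being written.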

Pros:

  • Simple implementation
  • Low impact on main operations
  • Can batch cleanup operations

Cons:

  • Delayed cleanup
  • Resources held longer than necessary
  • More complex monitoring needed

Implementation Details (for Approach 2)

  1. Enhance collection_delete_task:
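from typing import Any

# `app`, `document_service`, and `cleanup_collection_records` are project-level
# names; their imports are omitted from this sketch.

@app.task(bind=True)
def collection_delete_task(self, collection_id: str) -> Any:
    # Stage 1: Current collection deletion logic (unchanged)

    # Stage 2: Get all documents in the collection and mark each as deleted
    documents = document_service.get_collection_documents(collection_id)
    for doc in documents:
        # Stages 3 and 4: delete_document is expected to delete the
        # document's indexes and source files
        document_service.delete_document(doc.user, collection_id, doc.id)

    # Stage 5: Final cleanup of the remaining database records
    cleanup_collection_records(collection_id)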
  2. Add status tracking for deletion stages
  3. Implement retry mechanisms for each stage
  4. Add monitoring and logging for cleanup progress
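
One possible shape for the per-stage retry and logging is sketched below in plain Python. `run_with_retry`, the logger name, and the backoff parameters are assumptions made for illustration. In the real task, Celery's own retry support may be a better fit than a hand-rolled loop.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("collection_cleanup")


def run_with_retry(
    stage_name: str,
    action: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run one cleanup stage, retrying with exponential backoff.

    Returns the attempt number that succeeded. The last failure is
    re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            action()
        except Exception:
            logger.exception("stage %s failed on attempt %d", stage_name, attempt)
            if attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
        else:
            logger.info("stage %s succeeded on attempt %d", stage_name, attempt)
            return attempt
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.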

Questions to Consider

  1. Consistency Requirements:

    • How strict should the consistency between collection and document status be?
    • Should we allow querying deleted collections/documents during deletion?
  2. Recovery Strategy:

    • How to handle partial failures during deletion?
    • Should we implement an "undelete" feature within a time window?
  3. Performance Impact:

    • How to handle deletion of large collections?
    • Should we implement batching for large deletions?
  4. Compliance:

    • Are there regulatory requirements for data deletion timing?
    • Do we need to maintain deletion audit logs?
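
On the batching question, one option is to walk a collection's document IDs in fixed-size chunks so that each chunk can be deleted (and retried) independently. The helper below is a generic sketch. `in_batches` is a hypothetical name, and the right batch size would need to be measured.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Because the helper consumes any iterable lazily, it would also work with a database cursor that streams document IDs rather than loading them all at once.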

Success Criteria

  1. No orphaned files remain after collection deletion
  2. Deletion process is reliable and recoverable
  3. System remains responsive during deletion
  4. Clear status tracking of deletion progress
  5. Proper error handling and retry mechanisms
  6. Comprehensive logging for audit purposes

Related Issues

  • #XXX Storage optimization
  • #YYY Resource management improvements

Next Steps

  1. [ ] Review and select preferred approach
  2. [ ] Design detailed implementation plan
  3. [ ] Implement monitoring and metrics
  4. [ ] Add deletion progress tracking
  5. [ ] Update documentation

iziang, Aug 20 '25 02:08

This issue has been marked as stale because it has been open for 30 days with no activity

github-actions[bot], Sep 22 '25 00:09