Add a task that deletes the old data that has not been accessed in a while

Open Vasilije1990 opened this issue 7 months ago • 9 comments

A task for automatically deleting data in the memify pipeline that hasn't been accessed by retrievers for a specified period.

NOTE: This issue is part of Contribute-to-Win. Please comment first to get assigned. Read the details here

Overview

This task identifies and removes unused data (chunks, entities, summaries, associations) from the memify pipeline based on retrieval access patterns, helping maintain system efficiency and storage optimization.

Usage

from cognee.tasks.cleanup import cleanup_unused_data

# Preview cleanup (safe mode)
result = await cleanup_unused_data(days_threshold=30, dry_run=True)

# Execute cleanup
result = await cleanup_unused_data(days_threshold=30, dry_run=False)

Parameters

Parameter       Type  Default  Description
days_threshold  int   30       Days since last access to consider unused
user_id         UUID  None     Limit to specific user's data
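
As a rough illustration of how these two parameters could drive selection, the sketch below turns days_threshold into a cutoff timestamp and optionally narrows the result to one user. The `Record` class, its `last_accessed` and `owner_id` fields, and the `find_unused` helper are hypothetical names chosen for this example; they are not part of cognee.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional
from uuid import UUID, uuid4


@dataclass
class Record:
    """Hypothetical stand-in for a stored item; not a cognee model."""
    id: UUID
    owner_id: UUID
    last_accessed: datetime


def find_unused(
    records: List[Record],
    days_threshold: int = 30,
    user_id: Optional[UUID] = None,
    now: Optional[datetime] = None,
) -> List[Record]:
    """Return records whose last access is older than `days_threshold` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_threshold)
    return [
        r for r in records
        if r.last_accessed < cutoff
        and (user_id is None or r.owner_id == user_id)
    ]
```

Passing `now` explicitly keeps the helper deterministic, which makes a dry run easy to reproduce.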

What Gets Cleaned

  • Document Chunks: Unused text segments
  • Entities: Unaccessed extracted concepts
  • Summaries: Unused generated summaries
  • Associations: Unused chunk relationships
  • Metadata: Related database records

Return Format

{
    "status": "completed",  # or "dry_run"
    "unused_count": 150,
    "deleted_count": {
        "data_items": 25,
        "chunks": 120,
        "entities": 300,
        "summaries": 45,
        "associations": 80
    },
    "cleanup_date": "2024-01-15T10:30:00Z"
}
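
A caller might consume a result of this shape roughly as follows. This is only a sketch against the example dictionary in this issue: `summarize_cleanup` is a hypothetical helper, and the field names are taken from the proposed return format, which an actual implementation may change.

```python
from datetime import datetime


def summarize_cleanup(result: dict) -> dict:
    """Condense a result shaped like the proposed return format."""
    # Add up the per-category counts reported under "deleted_count".
    total_deleted = sum(result["deleted_count"].values())
    # Swap the trailing "Z" for an explicit offset so that
    # datetime.fromisoformat also accepts it before Python 3.11.
    cleanup_date = datetime.fromisoformat(
        result["cleanup_date"].replace("Z", "+00:00")
    )
    return {
        "dry_run": result["status"] == "dry_run",
        "total_deleted": total_deleted,
        "cleanup_date": cleanup_date,
    }


# The example values from the proposed return format.
example = {
    "status": "completed",
    "unused_count": 150,
    "deleted_count": {
        "data_items": 25,
        "chunks": 120,
        "entities": 300,
        "summaries": 45,
        "associations": 80,
    },
    "cleanup_date": "2024-01-15T10:30:00Z",
}

summary = summarize_cleanup(example)
```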

Vasilije1990 • Sep 05 '25 16:09

Can I work on this?

HashimmS • Sep 06 '25 11:09

Hey @HashimmS, thanks for your interest in contributing! The issue is now assigned to you.

hande-k • Sep 10 '25 11:09

Hey @HashimmS, how is the progress? Do you have any questions? As this issue is part of the challenge, we want quick iterations :) Please update us! The issue will be unassigned if no PR is opened in the next 24 hours.

hande-k • Sep 16 '25 09:09

Hey @hande-k, I would love to work on this!

Pravesh-Sudha • Oct 06 '25 13:10

Hey @Pravesh-Sudha, nice to see you here! The issue is assigned to you now. Looking forward to your PR :)

hande-k • Oct 06 '25 14:10

Hey @hande-k, I noticed the Data model lacks a last_accessed field. To avoid schema changes, I propose using created_at to identify old data for cleanup. Is this acceptable, or should I add last_accessed with a migration?

I ask because a last_accessed field would be the key to knowing the last date on which the user accessed the data.

Here is the file: https://github.com/topoteretes/cognee/blob/main/cognee/modules/data/models/Data.py

Pravesh-Sudha • Oct 09 '25 08:10

Can this issue still be assigned?

chinu0609 • Oct 25 '25 04:10

@chinu0609 assigned

Vasilije1990 • Oct 26 '25 09:10

Hi, is this issue resolved? If not, I would love to work on it.

BHIMASAIKAUSHIK • Nov 21 '25 19:11