Add a task that deletes the old data that has not been accessed in a while
A task for automatically deleting data in the memify pipeline that hasn't been accessed by retrievers for a specified period.
NOTE: This issue is part of Contribute-to-Win. Please comment first to get assigned. Read the details here
Overview
This task identifies and removes unused data (chunks, entities, summaries, associations) from the memify pipeline based on retrieval access patterns, helping maintain system efficiency and storage optimization.
Usage
from cognee.tasks.cleanup import cleanup_unused_data
# Preview cleanup (safe mode)
result = await cleanup_unused_data(days_threshold=30, dry_run=True)
# Execute cleanup
result = await cleanup_unused_data(days_threshold=30, dry_run=False)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| days_threshold | int | 30 | Days since last access to consider unused |
| user_id | UUID | None | Limit to specific user's data |
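The two parameters above suggest a selection step along the following lines. This is only a minimal sketch, assuming each stored item carries an owner and a last-access timestamp; `Record`, `owner_id`, `last_accessed`, and `select_unused` are hypothetical names for illustration and are not taken from cognee's actual models or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional
from uuid import UUID, uuid4


@dataclass
class Record:
    """Hypothetical stand-in for a stored item (not cognee's real model)."""
    id: UUID
    owner_id: UUID
    last_accessed: datetime


def select_unused(
    records: List[Record],
    days_threshold: int = 30,
    user_id: Optional[UUID] = None,
    now: Optional[datetime] = None,
) -> List[Record]:
    """Return records last accessed before the cutoff, optionally for one user."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_threshold)
    return [
        r
        for r in records
        if r.last_accessed < cutoff
        and (user_id is None or r.owner_id == user_id)
    ]
```

Passing `now` explicitly keeps the cutoff deterministic, which makes the selection easy to test without touching any database.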
What Gets Cleaned
- Document Chunks: Unused text segments
- Entities: Unaccessed extracted concepts
- Summaries: Unused generated summaries
- Associations: Unused chunk relationships
- Metadata: Related database records
Return Format
{
    "status": "completed",  # or "dry_run"
    "unused_count": 150,
    "deleted_count": {
        "data_items": 25,
        "chunks": 120,
        "entities": 300,
        "summaries": 45,
        "associations": 80
    },
    "cleanup_date": "2024-01-15T10:30:00Z"
}
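One way the result above could be assembled is sketched below. The helper name `build_result` is hypothetical, and reporting zero deletions in dry-run mode is an assumption; the issue does not say what `deleted_count` should contain when nothing is actually removed.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Category names taken from the return format in this issue.
CATEGORIES = ("data_items", "chunks", "entities", "summaries", "associations")


def build_result(
    unused_count: int,
    deleted: Dict[str, int],
    dry_run: bool,
    now: Optional[datetime] = None,
) -> dict:
    """Assemble the result dict (hypothetical helper, not cognee's API)."""
    now = now or datetime.now(timezone.utc)
    # Assumption: a dry run deletes nothing, so every count is reported as 0.
    counts = {
        name: 0 if dry_run else int(deleted.get(name, 0))
        for name in CATEGORIES
    }
    return {
        "status": "dry_run" if dry_run else "completed",
        "unused_count": unused_count,
        "deleted_count": counts,
        # ISO 8601 timestamp in UTC with a trailing "Z", as in the example.
        "cleanup_date": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```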
can i work on this?
hey @HashimmS thanks for the interest in contributing! The issue is now assigned to you.
hey @HashimmS, how is the progress? Do you have a question? As this issue is a part of the challenge, we want to have quick iterations :) please update us! the issue will be un-assigned if no PR is opened in the next 24 hrs
Hey @hande-k, I would love to work on this!
hey @Pravesh-Sudha nice to see you here! the issue is assigned to you now. Looking forward to your PR :)
Hey @hande-k, I noticed the Data model lacks a last_accessed field. To avoid schema changes, I propose using created_at to identify old data for cleanup. Is this acceptable, or should I add last_accessed with a migration?
Because this field would be the key to knowing when the user last accessed the data.
Here is the file: https://github.com/topoteretes/cognee/blob/main/cognee/modules/data/models/Data.py
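To make the trade-off concrete, here is a rough sketch of the staleness check under both options. It assumes a nullable `last_accessed` timestamp that falls back to `created_at` when it is missing; `is_stale` and its parameters are hypothetical and not part of the current Data model.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    created_at: datetime,
    last_accessed: Optional[datetime],
    days_threshold: int,
    now: datetime,
) -> bool:
    """Hypothetical check: prefer last_accessed, fall back to created_at."""
    # With no access timestamp recorded, the creation time is the only signal.
    reference = last_accessed if last_accessed is not None else created_at
    return reference < now - timedelta(days=days_threshold)


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
created = now - timedelta(days=200)

# created_at only: an old item looks stale even if it was read yesterday.
stale_without_access_info = is_stale(created, None, 30, now)

# With last_accessed tracked, the same item is kept.
stale_with_access_info = is_stale(created, now - timedelta(days=1), 30, now)
```

Relying on `created_at` alone avoids a migration, but it measures age rather than access, so frequently retrieved old data would still be removed.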
Can this issue still be assigned??
@chinu0609 assigned
Hi, is this issue resolved? If not, I would love to work on it.