[BUG] Physical Files Not Deleted When Removing Documents via UI
Description
When deleting documents through the Kotaemon UI, the file appears to be removed from the interface, but the corresponding physical file remains on the file system (e.g., in the LanceDB data directory at /ktem_app_data/user_data/docstore/index_1.lance/data). This issue also affects other storage types such as ChromaDB, chunked .md files, and uploaded .xlsx files, leading to potential storage bloat and state inconsistencies.
Steps to Reproduce:
Upload a file using the Kotaemon UI.
Delete the file by clicking the delete button in the UI.
Verify that while the UI no longer shows the file, it still exists in the file system.
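The last step can be checked without browsing the directories by hand. The following is a minimal sketch, not part of Kotaemon: the helper names `snapshot` and `leftover_files` are hypothetical, and the directory to inspect is whatever data directory your installation uses.

```python
from pathlib import Path


def snapshot(root):
    """Return the set of file paths (relative to root) currently on disk."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def leftover_files(before_upload, after_delete):
    """Files present after the delete that were absent before the upload."""
    return sorted(after_delete - before_upload)
```

Take one snapshot before uploading and another after deleting through the UI, then pass both to `leftover_files`. A non-empty result lists the files that the deletion left behind.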
Expected Behavior:
The deletion process should remove the file both from the UI (logical deletion in the database) and from the file system (physical deletion). This can be achieved by incorporating file system operations such as os.remove() for files or shutil.rmtree() for directories after the corresponding database records are deleted.
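As an illustration of the file system step described above, here is a minimal sketch of a helper that removes either a file or a directory tree. The name `remove_path` is hypothetical and does not exist in Kotaemon; only `os.remove()` and `shutil.rmtree()` come from the standard library.

```python
import os
import shutil
from pathlib import Path


def remove_path(path):
    """Delete a file or a directory tree; return True if something was removed."""
    path = Path(path)
    # Real directories are removed recursively; symlinks are handled below.
    if path.is_dir() and not path.is_symlink():
        shutil.rmtree(path)
        return True
    # Regular files and symlinks (including broken ones) are removed directly.
    if path.exists() or path.is_symlink():
        os.remove(path)
        return True
    return False
```

Returning a boolean instead of raising on a missing path keeps the cleanup idempotent, so calling it again after a partial failure is harmless.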
Actual Behavior:
The UI triggers the deletion of database records (e.g., from the Source and Index tables) through the delete_event function, but it does not perform any physical file deletion. Consequently, files remain in directories such as the LanceDB data directory, even though they are no longer listed in the UI.
Suspected Cause:
The current deletion logic focuses solely on removing database entries and neglects to address the removal of the actual files from disk. The functions (like self._index._docstore.delete(ds_ids) and self._index._vs.delete(vs_ids)) likely handle only logical deletion, without invoking any file system deletion commands.
Browsers
Chrome
OS
macOS
Related: https://github.com/Cinnamon/kotaemon/issues/493. Fixes are planned for the next release.
As a non-expert, I wonder whether a change like this might work. After analyzing the bug, I see that the database records are deleted while the physical files remain on disk. Here is one possible solution that might address the issue:
- Add file cleanup helper to FileIndexPage class in
libs/ktem/ktem/index/file/ui.py
# Assumption: ui.py already imports shutil, Path (from pathlib), and flowsettings.
def _cleanup_physical_files(self, file_stem):
    """Clean up physical files related to the deleted document."""
    if not file_stem:
        return

    # Clean up chunk, markdown, and zip outputs
    for directory in [
        flowsettings.KH_CHUNKS_OUTPUT_DIR,
        flowsettings.KH_MARKDOWN_OUTPUT_DIR,
        flowsettings.KH_ZIP_OUTPUT_DIR,
    ]:
        for file_path in Path(directory).glob(f"*{file_stem}*"):
            try:
                if file_path.is_file():
                    file_path.unlink()
                elif file_path.is_dir():
                    shutil.rmtree(file_path)
                print(f"Deleted: {file_path}")
            except Exception as e:
                print(f"Error deleting {file_path}: {e}")

    # Check for original uploaded files in user_data/files/index_*
    index_file_dir = Path(flowsettings.KH_USER_DATA_DIR) / "files" / f"index_{self._index.id}"
    if index_file_dir.exists():
        for file_path in index_file_dir.glob("*"):
            # Files are stored under hash names, so they cannot be matched
            # by stem; each file would need to be checked by its content.
            try:
                if file_path.is_file():
                    # WARNING: no such check is implemented yet. As written,
                    # this unlinks every file in the directory, not only the
                    # copy that belongs to the deleted document. A real match
                    # is needed here before this is safe to use.
                    file_path.unlink()
                    print(f"Deleted file: {file_path}")
            except Exception as e:
                print(f"Error deleting {file_path}: {e}")
- Update the delete_event method in
libs/ktem/ktem/index/file/ui.py
def delete_event(self, file_id):
    file_name = ""
    file_stem = ""
    with Session(engine) as session:
        source = session.execute(
            select(self._index._resources["Source"]).where(
                self._index._resources["Source"].id == file_id
            )
        ).first()
        if source:
            file_name = source[0].name
            file_stem = Path(file_name).stem  # Extract stem before deletion
            session.delete(source[0])

        vs_ids, ds_ids = [], []
        index = session.execute(
            select(self._index._resources["Index"]).where(
                self._index._resources["Index"].source_id == file_id
            )
        ).all()
        for each in index:
            if each[0].relation_type == "vector":
                vs_ids.append(each[0].target_id)
            elif each[0].relation_type == "document":
                ds_ids.append(each[0].target_id)
            session.delete(each[0])
        session.commit()

    # Delete from vector store and document store
    if vs_ids:
        self._index._vs.delete(vs_ids)
    self._index._docstore.delete(ds_ids)

    # Add this line to clean up physical files
    self._cleanup_physical_files(file_stem)

    gr.Info(f"File {file_name} has been deleted")
    return None, self.selected_panel_false
- Update the delete_all_files method in the same file
def delete_all_files(self, file_list):
    for file_id in file_list.id.values:
        if file_id == "-":  # Skip placeholder row
            continue
        # Delete database records and perform standard cleanup
        self.delete_event(file_id)
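One point to watch in the proposal above is how the files to delete are selected. A stem that contains glob characters such as `[` changes the meaning of the pattern, and the upload directory needs a way to target one file rather than all of them. The sketch below shows one possible way to narrow both. The names `files_for_stem` and `remove_stored_copy` are hypothetical, and `stored_name` is an assumption: it stands for the on-disk name of the upload, which would have to come from the database record. Whether Kotaemon exposes that name is not confirmed here.

```python
import glob
from pathlib import Path


def files_for_stem(directory, file_stem):
    """Paths in `directory` whose name contains `file_stem`, taken literally."""
    # glob.escape neutralises *, ? and [ so the stem is not read as a pattern.
    pattern = f"*{glob.escape(file_stem)}*"
    return sorted(Path(directory).glob(pattern))


def remove_stored_copy(directory, stored_name):
    """Delete only `directory/stored_name`; return True if a file was removed.

    ASSUMPTION: the caller knows the on-disk name (for example a hash).
    """
    directory = Path(directory).resolve()
    target = (directory / stored_name).resolve()
    # Refuse names that would escape the directory (e.g. "../other").
    if target.parent != directory or not target.is_file():
        return False
    target.unlink()
    return True
```

Even with escaping, the `*stem*` pattern still matches any file whose name merely contains the stem, so two documents with overlapping names could affect each other. An exact identifier is safer where one is available.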