
The table 'dataset_keyword_tables' is too large, resulting in insufficient server storage

Open starweavingdream opened this issue 1 year ago • 4 comments

Self Checks

  • [X] This is only for bug report, if you would like to ask a question, please head to Discussions.
  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • [X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.8.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Hello, I am using a PG database, which contains a table named 'dataset_keyword_tables' holding a large amount of data. Through testing, uploading a file of about 1 MB can increase the size of this table by as much as 300 MB, which is strange. My table has now reached 300 GB. What exactly is this table used for, and can I clean it up? Otherwise, as more and more knowledge base documents are uploaded, this table will only keep growing.

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

starweavingdream, Sep 14 '24

Hey @starweavingdream! I'm here to help you with any bugs, questions, or contributions you have for the project. Let me know how I can assist you!

The dataset_keyword_tables table in the PG database is used to store keyword tables associated with datasets. Each entry in this table includes a unique identifier, the dataset ID it is associated with, the keyword table itself (stored as text), and the data source type.

To clean up the dataset_keyword_tables table and prevent it from becoming excessively large, you can remove entries that are no longer needed, i.e. identify and delete rows whose dataset_id no longer corresponds to an existing dataset.

Here is an example of how you might perform such a cleanup operation:

from extensions.ext_database import db
from models.dataset import Dataset, DatasetKeywordTable

def cleanup_dataset_keyword_tables():
    # Find all dataset IDs that are still in use
    active_dataset_ids = {dataset.id for dataset in Dataset.query.all()}

    # Find all entries in the dataset_keyword_tables
    all_keyword_tables = DatasetKeywordTable.query.all()

    # Delete entries that are not associated with active datasets
    for keyword_table in all_keyword_tables:
        if keyword_table.dataset_id not in active_dataset_ids:
            db.session.delete(keyword_table)

    # Commit the changes to the database
    db.session.commit()

    # Note: PostgreSQL does not return disk space to the OS after DELETE.
    # Run `VACUUM FULL dataset_keyword_tables;` afterwards to reclaim storage.

# Call the cleanup function (back up the table first)
cleanup_dataset_keyword_tables()

Additionally, Dify has an automated cleanup mechanism in the form of a scheduled task named clean_unused_datasets_task that manages the size of the 'dataset_keyword_tables' table. This task removes unused datasets and their associated indexes if they haven't been updated within a specified number of days [1].
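The retention logic of such a scheduled task can be sketched as follows. This is a hedged, self-contained illustration only; the real clean_unused_datasets_task lives in Dify's scheduler, and the function name, field names, and retention window here are illustrative:

```python
from datetime import datetime, timedelta

# Hedged sketch of the staleness check a task like clean_unused_datasets_task
# might apply; retention window and names are illustrative, not Dify's actual code.
def is_stale(last_updated: datetime, retention_days: int, now: datetime) -> bool:
    """Return True if the dataset has not been updated within retention_days."""
    return now - last_updated > timedelta(days=retention_days)

now = datetime(2024, 9, 14)
print(is_stale(datetime(2024, 8, 1), retention_days=30, now=now))   # True: stale
print(is_stale(datetime(2024, 9, 10), retention_days=30, now=now))  # False: recent
```

A dataset flagged as stale would then have its rows in dataset_keyword_tables (and its indexes) removed.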

For more details on the structure and indexing of the table, you can refer to the initial database setup in the Dify platform [2][3].

If you need to limit the size of the 'dataset_keyword_tables' table, you can use the max_keywords_per_chunk setting in the KeywordTableConfig class, which controls the number of keywords extracted per chunk of text [4].
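The effect of such a cap can be shown with a small, self-contained sketch. The function and data below are hypothetical and only illustrate how limiting keywords per chunk bounds the size of the stored keyword table; the actual KeywordTableConfig in Dify may apply the limit differently:

```python
# Hypothetical illustration: truncate each chunk's keyword list to a cap,
# analogous to what a max_keywords_per_chunk setting would control.
def cap_keywords(keyword_table: dict, max_keywords: int) -> dict:
    return {chunk_id: keywords[:max_keywords]
            for chunk_id, keywords in keyword_table.items()}

table = {
    "chunk-1": ["alpha", "beta", "gamma", "delta"],
    "chunk-2": ["x"],
}
print(cap_keywords(table, max_keywords=2))
# {'chunk-1': ['alpha', 'beta'], 'chunk-2': ['x']}
```

A lower cap directly reduces the text payload written per chunk, which is what drives this table's growth.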

To continue talking to Dosu, mention @dosu.

dosubot[bot], Sep 14 '24

Same problem. Did you find any solution? I found that this table is only used by keyword search, and when indexing documents it slows down the process.

Scallions, Oct 11 '24

@Scallions I tried the solution given by the robot, and there were no particularly obvious problems after I cleared the data. If you want to try my method, I suggest you back up your data first.

starweavingdream, Oct 15 '24

@starweavingdream I need you! I'm going nuts!

glacierck, Oct 23 '24

Same problem... so difficult... The system has a lot that needs to change.

luckylhb90, Nov 01 '24

This is my solution: disable writes to this table!

https://github.com/langgenius/dify/issues/8513#issuecomment-2606650048

glacierck, Jan 22 '25

@Scallions I tried the solution given by the robot, and there were no particularly obvious problems after I cleared the data. If you want to try my method, I suggest you back up your data first.

Did it work?

gemchen, Mar 08 '25