
Knowledge base keyword storage design issues

Open · yunfeng13 opened this issue 1 year ago • 4 comments

Self Checks

  • [X] This is only for bug report, if you would like to ask a question, please head to Discussions.
  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • [X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.8.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Every time a new document is uploaded to the knowledge base, the entire keyword_table field is rewritten. This is inefficient, and as the field grows very large it leads to errors and read/write congestion.

✔️ Expected Behavior

Is this by design, or can we discuss possible solutions?

❌ Actual Behavior

When a single knowledge base contains a large number of files, modifications to the knowledge base and embedding of new files become blocked, resulting in errors.

yunfeng13 avatar Sep 18 '24 05:09 yunfeng13
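
For anyone trying to reproduce this, one way to observe the reported behavior (assuming the default PostgreSQL database used by the Docker setup) is to watch the update counters on dataset_keyword_tables: if the table is rewritten on every upload, n_tup_upd should climb with each document added. This is only an illustrative check, not part of the original report:

    -- Watch insert/update counts for the keyword table storage
    SELECT relname, n_tup_ins, n_tup_upd, n_live_tup
    FROM pg_stat_user_tables
    WHERE relname = 'dataset_keyword_tables';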

I found a similar unsolved discussion that might be relevant to your issue: Details about knowledge base and document maintenance [1].

Regarding your concern, the design of the keyword_table field in the knowledge base is not intended to fully update with each new document. Instead, the code provides mechanisms to incrementally update the keyword_table with new documents, which helps to avoid inefficiency and errors.

Here are the relevant parts of the code that demonstrate this:

  1. Adding Texts Incrementally:

    def add_texts(self, texts: list[Document], **kwargs):
        lock_name = "keyword_indexing_lock_{}".format(self.dataset.id)
        # acquire a per-dataset Redis lock so concurrent keyword-index updates are serialized
        with redis_client.lock(lock_name, timeout=600):
            keyword_table_handler = JiebaKeywordTableHandler()
            keyword_table = self._get_dataset_keyword_table()
            keywords_list = kwargs.get("keywords_list", None)
            for i in range(len(texts)):
                text = texts[i]
                if keywords_list:
                    keywords = keywords_list[i]
                    if not keywords:
                        keywords = keyword_table_handler.extract_keywords(
                            text.page_content, self._config.max_keywords_per_chunk
                        )
                else:
                    keywords = keyword_table_handler.extract_keywords(
                        text.page_content, self._config.max_keywords_per_chunk
                    )
                self._update_segment_keywords(self.dataset.id, text.metadata["doc_id"], list(keywords))
                keyword_table = self._add_text_to_keyword_table(keyword_table, text.metadata["doc_id"], list(keywords))
    
            # persist the merged keyword table in a single write after all new texts are added
            self._save_dataset_keyword_table(keyword_table)
    
  2. Updating Segment Keywords:

    def _update_segment_keywords(self, dataset_id: str, node_id: str, keywords: list[str]):
        document_segment = (
            db.session.query(DocumentSegment)
            .filter(DocumentSegment.dataset_id == dataset_id, DocumentSegment.index_node_id == node_id)
            .first()
        )
        if document_segment:
            document_segment.keywords = keywords
            db.session.add(document_segment)
            db.session.commit()
    
  3. Adding Text to Keyword Table:

    def _add_text_to_keyword_table(self, keyword_table: dict, id: str, keywords: list[str]) -> dict:
        for keyword in keywords:
            if keyword not in keyword_table:
                keyword_table[keyword] = set()
            keyword_table[keyword].add(id)
        return keyword_table
    

Additionally, the DatasetKeywordTable class in dataset.py provides a keyword_table_dict property that loads the keyword table data efficiently, either from the database or from file storage, so the table can be managed without needing a full rebuild:

class DatasetKeywordTable(db.Model):
    __tablename__ = "dataset_keyword_tables"
    __table_args__ = (
        db.PrimaryKeyConstraint("id", name="dataset_keyword_table_pkey"),
        db.Index("dataset_keyword_table_dataset_id_idx", "dataset_id"),
    )

    id = db.Column(StringUUID, primary_key=True, server_default=db.text("uuid_generate_v4()"))
    dataset_id = db.Column(StringUUID, nullable=False, unique=True)
    keyword_table = db.Column(db.Text, nullable=False)
    data_source_type = db.Column(
        db.String(255), nullable=False, server_default=db.text("'database'::character varying")
    )

    @property
    def keyword_table_dict(self):
        class SetDecoder(json.JSONDecoder):
            def __init__(self, *args, **kwargs):
                super().__init__(object_hook=self.object_hook, *args, **kwargs)

            def object_hook(self, dct):
                if isinstance(dct, dict):
                    for keyword, node_idxs in dct.items():
                        if isinstance(node_idxs, list):
                            dct[keyword] = set(node_idxs)
                return dct

        # get dataset
        dataset = Dataset.query.filter_by(id=self.dataset_id).first()
        if not dataset:
            return None
        if self.data_source_type == "database":
            return json.loads(self.keyword_table, cls=SetDecoder) if self.keyword_table else None
        else:
            file_key = "keyword_files/" + dataset.tenant_id + "/" + self.dataset_id + ".txt"
            try:
                keyword_table_text = storage.load_once(file_key)
                if keyword_table_text:
                    return json.loads(keyword_table_text.decode("utf-8"), cls=SetDecoder)
                return None
            except Exception as e:
                logging.exception(str(e))
                return None

These methods ensure that the keyword_table is updated incrementally rather than being fully rebuilt with each new document, thus improving efficiency and reducing the likelihood of errors [2][3].

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 18 '24 05:09 dosubot[bot]
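
As an aside on the keyword_table_dict property quoted above: whether a dataset's keyword table lives in the database column or in file storage is recorded in data_source_type. A quick way to see which mode each dataset uses, and how much text is stored inline (illustrative SQL against the model shown above, assuming the default PostgreSQL backend):

    -- Storage mode and inline size of each dataset's keyword table
    SELECT dataset_id,
           data_source_type,
           length(keyword_table) AS stored_characters
    FROM dataset_keyword_tables
    ORDER BY length(keyword_table) DESC
    LIMIT 10;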

@JohnJyong

Yawen-1010 avatar Oct 22 '24 08:10 Yawen-1010

@Yawen-1010 Is there any new progress? A single record in this table has reached almost 500 GB in storage, and I only uploaded a 2 GB document.

glacierck avatar Oct 23 '24 06:10 glacierck
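
To check whether a deployment is hitting the same problem, the on-disk size of the serialized keyword table can be measured directly (a minimal sketch using standard PostgreSQL functions; not part of the original comment):

    -- Per-dataset size of the stored keyword_table value (as stored, i.e. after TOAST compression)
    SELECT dataset_id,
           pg_size_pretty(pg_column_size(keyword_table)::bigint) AS keyword_table_size
    FROM dataset_keyword_tables
    ORDER BY pg_column_size(keyword_table) DESC;

    -- Total on-disk size of the table, including TOAST data and indexes
    SELECT pg_size_pretty(pg_total_relation_size('dataset_keyword_tables'));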

Hi, @18075717849. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary

  • You reported a design problem in Dify version 0.8.1 related to inefficient updates and large data sizes in the keyword_table field.
  • I provided an explanation of the intended incremental update mechanism to address these inefficiencies.
  • Another user, glacierck, noted a significant storage size issue, with a single record reaching almost 500 GB after a 2 GB document upload, and asked about progress.

Next Steps

  • Please let us know if this issue is still relevant to the latest version of the Dify repository by commenting here.
  • If there is no further activity, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot[bot] avatar Nov 23 '24 16:11 dosubot[bot]

This is my solution: disable the data in this table!

    CREATE OR REPLACE FUNCTION delete_on_insert() RETURNS TRIGGER AS $$
    BEGIN
        -- delete the newly inserted record
        DELETE FROM dataset_keyword_tables WHERE id = NEW.id;
        -- return NULL to indicate the trigger has handled the event and no further processing is needed
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER before_insert_delete
        BEFORE INSERT ON dataset_keyword_tables
        FOR EACH ROW EXECUTE FUNCTION delete_on_insert();

    -- confirm the trigger is installed
    SELECT tgname AS trigger_name, pg_get_triggerdef(oid) AS trigger_definition
    FROM pg_trigger
    WHERE tgrelid = 'dataset_keyword_tables'::regclass;

glacierck avatar Jan 22 '25 09:01 glacierck
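
If anyone adopts the trigger workaround above and later wants to restore the default behavior, the trigger and its function can be removed again (standard PostgreSQL DDL, not part of the original comment). Note that while the trigger is active no keyword table is stored at all, so keyword-based retrieval for the affected datasets presumably will not work:

    -- Revert the workaround: remove the trigger and its function
    DROP TRIGGER IF EXISTS before_insert_delete ON dataset_keyword_tables;
    DROP FUNCTION IF EXISTS delete_on_insert();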