
Knowledge base keyword storage design issues

Open · yunfeng13 opened this issue 1 year ago • 4 comments

Self Checks

  • [X] This is only for bug report, if you would like to ask a question, please head to Discussions.
  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • [X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.8.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Every time a new document is uploaded to the knowledge base, the entire keyword_table field is rewritten. This is inefficient, and as the field grows very large it leads to errors and read/write congestion.

✔️ Expected Behavior

Is this by design, or can we discuss possible solutions?

❌ Actual Behavior

When a single knowledge base contains a large number of files, modifications to the knowledge base and embedding of new files become blocked, resulting in errors.

yunfeng13 avatar Sep 18 '24 05:09 yunfeng13
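
For anyone trying to reproduce this, one way to observe the reported behavior (assuming the default PostgreSQL database used by the Docker setup) is to watch the update counters on dataset_keyword_tables: if the table is rewritten on every upload, n_tup_upd should climb with each document added. This is only an illustrative check, not part of the original report:

    -- Watch insert/update counts for the keyword table storage
    SELECT relname, n_tup_ins, n_tup_upd, n_live_tup
    FROM pg_stat_user_tables
    WHERE relname = 'dataset_keyword_tables';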

I found a similar unsolved discussion that might be relevant to your issue: Details about knowledge base and document maintenance [1].

Regarding your concern, the design of the keyword_table field in the knowledge base is not intended to fully update with each new document. Instead, the code provides mechanisms to incrementally update the keyword_table with new documents, which helps to avoid inefficiency and errors.

Here are the relevant parts of the code that demonstrate this:

  1. Adding Texts Incrementally:

    def add_texts(self, texts: list[Document], **kwargs):
        lock_name = "keyword_indexing_lock_{}".format(self.dataset.id)
        # acquire a per-dataset Redis lock so concurrent keyword-index updates are serialized
        with redis_client.lock(lock_name, timeout=600):
            keyword_table_handler = JiebaKeywordTableHandler()
            keyword_table = self._get_dataset_keyword_table()
            keywords_list = kwargs.get("keywords_list", None)
            for i in range(len(texts)):
                text = texts[i]
                if keywords_list:
                    keywords = keywords_list[i]
                    if not keywords:
                        keywords = keyword_table_handler.extract_keywords(
                            text.page_content, self._config.max_keywords_per_chunk
                        )
                else:
                    keywords = keyword_table_handler.extract_keywords(
                        text.page_content, self._config.max_keywords_per_chunk
                    )
                self._update_segment_keywords(self.dataset.id, text.metadata["doc_id"], list(keywords))
                keyword_table = self._add_text_to_keyword_table(keyword_table, text.metadata["doc_id"], list(keywords))
    
            # persist the merged keyword table in a single write after all new texts are added
            self._save_dataset_keyword_table(keyword_table)
    
  2. Updating Segment Keywords:

    def _update_segment_keywords(self, dataset_id: str, node_id: str, keywords: list[str]):
        document_segment = (
            db.session.query(DocumentSegment)
            .filter(DocumentSegment.dataset_id == dataset_id, DocumentSegment.index_node_id == node_id)
            .first()
        )
        if document_segment:
            document_segment.keywords = keywords
            db.session.add(document_segment)
            db.session.commit()
    
  3. Adding Text to Keyword Table:

    def _add_text_to_keyword_table(self, keyword_table: dict, id: str, keywords: list[str]) -> dict:
        for keyword in keywords:
            if keyword not in keyword_table:
                keyword_table[keyword] = set()
            keyword_table[keyword].add(id)
        return keyword_table
    

Additionally, the DatasetKeywordTable class in dataset.py provides a keyword_table_dict property that loads the keyword table data efficiently, either from the database or from file storage, so the table can be managed without needing a full rebuild:

class DatasetKeywordTable(db.Model):
    __tablename__ = "dataset_keyword_tables"
    __table_args__ = (
        db.PrimaryKeyConstraint("id", name="dataset_keyword_table_pkey"),
        db.Index("dataset_keyword_table_dataset_id_idx", "dataset_id"),
    )

    id = db.Column(StringUUID, primary_key=True, server_default=db.text("uuid_generate_v4()"))
    dataset_id = db.Column(StringUUID, nullable=False, unique=True)
    keyword_table = db.Column(db.Text, nullable=False)
    data_source_type = db.Column(
        db.String(255), nullable=False, server_default=db.text("'database'::character varying")
    )

    @property
    def keyword_table_dict(self):
        class SetDecoder(json.JSONDecoder):
            def __init__(self, *args, **kwargs):
                super().__init__(object_hook=self.object_hook, *args, **kwargs)

            def object_hook(self, dct):
                if isinstance(dct, dict):
                    for keyword, node_idxs in dct.items():
                        if isinstance(node_idxs, list):
                            dct[keyword] = set(node_idxs)
                return dct

        # get dataset
        dataset = Dataset.query.filter_by(id=self.dataset_id).first()
        if not dataset:
            return None
        if self.data_source_type == "database":
            return json.loads(self.keyword_table, cls=SetDecoder) if self.keyword_table else None
        else:
            file_key = "keyword_files/" + dataset.tenant_id + "/" + self.dataset_id + ".txt"
            try:
                keyword_table_text = storage.load_once(file_key)
                if keyword_table_text:
                    return json.loads(keyword_table_text.decode("utf-8"), cls=SetDecoder)
                return None
            except Exception as e:
                logging.exception(str(e))
                return None

These methods ensure that the keyword_table is updated incrementally rather than being fully rebuilt with each new document, thus improving efficiency and reducing the likelihood of errors [2][3].

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 18 '24 05:09 dosubot[bot]
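
As an aside on the keyword_table_dict property quoted above: whether a dataset's keyword table lives in the database column or in file storage is recorded in data_source_type. A quick way to see which mode each dataset uses, and how much text is stored inline (illustrative SQL against the model shown above, assuming the default PostgreSQL backend):

    -- Storage mode and inline size of each dataset's keyword table
    SELECT dataset_id,
           data_source_type,
           length(keyword_table) AS stored_characters
    FROM dataset_keyword_tables
    ORDER BY length(keyword_table) DESC
    LIMIT 10;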

@JohnJyong

Yawen-1010 avatar Oct 22 '24 08:10 Yawen-1010

@Yawen-1010 Is there any new progress? A single record in this table has reached almost 500 GB in storage, and I only uploaded a 2 GB document.

glacierck avatar Oct 23 '24 06:10 glacierck
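
To check whether a deployment is hitting the same problem, the on-disk size of the serialized keyword table can be measured directly (a minimal sketch using standard PostgreSQL functions; not part of the original comment):

    -- Per-dataset size of the stored keyword_table value (as stored, i.e. after TOAST compression)
    SELECT dataset_id,
           pg_size_pretty(pg_column_size(keyword_table)::bigint) AS keyword_table_size
    FROM dataset_keyword_tables
    ORDER BY pg_column_size(keyword_table) DESC;

    -- Total on-disk size of the table, including TOAST data and indexes
    SELECT pg_size_pretty(pg_total_relation_size('dataset_keyword_tables'));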

Hi, @18075717849. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary

  • You reported a design problem in Dify version 0.8.1 related to inefficient updates and large data sizes in the keyword_table field.
  • I provided an explanation of the intended incremental update mechanism to address these inefficiencies.
  • Another user, glacierck, noted a significant storage size issue, with a single record reaching almost 500 GB after a 2 GB document upload, and asked about progress.

Next Steps

  • Please let us know if this issue is still relevant to the latest version of the Dify repository by commenting here.
  • If there is no further activity, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot[bot] avatar Nov 23 '24 16:11 dosubot[bot]

This is my solution: disable the data in this table!

    CREATE OR REPLACE FUNCTION delete_on_insert() RETURNS TRIGGER AS $$
    BEGIN
        -- delete the newly inserted record
        DELETE FROM dataset_keyword_tables WHERE id = NEW.id;
        -- return NULL to indicate the trigger has handled the event and no further processing is needed
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER before_insert_delete
        BEFORE INSERT ON dataset_keyword_tables
        FOR EACH ROW EXECUTE FUNCTION delete_on_insert();

    -- confirm the trigger is installed
    SELECT tgname AS trigger_name, pg_get_triggerdef(oid) AS trigger_definition
    FROM pg_trigger
    WHERE tgrelid = 'dataset_keyword_tables'::regclass;

glacierck avatar Jan 22 '25 09:01 glacierck
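
If anyone adopts the trigger workaround above and later wants to restore the default behavior, the trigger and its function can be removed again (standard PostgreSQL DDL, not part of the original comment). Note that while the trigger is active no keyword table is stored at all, so keyword-based retrieval for the affected datasets presumably will not work:

    -- Revert the workaround: remove the trigger and its function
    DROP TRIGGER IF EXISTS before_insert_delete ON dataset_keyword_tables;
    DROP FUNCTION IF EXISTS delete_on_insert();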