Fix cache eviction and soft delete using MySQL and Milvus

Open powerli2002 opened this issue 1 year ago • 0 comments

Abstract:

Implemented soft delete and hard delete in MySQL.
Implemented a cache eviction strategy using MySQL and Milvus.

Problems Solved:

Multiple methods were not implemented, causing issues when applying the cache eviction strategy.
The mark_deleted method did not perform a soft delete but instead directly deleted the data by mark_id.
The cache_codegpt_answer table in the database did not have an is_deleted field, making soft deletes impossible.
The provided table creation statement applied only to SQLite (modelcache_llm_answer), while the MySQL code used the cache_codegpt_answer table.
The self._vector_base.delete method did not specify a model, so entries in the corresponding table could not be deleted during cache eviction.

Modifications:

Implemented true soft delete:
- Renamed and added fields to the table in the database.
- In modelcache\manager\scalar_data\sql_storage.py:
  - Implemented the mark_deleted method to set the is_deleted field to -1 (pending deletion).
  - Implemented the clear_deleted_data method.
  - Implemented the count method.
Database modifications:
- (1) Renamed the table in reference_doc\create_table.sql from modelcache_llm_answer to cache_codegpt_answer (alternatively, change table_name in the code).
- (2) Added the is_deleted field to cache_codegpt_answer, with -1 for pending deletion and 0 for not deleted (consistent with GPTCache).
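As a rough illustration of the soft-delete behaviour described above, here is a minimal sketch. It is not the code from sql_storage.py: it uses an in-memory SQLite database as a stand-in for MySQL, a reduced set of columns (question and answer are hypothetical), and an assumed count(state, is_all) signature. Only the table name, the is_deleted field, and the -1 / 0 values come from this issue.

```python
import sqlite3

# Values taken from the issue text: -1 = pending deletion, 0 = not deleted.
PENDING_DELETION = -1
NOT_DELETED = 0

# Assumption: SQLite in memory stands in for MySQL; columns are simplified.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cache_codegpt_answer ("
    " id INTEGER PRIMARY KEY,"
    " question TEXT NOT NULL,"
    " answer TEXT NOT NULL,"
    " is_deleted INTEGER NOT NULL DEFAULT 0)"
)


def mark_deleted(ids):
    """Soft delete: flag the rows instead of removing them."""
    if not ids:
        return
    placeholders = ",".join("?" for _ in ids)
    conn.execute(
        f"UPDATE cache_codegpt_answer SET is_deleted = ? WHERE id IN ({placeholders})",
        [PENDING_DELETION, *ids],
    )
    conn.commit()


def clear_deleted_data():
    """Hard delete: physically remove every row flagged as pending deletion."""
    conn.execute(
        "DELETE FROM cache_codegpt_answer WHERE is_deleted = ?",
        (PENDING_DELETION,),
    )
    conn.commit()


def count(state=NOT_DELETED, is_all=False):
    """Count rows in a given state, or all rows when is_all is True."""
    if is_all:
        row = conn.execute("SELECT COUNT(*) FROM cache_codegpt_answer").fetchone()
    else:
        row = conn.execute(
            "SELECT COUNT(*) FROM cache_codegpt_answer WHERE is_deleted = ?",
            (state,),
        ).fetchone()
    return row[0]


# Usage: three rows, two of them soft-deleted, then purged.
conn.executemany(
    "INSERT INTO cache_codegpt_answer (question, answer) VALUES (?, ?)",
    [("q1", "a1"), ("q2", "a2"), ("q3", "a3")],
)
conn.commit()
mark_deleted([1, 2])
live, pending = count(), count(state=PENDING_DELETION)
clear_deleted_data()
remaining = count(is_all=True)
```

After mark_deleted the two rows are still present but flagged; they only disappear from the table once clear_deleted_data runs.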
Implemented the cache eviction strategy, defaulting to LRU:
- In modelcache\manager\eviction_manager.py, added a model parameter to the delete method to enable deletion of corresponding IDs in Milvus.

Because I implemented these methods only for MySQL, I did not wire the cache eviction strategy into data_manager.py directly. To use the cache eviction strategy (MySQL + Milvus), add the following to modelcache\manager\data_manager.py:

class SSDataManager(DataManager):
    def __init__(
        self,
        s: CacheStorage,
        v: VectorBase,
        o: Optional[ObjectBase],
        e: Optional[EvictionBase],
        max_size,
        clean_size,
        policy="LRU",
    ):
        self.max_size = max_size
        self.clean_size = clean_size
        self.s = s
        self.v = v
        self.o = o
        self.eviction_manager = EvictionManager(self.s, self.v)
        if e is None:
            e = EvictionBase(name="memory",
                             maxsize=max_size,
                             clean_size=clean_size,
                             policy=policy,
                             on_evict=self._clear)
        self.eviction_base = e
        self.model = None

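    def _clear(self, marked_keys):
        # Soft delete: flag the evicted keys as pending deletion.
        self.eviction_manager.soft_evict(marked_keys)
        # Hard delete: purge flagged rows for the current model when check_evict() says it is due.
        if self.eviction_manager.check_evict():
            self.eviction_manager.delete(self.model)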

    def save(self, system_sentence, sys_embedding_data, question, answer, embedding_data, **kwargs):
        self.model = kwargs.pop("model", None)
        self.import_data([system_sentence], [sys_embedding_data], [question], [answer], [embedding_data], self.model)

self.model tells the data manager which model's table is being processed during insertion. The code follows GPTCache and tries to stay functionally consistent with it.
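The two-phase flow in _clear (mark first, purge only when check_evict allows it) can be sketched in isolation as follows. This is a toy model, not the EvictionManager from the repository: the stores are in-memory, the class names are hypothetical, and the 0.5 threshold is an arbitrary assumption chosen to keep the example small.

```python
class ToyScalarStore:
    """In-memory stand-in for the MySQL table: id -> is_deleted flag."""

    def __init__(self):
        self.rows = {}

    def insert(self, key):
        self.rows[key] = 0  # 0 = not deleted

    def mark_deleted(self, keys):
        for key in keys:
            if key in self.rows:
                self.rows[key] = -1  # -1 = pending deletion

    def pending_ids(self):
        return [key for key, flag in self.rows.items() if flag == -1]

    def clear_deleted_data(self):
        self.rows = {k: f for k, f in self.rows.items() if f != -1}

    def count(self, state=0, is_all=False):
        if is_all:
            return len(self.rows)
        return sum(1 for flag in self.rows.values() if flag == state)


class ToyVectorStore:
    """In-memory stand-in for the vector store, one id set per model."""

    def __init__(self):
        self.by_model = {}

    def add(self, model, key):
        self.by_model.setdefault(model, set()).add(key)

    def delete(self, ids, model):
        self.by_model.get(model, set()).difference_update(ids)


class ToyEvictionManager:
    MAX_MARK_RATE = 0.5  # hypothetical threshold, not taken from the project

    def __init__(self, scalar, vector):
        self.scalar = scalar
        self.vector = vector

    def soft_evict(self, marked_keys):
        self.scalar.mark_deleted(marked_keys)

    def check_evict(self):
        total = self.scalar.count(is_all=True)
        pending = self.scalar.count(state=-1)
        return total > 0 and pending / total > self.MAX_MARK_RATE

    def delete(self, model):
        # Remove pending ids from the model's vectors, then purge the rows.
        self.vector.delete(self.scalar.pending_ids(), model=model)
        self.scalar.clear_deleted_data()


scalar, vector = ToyScalarStore(), ToyVectorStore()
manager = ToyEvictionManager(scalar, vector)
for key in (1, 2, 3, 4):
    scalar.insert(key)
    vector.add("m", key)

manager.soft_evict([1])
due_after_one = manager.check_evict()    # 1 of 4 rows pending
manager.soft_evict([2, 3])
due_after_three = manager.check_evict()  # 3 of 4 rows pending
if due_after_three:
    manager.delete("m")
```

Deferring the hard delete this way keeps each eviction cheap (a flag update) and batches the more expensive row and vector removal.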

I have verified this approach locally: both the eviction strategy and the soft delete work as expected.

Please feel free to contact me if there are any issues with my changes.

Jun 24 '24 09:06 powerli2002