chroma icon indicating copy to clipboard operation
chroma copied to clipboard

using pickle have some limitations due to operating system and Python version dependencies.

Open sergerdn opened this issue 1 year ago • 0 comments

I believe that using pickle may not always be the best option in certain cases due to its dependencies on the operating system and Python version. For instance, if I upgrade my Python version, it may no longer work properly. Similarly, if I create an object on my Mac or Windows computer, it may not work on my Linux server.

I have seen many topics on Stack Overflow and many other sources where people get confused because of this. For example here's the latest one -https://github.com/hwchase17/langchain/discussions/1533

https://github.com/chroma-core/chroma/blob/35c59f4db07adc51b48fca60a38633c5547a2cf4/chromadb/db/index/hnswlib.py#LL186C47-L186C71

Using pickle is very insecure and can pose significant security risks: https://docs.python.org/3/library/pickle.html

Warning The pickle module is not secure. Only unpickle data you trust.
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. 
Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

What should be done:

  • Move the saving and restoring functions to a different class.
  • Improve the different class by using some other serialisers.

Creating a separate class for handling the saving and restoring of data is a good approach. It not only allows for easy testing and refactoring but also makes it simple to switch to another serialization method in the future if needed.

Some untested code:

class IndexStorage:
    def __init__(self, save_folder):
        self._save_folder = save_folder

    def save(self, id, index, label_to_id, id_to_label, index_metadata):
        if not os.path.exists(self._save_folder):
            os.makedirs(self._save_folder)

        if index is None:
            return
        index.save_index(f"{self._save_folder}/index_{id}.bin")

        with open(f"{self._save_folder}/id_to_uuid_{id}.pkl", "wb") as f:
            pickle.dump(label_to_id, f, pickle.HIGHEST_PROTOCOL)
        with open(f"{self._save_folder}/uuid_to_id_{id}.pkl", "wb") as f:
            pickle.dump(id_to_label, f, pickle.HIGHEST_PROTOCOL)
        with open(f"{self._save_folder}/index_metadata_{id}.pkl", "wb") as f:
            pickle.dump(index_metadata, f, pickle.HIGHEST_PROTOCOL)

    def load(self, id):
        if not os.path.exists(f"{self._save_folder}/index_{id}.bin"):
            return None, None, None, None

        with open(f"{self._save_folder}/id_to_uuid_{id}.pkl", "rb") as f:
            label_to_id = pickle.load(f)
        with open(f"{self._save_folder}/uuid_to_id_{id}.pkl", "rb") as f:
            id_to_label = pickle.load(f)
        with open(f"{self._save_folder}/index_metadata_{id}.pkl", "rb") as f:
            index_metadata = pickle.load(f)

        p = hnswlib.Index(space=index_metadata['space'], dim=index_metadata["dimensionality"])
        p.load_index(
            f"{self._save_folder}/index_{id}.bin",
            max_elements=index_metadata["elements"],
        )
        return p, label_to_id, id_to_label, index_metadata

Using abstract class:

from abc import ABC, abstractmethod

class AbstractIndexStorage(ABC):
    def __init__(self, save_folder):
        self._save_folder = save_folder

    @abstractmethod
    def save(self, id, index, label_to_id, id_to_label, index_metadata):
        pass

    @abstractmethod
    def load(self, id):
        pass

class IndexStorage(AbstractIndexStorage):
    def save(self, id, index, label_to_id, id_to_label, index_metadata):
        pass

    def load(self, id):
        pass

If you intend to use different class for saving and restoring data, I am able to make a PR for it if you would like.

sergerdn avatar Mar 31 '23 10:03 sergerdn