chroma
chroma copied to clipboard
using pickle have some limitations due to operating system and Python version dependencies.
I believe that using pickle
may not always be the best option in certain cases due to its dependencies on the operating system and Python
version. For instance, if I upgrade my Python
version, it may no longer work properly. Similarly, if I create an object on my Mac
or Windows
computer, it may not work on my Linux
server.
I have seen many topics on Stack Overflow and many other sources where people get confused because of this. For example here's the latest one -https://github.com/hwchase17/langchain/discussions/1533
https://github.com/chroma-core/chroma/blob/35c59f4db07adc51b48fca60a38633c5547a2cf4/chromadb/db/index/hnswlib.py#LL186C47-L186C71
Using pickle
is very insecure and can pose significant security risks:
https://docs.python.org/3/library/pickle.html
Warning The pickle module is not secure. Only unpickle data you trust.
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling.
Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
What should be done:
- Move the saving and restoring functions to a different class.
- Improve the different class by using some other serialisers.
Creating a separate class for handling the saving and restoring of data is a good approach. It not only allows for easy testing and refactoring but also makes it simple to switch to another serialization method in the future if needed.
Some untested code:
class IndexStorage:
def __init__(self, save_folder):
self._save_folder = save_folder
def save(self, id, index, label_to_id, id_to_label, index_metadata):
if not os.path.exists(self._save_folder):
os.makedirs(self._save_folder)
if index is None:
return
index.save_index(f"{self._save_folder}/index_{id}.bin")
with open(f"{self._save_folder}/id_to_uuid_{id}.pkl", "wb") as f:
pickle.dump(label_to_id, f, pickle.HIGHEST_PROTOCOL)
with open(f"{self._save_folder}/uuid_to_id_{id}.pkl", "wb") as f:
pickle.dump(id_to_label, f, pickle.HIGHEST_PROTOCOL)
with open(f"{self._save_folder}/index_metadata_{id}.pkl", "wb") as f:
pickle.dump(index_metadata, f, pickle.HIGHEST_PROTOCOL)
def load(self, id):
if not os.path.exists(f"{self._save_folder}/index_{id}.bin"):
return None, None, None, None
with open(f"{self._save_folder}/id_to_uuid_{id}.pkl", "rb") as f:
label_to_id = pickle.load(f)
with open(f"{self._save_folder}/uuid_to_id_{id}.pkl", "rb") as f:
id_to_label = pickle.load(f)
with open(f"{self._save_folder}/index_metadata_{id}.pkl", "rb") as f:
index_metadata = pickle.load(f)
p = hnswlib.Index(space=index_metadata['space'], dim=index_metadata["dimensionality"])
p.load_index(
f"{self._save_folder}/index_{id}.bin",
max_elements=index_metadata["elements"],
)
return p, label_to_id, id_to_label, index_metadata
Using abstract class:
from abc import ABC, abstractmethod
class AbstractIndexStorage(ABC):
def __init__(self, save_folder):
self._save_folder = save_folder
@abstractmethod
def save(self, id, index, label_to_id, id_to_label, index_metadata):
pass
@abstractmethod
def load(self, id):
pass
class IndexStorage(AbstractIndexStorage):
def save(self, id, index, label_to_id, id_to_label, index_metadata):
pass
def load(self, id):
pass
If you intend to use different class for saving and restoring data, I am able to make a PR for it if you would like.