[Bug]: Cosine distance greater than 1 in query distances
What happened?
Hi, I changed the default model to sup-simcse-roberta-large from Hugging Face Transformers and
ran into a strange bug: the reported cosine distance is much greater than 1. (The embedding dimension of this model is 1024.)
Custom model:
import torch
from transformers import AutoModel, AutoTokenizer

class HFEmbedder(BaseEmbedder):
    def __init__(self, model='princeton-nlp/sup-simcse-roberta-large'):
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model = AutoModel.from_pretrained(model).to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model)

    def get_embeddings(self, texts):
        # Accept both a single string and a list of strings
        if isinstance(texts, str):
            texts = [texts]
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(self.device)
        with torch.no_grad():
            # Use the pooler output as the sentence embedding
            embeddings = self.model(**inputs, output_hidden_states=True, return_dict=True).pooler_output.detach().cpu().numpy().tolist()
        return embeddings

    def __call__(self, text):
        return self.get_embeddings(text)
DB:
import datetime
import uuid

import chromadb

class CollectionOperator():
    def __init__(self, collection_name, db_path=DB_PATH, embedder: BaseEmbedder = None):
        self.embedder = embedder
        self.client = chromadb.PersistentClient(path=db_path)
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            embedding_function=self.embedder.get_embeddings
        )

    def add(self, text, metadata=None):
        # Avoid a shared mutable default argument
        metadata = metadata or {}
        metadata['timestamp'] = str(datetime.datetime.now())
        self.collection.add(
            documents=[text],
            metadatas=[metadata],
            ids=[str(uuid.uuid4())]
        )

    def delete(self, id):
        self.collection.delete(ids=[id])

    def query(self, query, n_results, return_text=True):
        results = self.collection.query(
            query_texts=query,
            n_results=n_results,
        )
        if return_text:
            return results['documents'][0]
        else:
            return results
collection_operator = CollectionOperator("total-memory-1", embedder=HFEmbedder())
collection_operator.add("What is a memory?")
results = collection_operator.query("Memory refers to the psychological processes of storing information", 1, return_text=False)
print(results)
Result:
{'ids': [['0ba68b19-67f7-4909-8c25-6c7f9443ee1b']], 'distances': [[95.90450629074704]], 'metadatas': [[{'timestamp': '2023-11-08 00:34:52.222355'}]], 'embeddings': None, 'documents': [['What is a memory?']]}
What could be wrong? A cosine distance of 95.905... does not look right.
The default chromadb cosine function returns sensible results:
from chromadb.utils.distance_functions import cosine

texts = [
    "What is a memory?",
    "Memory refers to the psychological processes of storing information"
]

embedder = HFEmbedder()
embeddings = embedder(texts)
print(1 - cosine(embeddings[0], embeddings[1]))
0.770159841082886
Versions
Chroma v: 0.4.15; Python v: 3.10.11
Relevant log output
No response
I suspect your vectors are not normalized, but you are using cosine distance. For performance reasons we don't normalize the vectors for you in the index. The utils distance function is mostly for testing, so it does normalize the vectors.
Can you try normalizing the vectors?
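Something along these lines should do it (a rough numpy sketch; normalize_l2 is just an illustrative helper, not part of Chroma's API):

import numpy as np

def normalize_l2(embeddings):
    # Scale each embedding vector to unit length (L2 norm of 1)
    vectors = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return (vectors / norms).tolist()

The normalized list can then be returned from the custom embedding function before the documents are added to the collection.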
I tried normalizing the vectors, and it seems to work.
Here is the result:
{'ids': [['8b08e6ad-cc11-48c1-8b0e-7e9b5357f79b']], 'distances': [[0.45968019570383634]], 'metadatas': [[{'timestamp': '2023-11-08 01:49:49.344148'}]], 'embeddings': None, 'documents': [['What is a memory?']]}
But I have 2 questions:
- Why do utils.cosine and the cosine in the db return different results: 0.7701598400906307 for utils and 0.45968019570383634 for the db?
- When I use sentence-transformers/all-MiniLM-L6-v2 in my HFEmbedder class, the distance function works correctly without normalizing; why is this so? I also got different results with and without normalizing: with normalizing, 0.9453292039179582 for utils and 0.10934163888430294 for db; without normalizing, 0.9453292035839903 for utils and 0.1784120570530413 for db.
Here is the edited get_embeddings method:
def get_embeddings(self, texts):
    if isinstance(texts, str):
        texts = [texts]
    inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(self.device)
    with torch.no_grad():
        embeddings = self.model(**inputs, output_hidden_states=True, return_dict=True).pooler_output.detach().cpu().numpy()
    # L2-normalize each embedding before handing it to Chroma (np is numpy, imported as `import numpy as np`)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized_embeddings = embeddings / norms
    # convert to list
    normalized_embeddings_list = normalized_embeddings.tolist()
    return normalized_embeddings_list
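One more thing that may be relevant here: the distance function the index uses is controlled by the hnsw:space key in the collection metadata, and it defaults to l2. If cosine distances are expected from query, the collection can be created roughly like this (a sketch reusing the names from above):

import chromadb

client = chromadb.PersistentClient(path=DB_PATH)
collection = client.get_or_create_collection(
    name="total-memory-1",
    metadata={"hnsw:space": "cosine"},  # default space is "l2"
    embedding_function=HFEmbedder().get_embeddings,
)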