
[Bug]: Cosine distance greater than 1 in query distances

Open AkiRusProd opened this issue 2 years ago • 2 comments

What happened?

Hi, I changed the default model to sup-simcse-roberta-large from Hugging Face Transformers and hit a strange bug: the returned query distance is far greater than 1. (The embedding dimension of this model is 1024.)

Custom model:

import torch
from transformers import AutoModel, AutoTokenizer

class HFEmbedder(BaseEmbedder):
    def __init__(self, model = 'princeton-nlp/sup-simcse-roberta-large'):
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model = AutoModel.from_pretrained(model).to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model)

    def get_embeddings(self, texts):
        if isinstance(texts, str):
            texts = [texts]

        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(self.device)

        with torch.no_grad():
            # pooler_output is one 1024-dim vector per input text for this model
            embeddings = self.model(**inputs, output_hidden_states=True, return_dict=True).pooler_output.detach().cpu().numpy().tolist()

        return embeddings

    def __call__(self, text):
        return self.get_embeddings(text)

DB:

import datetime
import uuid

import chromadb

class CollectionOperator():
    def __init__(self, collection_name, db_path = DB_PATH, embedder: BaseEmbedder = None):
        self.embedder = embedder
        self.client = chromadb.PersistentClient(path = db_path)
        self.collection = self.client.get_or_create_collection(name = collection_name, embedding_function = self.embedder.get_embeddings)

    def add(self, text, metadata = None):
        # avoid a shared mutable default dict
        metadata = dict(metadata or {})
        metadata['timestamp'] = str(datetime.datetime.now())

        self.collection.add(
            documents = [text],
            metadatas = [metadata],
            ids = [str(uuid.uuid4())]
        )

    def delete(self, id):
        self.collection.delete(ids = [id])

    def query(self, query, n_results, return_text = True):
        results = self.collection.query(
            query_texts = query,
            n_results = n_results,
        )

        if return_text:
            return results['documents'][0]
        else:
            return results



collection_operator = CollectionOperator("total-memory-1", embedder = HFEmbedder())
collection_operator.add("What is a memory?")
results = collection_operator.query("Memory refers to the psychological processes of  storing information", 1, return_text = False)
print(results)

Result:

{'ids': [['0ba68b19-67f7-4909-8c25-6c7f9443ee1b']], 'distances': [[95.90450629074704]], 'metadatas': [[{'timestamp': '2023-11-08 00:34:52.222355'}]], 'embeddings': None, 'documents': [['What is a memory?']]}

What could be wrong? I think 95.905... is not ok...

Default chromadb cosine function returns normal results:

from chromadb.utils.distance_functions import cosine

texts = [
    "What is a memory?",
    "Memory refers to the psychological processes of  storing information"
]

embedder = HFEmbedder()
embeddings=embedder(texts)
print(1 - cosine(embeddings[0], embeddings[1]))

0.770159841082886

Versions

Chroma v: 0.4.15; Python v: 3.10.11

Relevant log output

No response

AkiRusProd avatar Nov 07 '23 21:11 AkiRusProd

I suspect your vectors are not normalized but you are using cosine distance; for performance reasons we don't normalize the vectors for you in the index. The utils distance function is mostly for testing, so it does normalize the vectors.

Can you try normalizing the vectors?
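
For example, if the stored vectors are not unit length, a cosine-style distance of the form 1 - dot(a, b) can land far outside the usual [0, 2] range. A minimal sketch with made-up vectors (assuming numpy; this is only an illustration, not necessarily Chroma's exact internal code):

import numpy as np

# made-up 3-d vectors standing in for unnormalized embeddings
a = np.array([3.0, -4.0, 0.0])
b = np.array([-6.0, 8.0, 5.0])

# without normalization, 1 - dot(a, b) scales with vector magnitude
print(1 - a @ b)  # 51.0, far outside the expected [0, 2] range

# after L2-normalizing, the distance falls back into [0, 2]
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print(1 - a_n @ b_n)  # ~1.894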

HammadB avatar Nov 07 '23 22:11 HammadB

I tried normalizing, and it seems to work. Here is the result:

{'ids': [['8b08e6ad-cc11-48c1-8b0e-7e9b5357f79b']], 'distances': [[0.45968019570383634]], 'metadatas': [[{'timestamp': '2023-11-08 01:49:49.344148'}]], 'embeddings': None, 'documents': [['What is a memory?']]}

But I have 2 questions:

  1. Why do utils.cosine and the cosine distance in the DB return different results: 0.7701598400906307 for utils and 0.45968019570383634 for the DB?
  2. When I use sentence-transformers/all-MiniLM-L6-v2 in my HFEmbedder class, the distance function works correctly without normalizing. Why is that? I also get different results with and without normalizing: with normalizing, 0.9453292039179582 for utils and 0.10934163888430294 for the DB; without normalizing, 0.9453292035839903 for utils and 0.1784120570530413 for the DB.

Here is the edited get_embeddings method:

    def get_embeddings(self, texts):
        if isinstance(texts, str):
            texts = [texts]

        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(self.device)

        with torch.no_grad():
            embeddings = self.model(**inputs, output_hidden_states=True, return_dict=True).pooler_output.detach().cpu().numpy()

        # L2-normalize each embedding (requires numpy imported as np) so cosine distance stays in [0, 2]
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        normalized_embeddings = embeddings / norms

        # convert to list
        normalized_embeddings_list = normalized_embeddings.tolist()
        return normalized_embeddings_list
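
As a quick sanity check (illustrative only, assuming the HFEmbedder above and numpy), each returned vector should now have an L2 norm of about 1:

import numpy as np

embedder = HFEmbedder()
vecs = np.array(embedder(["What is a memory?", "Memory refers to the psychological processes of storing information"]))

# after normalization every embedding should have unit length
print(np.linalg.norm(vecs, axis=1))  # expect values close to 1.0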

AkiRusProd avatar Nov 07 '23 23:11 AkiRusProd