                        [Feature Request]: remove the duplicate data
Describe the problem
I want to use Chroma, but my data contains a lot of duplicates. I looked at the interface and there is no way to deduplicate, so I have no way to remove the duplicate data.
Describe the proposed solution
remove the duplicate data
Alternatives considered
remove the duplicate data
Importance
would make my life easier
Additional Information
No response
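In the meantime, one way to approach this today is to deduplicate the collection manually. Below is a minimal sketch, assuming "duplicate" means identical document text; the path, collection name, and hashing choice are illustrative, not a built-in Chroma feature.

import hashlib
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")      # placeholder path
collection = client.get_or_create_collection(name="my_docs")  # placeholder name

# Fetch all documents (ids are always returned) and keep only the first
# occurrence of each distinct document text.
records = collection.get(include=["documents"])
seen = set()
duplicate_ids = []
for doc_id, doc in zip(records["ids"], records["documents"]):
    digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
    if digest in seen:
        duplicate_ids.append(doc_id)
    else:
        seen.add(digest)

if duplicate_ids:
    collection.delete(ids=duplicate_ids)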
I actually fixed this by writing my own filter function in the code where I retrieve my top-k docs.
def filter_top_k_docs(top_k_docs: dict) -> dict:
    """
    Filter duplicate entries out of chromadb query results.

    Entries are treated as duplicates when their embeddings are identical;
    only the first occurrence of each embedding is kept.

    Args:
        top_k_docs (dict): The dictionary returned by collection.query()
            (keys such as 'ids', 'embeddings', 'documents', 'metadatas', 'distances').
    Returns:
        dict: The same dictionary with duplicate entries removed from each field.
    """
    # Requires the query to have been run with embeddings included,
    # otherwise top_k_docs['embeddings'] is None.
    embeddings = top_k_docs['embeddings'][0]
    seen_embeddings = set()
    keep_indices = set()
    for idx, embedding in enumerate(embeddings):
        embedding_tuple = tuple(embedding)
        if embedding_tuple not in seen_embeddings:
            seen_embeddings.add(embedding_tuple)
            keep_indices.add(idx)

    for key, value in top_k_docs.items():
        # Only filter fields shaped like the per-query result lists ([[...]]);
        # skip anything that is None or not a list of lists (e.g. 'included').
        if isinstance(value, list) and value and isinstance(value[0], list):
            top_k_docs[key] = [
                [item for idx, item in enumerate(value[0]) if idx in keep_indices]
            ]
    return top_k_docs
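For reference, a hypothetical usage sketch (the collection and query text are placeholders, not from the original comment). One detail worth noting: collection.query() does not return embeddings by default, so they have to be requested via include for the filter to have anything to deduplicate on.

results = collection.query(
    query_texts=["some query"],
    n_results=10,
    include=["documents", "metadatas", "distances", "embeddings"],
)
deduped = filter_top_k_docs(results)
print(deduped["documents"][0])  # documents with duplicate embeddings dropped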
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/fastapi.py", line 652, in raise_chroma_error
    resp.raise_for_status()
  File "/opt/miniconda3/lib/python3.12/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:8090/api/v1/collections/183b4fe9-b24a-4136-ad77-0943677de6a5/get

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/zyu/code/repeat/updatetime.py", line 38, in
I don't know if it's still relevant, but can I see the code?
@zyu, how are you tracking your data in Chroma? Is that via the document ID or metadata? I think it is essential to understand how you make sure that content is a duplicate of something previously added.
I've been using the hash code of a document as its ID and leveraging the upsert function to add or update documents in the vector store. This approach effectively removes duplicate documents, as only one instance of each unique document is stored.
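For illustration, here is a minimal sketch of that hash-as-ID plus upsert approach (the client setup, collection name, and add_chunk helper are hypothetical, not the commenter's actual code):

import hashlib
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(name="my_docs")

def add_chunk(chunk: str, source_file: str) -> None:
    # The hash of the chunk text is the document ID, so re-adding the same
    # chunk (from this or any other file) overwrites the single stored copy
    # instead of creating a duplicate entry.
    doc_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    collection.upsert(
        ids=[doc_id],
        documents=[chunk],
        metadatas=[{"source": source_file}],
    )

Note that in a sketch like this, each upsert also replaces the stored metadata, so only the last file's "source" survives; that is exactly where the file-reference problem described below comes from.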
However, I've encountered significant challenges when it comes to updating or deleting documents. Here’s a detailed breakdown of the issue:
Scenario:
- Document Duplication Across Multiple Files: Suppose a specific document (or chunk) is duplicated across several files.
- Single Instance in Vector Store: Using the hash-based ID and upsert method, only one instance of this document is added to the vector store, regardless of how many files it appears in.
- Maintaining File References: Ideally, I need to maintain a list of files that reference this document. This list would allow me to track all files that contain the document.
Problems:
- Update Issues: When a document needs to be updated, there is no straightforward way to identify all the files that reference the document. This complicates ensuring that the document's associations remain accurate.
- Delete Issues: If I want to delete a specific file and all documents originating from that file, I face difficulties. Since only one instance of the document exists in the vector store, deleting the document could inadvertently remove it from other files that also reference it.
This challenge is particularly difficult because the metadata doesn't support lists or sets. A workaround is to store lists as strings and then convert these strings back to lists every time I .get() the data. However, this workaround necessitates fetching all the data upfront, which makes the process unnecessarily cumbersome and inefficient.
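A sketch of that string-encoding workaround, assuming the hash-based IDs from the sketch above; the "source_files" metadata key and the helper name are hypothetical:

import hashlib
import json
import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection(name="my_docs")

def add_chunk_with_refs(chunk: str, source_file: str) -> None:
    doc_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    # Read back any existing record so the file list can be extended.
    existing = collection.get(ids=[doc_id], include=["metadatas"])
    refs = []
    if existing["ids"]:
        meta = existing["metadatas"][0] or {}
        refs = json.loads(meta.get("source_files", "[]"))
    if source_file not in refs:
        refs.append(source_file)
    # Chroma metadata values must be scalars, so the list is stored as JSON
    # text and has to be parsed again on every read.
    collection.upsert(
        ids=[doc_id],
        documents=[chunk],
        metadatas=[{"source_files": json.dumps(refs)}],
    )

Deleting everything that came from one file would still require fetching and parsing every record's "source_files" string, which is the inefficiency described above.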