
[Feature Request]: remove the duplicate data

Open zyu opened this issue 1 year ago • 4 comments

Describe the problem

I want to use chroma to remove duplicate data. My dataset contains a lot of duplicates, but looking at the interface I couldn't find any way to deduplicate, so there is no way for me to remove the duplicate data.

Describe the proposed solution

remove the duplicate data

Alternatives considered

remove the duplicate data

Importance

would make my life easier

Additional Information

No response

zyu avatar Mar 29 '24 14:03 zyu

I actually worked around this by writing my own filter function in the code where I retrieve my top-k docs.

def filter_top_k_docs(
    top_k_docs: dict
) -> dict:
    """
    Filter duplicate entries out of a chromadb query result.

    Args:
        top_k_docs (dict): A dictionary containing the fields of the query
            result (ids, embeddings, documents, metadatas, ...)

    Returns:
        dict: The query result with duplicate rows removed from every field
    """

    embeddings = top_k_docs['embeddings'][0]
    seen_embeddings = set()
    new_indices = []

    # Keep only the first occurrence of each distinct embedding.
    for idx, embedding in enumerate(embeddings):
        embedding_tuple = tuple(embedding)
        if embedding_tuple not in seen_embeddings:
            seen_embeddings.add(embedding_tuple)
            new_indices.append(idx)

    keep = set(new_indices)  # set for O(1) membership checks below
    for key in top_k_docs:
        if top_k_docs[key] is not None:
            top_k_docs[key] = [
                [item for idx, item in enumerate(top_k_docs[key][0]) if idx in keep]
            ]

    return top_k_docs
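As a sanity check, the same dedup logic can be exercised on a small mocked query result (all values below are invented; the function is restated compactly so the snippet runs standalone):

```python
def filter_top_k_docs(top_k_docs: dict) -> dict:
    """Drop duplicate rows from a chromadb-style query result, keyed on embeddings."""
    seen, keep = set(), []
    for idx, embedding in enumerate(top_k_docs["embeddings"][0]):
        key = tuple(embedding)
        if key not in seen:
            seen.add(key)
            keep.append(idx)
    keep_set = set(keep)
    for field, value in top_k_docs.items():
        # Only filter fields shaped like the query result: a list of lists.
        if isinstance(value, list) and value and isinstance(value[0], list):
            top_k_docs[field] = [
                [item for idx, item in enumerate(value[0]) if idx in keep_set]
            ]
    return top_k_docs

# Mocked query result where rows "a" and "b" share an embedding.
mock = {
    "ids": [["a", "b", "c"]],
    "embeddings": [[[0.1, 0.2], [0.1, 0.2], [0.3, 0.4]]],
    "documents": [["doc A", "doc A again", "doc B"]],
    "metadatas": None,
}
print(filter_top_k_docs(mock)["ids"])  # [['a', 'c']]
```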

ceyhuncakir avatar Apr 01 '24 10:04 ceyhuncakir

Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/fastapi.py", line 652, in raise_chroma_error
    resp.raise_for_status()
  File "/opt/miniconda3/lib/python3.12/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:8090/api/v1/collections/183b4fe9-b24a-4136-ad77-0943677de6a5/get

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/zyu/code/repeat/updatetime.py", line 38
    r = collection.get(
        ^^^^^^^^^^^^^^^
  File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/models/Collection.py", line 211, in get
    get_results = self._client._get(
                  ^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 127, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/fastapi.py", line 436, in _get
    raise_chroma_error(resp)
  File "/opt/miniconda3/lib/python3.12/site-packages/chromadb/api/fastapi.py", line 654, in raise_chroma_error
    raise (Exception(resp.text))
Exception: {"error":"IndexError('list assignment index out of range')"}

zyu avatar Apr 07 '24 03:04 zyu

I don't know if it's still relevant, but can I see the code?

ceyhuncakir avatar Apr 15 '24 15:04 ceyhuncakir

@zyu, how are you tracking your data in Chroma? Is that via the document ID or metadata? I think it is essential to understand how you make sure that content is a duplicate of something previously added.

tazarov avatar Apr 16 '24 05:04 tazarov

I've been using the hash code of a document as its ID and leveraging the upsert function to add or update documents in the vector store. This approach effectively removes duplicate documents, as only one instance of each unique document is stored.
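The hash-as-ID pattern described above can be sketched like this (`doc_id` is a name I made up; the `collection.upsert` call is commented out because it needs a running Chroma instance):

```python
import hashlib

def doc_id(text: str) -> str:
    """Derive a stable ID from the document content itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

docs = ["alpha", "beta", "alpha"]  # "alpha" appears twice
ids = [doc_id(d) for d in docs]

# Because duplicates share an ID, an upsert keyed on these IDs stores only
# one row per distinct document, e.g.:
#   collection.upsert(ids=ids, documents=docs)

assert ids[0] == ids[2]      # duplicate content -> same ID
assert len(set(ids)) == 2    # only two distinct rows would be stored
```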

However, I've encountered significant challenges when it comes to updating or deleting documents. Here’s a detailed breakdown of the issue:

Scenario:

- Document duplication across multiple files: suppose a specific document (or chunk) is duplicated across several files.
- Single instance in the vector store: using the hash-based ID and upsert method, only one instance of this document is added to the vector store, regardless of how many files it appears in.
- Maintaining file references: ideally, I need to maintain a list of the files that reference this document, so I can track all files that contain it.

Problems:

- Update issues: when a document needs to be updated, there is no straightforward way to identify all the files that reference it, which complicates keeping the document's associations accurate.
- Delete issues: if I want to delete a specific file and all documents originating from that file, I run into difficulties. Since only one instance of the document exists in the vector store, deleting it could inadvertently remove it from other files that also reference it.

This challenge is particularly difficult because the metadata doesn't support lists or sets. A workaround is to store lists as strings and then convert these strings back to lists every time I .get() the data. However, this workaround necessitates fetching all the data upfront, which makes the process unnecessarily cumbersome and inefficient.
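The list-as-string workaround might look like this (the `source_files` field name is illustrative, not part of any Chroma schema):

```python
import json

# Chroma metadata values must be scalars, so serialize the list of
# referencing files into a JSON string before storing it.
meta = {"source_files": json.dumps(["a.txt", "b.txt"])}

# ...later, after fetching this metadata back with .get(),
# decode the string, update the list, and re-encode it.
files = json.loads(meta["source_files"])
files.append("c.txt")
meta["source_files"] = json.dumps(files)

print(json.loads(meta["source_files"]))  # ['a.txt', 'b.txt', 'c.txt']
```

The cost, as noted above, is that every read/modify/write of the reference list requires fetching and re-parsing the metadata, which is what makes this approach cumbersome at scale.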

syshin0116 avatar May 31 '24 07:05 syshin0116