
Add check to prevent adding duplicate chunks in create_chunks method

Open · TrentConley opened this issue Jul 06 '23 · 1 comment

This PR solves the issue in #64 without affecting vector db functionality.

Added a check in the create_chunks method of the BaseChunker class to prevent adding duplicate chunks. The check compares the chunk ID, generated with hashlib.sha256, against the existing ids list before appending the chunk to documents, so only unique chunks are added and duplicates are skipped. This improves the efficiency and accuracy of chunk processing in create_chunks.
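The described check can be shown as a minimal, self-contained sketch; the function name and sample inputs here are illustrative, not taken from the PR:

```python
import hashlib

def create_chunks_sketch(chunks, url, meta_data):
    """Illustrative stand-alone version of the dedup check described above."""
    documents, ids, metadatas = [], [], []
    for chunk in chunks:
        # The ID is the SHA-256 of chunk text + source URL, so identical
        # chunks from the same URL always hash to the same ID.
        chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
        if chunk_id not in ids:  # skip chunks we have already seen
            ids.append(chunk_id)
            documents.append(chunk)
            metadatas.append(meta_data)
    return documents, ids, metadatas

# "hello" appears twice in the input but survives only once in the output
docs, ids, metas = create_chunks_sketch(
    ["hello", "world", "hello"], "https://example.com", {"url": "https://example.com"}
)
```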

TrentConley avatar Jul 06 '23 17:07 TrentConley

I added the line

if chunk_id not in ids:

to check that a chunk does not already exist. SHA-256 collisions are vanishingly unlikely, so we can treat each hash as uniquely identifying its chunk. The code in chunkers/base_chunker:

for chunk in chunks:
    chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
    if chunk_id not in ids:
        ids.append(chunk_id)
        documents.append(chunk)
        metadatas.append(meta_data)

TrentConley avatar Jul 06 '23 17:07 TrentConley

hey @TrentConley , thanks. let me get back on this tomorrow.

taranjeet avatar Jul 06 '23 19:07 taranjeet

This makes total sense. If the hash is already in the database, there's no reason to try to re-add it. Green light from me.

cachho avatar Jul 06 '23 19:07 cachho

Improved the runtime by using a map: https://github.com/embedchain/embedchain/pull/160

Also, as written this causes the if not data_dict: check to never evaluate to true, exposing a bug with the zipped data_dict: the metadata needs to be cast from its tuple form. Also fixed in #160.
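The two follow-up fixes can be sketched as below; the names and data are illustrative, not the exact code from #160. The first part swaps the O(n) list scan for O(1) set membership; the second shows why metadata coming out of a zipped dict needs an explicit cast:

```python
import hashlib

def create_chunks_fast(chunks, url, meta_data):
    """Sketch of the dedup loop with O(1) lookups (illustrative names)."""
    documents, ids, metadatas = [], [], []
    seen = set()  # set membership is O(1), vs O(n) for scanning a list
    for chunk in chunks:
        chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
        if chunk_id not in seen:
            seen.add(chunk_id)
            ids.append(chunk_id)
            documents.append(chunk)
            metadatas.append(meta_data)
    return documents, ids, metadatas

# The zip pitfall: unzipping a dict of (doc, meta) pairs yields tuples,
# so downstream code that expects lists must cast explicitly.
data_dict = {"id1": ("doc1", {"url": "u"}), "id2": ("doc2", {"url": "u"})}
docs, metas = zip(*data_dict.values())  # both docs and metas are tuples here
metas = list(metas)                     # cast the metadata back to a list
```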

Harin329 avatar Jul 06 '23 21:07 Harin329

Closing this, #160 is the main place this is tracked.

cachho avatar Jul 07 '23 10:07 cachho