Add check to prevent adding duplicate chunks in create_chunks method
This PR solves the issue in #64 without affecting vector db functionality.
Added a check in the `create_chunks` method of the `BaseChunker` class to prevent adding duplicate chunks. The check compares the chunk ID, generated with `hashlib.sha256`, against the existing `ids` list before appending the chunk to `documents`, so only unique chunks make it into the list. This improves the efficiency and accuracy of chunk processing in `create_chunks`.
I added the `if chunk_id not in ids:` line to check that a chunk does not already exist. SHA-256 collisions are vanishingly unlikely, so we can treat the hash as unique per chunk. Code under `chunkers/base_chunker`:
```python
import hashlib  # imported at the top of chunkers/base_chunker

for chunk in chunks:
    # Hash the chunk together with its source URL to get a stable, deterministic ID.
    chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
    # Only record chunks whose ID has not been seen yet, so duplicates are dropped.
    if chunk_id not in ids:
        ids.append(chunk_id)
        documents.append(chunk)
        metadatas.append(meta_data)
```
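For a quick sanity check, here is the same logic as a standalone run with hypothetical inputs (the strings and URL below are illustrative, not taken from embedchain):

```python
import hashlib

url = "https://example.com/page"  # hypothetical source URL
ids, documents = [], []

for chunk in ["hello world", "hello world", "goodbye"]:
    chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
    if chunk_id not in ids:
        ids.append(chunk_id)
        documents.append(chunk)

print(documents)  # ['hello world', 'goodbye'] -- the duplicate chunk was skipped
```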
Hey @TrentConley, thanks. Let me get back to you on this tomorrow.
This makes total sense. If the hash is already in the database, there's no reason to try to re-add it. Green light from me.
Improved the runtime by using a map (constant-time lookups instead of a linear scan of the `ids` list): https://github.com/embedchain/embedchain/pull/160
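For context, membership tests like `chunk_id not in ids` scan the whole list (O(n) per chunk), while a map or set gives O(1) average lookups. A minimal sketch of the idea, not necessarily the exact code in #160; the sample inputs are hypothetical:

```python
import hashlib

# Hypothetical inputs standing in for the chunker's real arguments.
chunks = ["alpha", "beta", "alpha"]
url = "https://example.com"
meta_data = {"url": url}

seen = set()  # set membership is O(1) on average, vs O(n) for a list
ids, documents, metadatas = [], [], []

for chunk in chunks:
    chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
    if chunk_id not in seen:
        seen.add(chunk_id)
        ids.append(chunk_id)
        documents.append(chunk)
        metadatas.append(meta_data)
```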
Also, as is, this will cause `if not data_dict:` to never evaluate to true, exposing a bug with the `zip`-built `data_dict`. The metadata needs to be cast from its tuple form. Also fixed in #160.
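For context on the `zip` bug: a raw `zip` object is always truthy, even when it yields nothing, and unzipping with `zip(*...)` hands back tuples. A minimal illustration; the `data_dict` shape here is assumed for the sketch, not copied from the repo:

```python
ids, documents, metadatas = [], [], []  # everything filtered out as duplicates

# A zip object is truthy even when empty, so `if not data_dict:` never fires.
data_dict = zip(ids, documents, metadatas)
print(bool(data_dict))  # True

# Materializing into a dict makes the emptiness check behave as intended.
data_dict = {i: (d, m) for i, d, m in zip(ids, documents, metadatas)}
print(bool(data_dict))  # False

# With data present, zip(*...) returns tuples; cast back if lists are required.
data_dict = {"id1": ("doc1", {"url": "u"})}
documents, metadatas = zip(*data_dict.values())
metadatas = list(metadatas)  # cast the metadata out of its tuple form
```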
Closing this; #160 is the main place this is tracked.