
Preventing duplicates and noise in embeddings

Open drale2k opened this issue 2 years ago • 2 comments

I think this should be discussed, even if it is not yet in scope for langchainrb, as people will inevitably come across this problem. When embedding documents with langchainrb in particular, what is a good strategy to prevent the same documents or strings from being re-added repeatedly?

For a whole document, I think checksums could work (although for big docs the cost of computing a checksum will increase), but what about individual pages of a document or text chunks? Would love some guidance, and maybe later down the road langchain can help with this.
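
The checksum idea also extends to pages and chunks: hash each piece of text before embedding it and skip anything whose hash has already been seen. A minimal sketch, assuming an in-memory set of digests; `ChunkDeduplicator` and its methods are hypothetical names and not part of langchainrb:

```ruby
require "digest"
require "set"

# Hypothetical helper, not part of langchainrb: remembers the SHA-256 digest
# of every chunk it has seen so that repeated chunks can be skipped.
class ChunkDeduplicator
  def initialize
    @seen = Set.new
  end

  # Collapse whitespace first so trivially different copies hash the same.
  def fingerprint(text)
    Digest::SHA256.hexdigest(text.strip.gsub(/\s+/, " "))
  end

  # Returns only the chunks that have not been seen before.
  # Set#add? returns nil when the digest is already present.
  def filter_new(chunks)
    chunks.select { |chunk| @seen.add?(fingerprint(chunk)) }
  end
end

dedup = ChunkDeduplicator.new
first_batch  = dedup.filter_new(["Page one text", "Page two text"])
second_batch = dedup.filter_new(["Page one   text", "Page three text"])
```

Here `second_batch` keeps only "Page three text", because the re-submitted first page hashes to a digest that is already known. Only the surviving chunks would then need to be embedded. An in-memory set is lost on restart, so a real setup would probably persist the digests next to the vectors.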

drale2k avatar Aug 23 '23 14:08 drale2k

It seems this is done through indexing

I wonder if there's a roadmap on porting this feature into langchainrb
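
Assuming "indexing" here refers to the indexing API in the Python LangChain library, the rough idea is to keep a record of content hashes per source, so unchanged content is skipped and content that disappeared from a source can be removed. A minimal sketch of that bookkeeping in Ruby; `IndexRecords`, `plan` and `commit` are hypothetical names for illustration, not an existing langchainrb or LangChain API:

```ruby
require "digest"

# Hypothetical record store: maps a source id to the digests already indexed.
class IndexRecords
  def initialize
    @records = Hash.new { |hash, key| hash[key] = {} }
  end

  # Compares the current chunks of a source with what was indexed before.
  # Returns the chunks that still need embedding and the digests that went stale.
  def plan(source_id, chunks)
    current = chunks.to_h { |chunk| [Digest::SHA256.hexdigest(chunk), chunk] }
    known = @records[source_id]
    to_add = current.reject { |digest, _chunk| known.key?(digest) }
    to_delete = known.keys - current.keys
    { add: to_add, delete: to_delete }
  end

  # Records the outcome once the vector store has been updated.
  def commit(source_id, plan)
    plan[:add].each_key { |digest| @records[source_id][digest] = true }
    plan[:delete].each { |digest| @records[source_id].delete(digest) }
  end
end

records = IndexRecords.new
v1 = records.plan("report.pdf", ["intro", "methods"])
records.commit("report.pdf", v1)
v2 = records.plan("report.pdf", ["intro", "results"])
```

On the second pass, `v2[:add]` contains only the "results" chunk and `v2[:delete]` contains the digest of the removed "methods" chunk, so "intro" is neither re-embedded nor duplicated.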

mengqing avatar Oct 14 '23 05:10 mengqing

Thanks, that's really useful. It would be great to have something like this in langchainrb, at least a basic version to start with, as it is a real PITA to do this manually.

drale2k avatar Feb 03 '24 16:02 drale2k

I'll be frank -- I'd like to rethink the whole data parsing -> chunking -> embedding pipeline first before adding more functionality on top of what's currently there.

andreibondarev avatar Oct 24 '24 00:10 andreibondarev