langchainrb
Preventing duplicates and noise in embeddings
Even if it is not yet in scope for langchainrb, I think this should be discussed, as people will inevitably run into this problem. When embedding documents with langchainrb, what is a good strategy to prevent the same documents or strings from being re-added repeatedly?
For a whole document I think checksums could work (although for big docs the cost of computing a checksum will increase) - but what about individual pages of a document or text chunks? Would love some guidance, and maybe later down the road langchainrb can help with this.
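To make the checksum idea concrete at the chunk level, here is a minimal sketch using only the Ruby standard library. `ChunkDeduplicator`, `fingerprint`, and `filter_new` are hypothetical names for illustration and are not part of langchainrb; the whitespace normalization is also just an assumption about what should count as "the same" chunk.

```ruby
require "digest"
require "set"

# Hypothetical helper, NOT part of langchainrb: remembers a SHA-256
# fingerprint for every chunk it has seen and filters out repeats.
class ChunkDeduplicator
  def initialize
    @seen = Set.new # in-memory only; a real setup would persist this
  end

  # Collapse runs of whitespace so trivially different copies of the
  # same text produce the same fingerprint (an assumed normalization).
  def fingerprint(text)
    normalized = text.strip.gsub(/\s+/, " ")
    Digest::SHA256.hexdigest(normalized)
  end

  # Returns only the chunks whose fingerprint has not been seen before.
  # Set#add? returns nil when the element is already present.
  def filter_new(chunks)
    chunks.select { |chunk| @seen.add?(fingerprint(chunk)) }
  end
end

dedup = ChunkDeduplicator.new
first = dedup.filter_new(["Hello world", "Hello   world ", "Second page"])
second = dedup.filter_new(["Second page", "Third page"])
```

Only the chunks returned by `filter_new` would then be passed on for embedding. This catches exact (or whitespace-only) duplicates; near-duplicates and noisy chunks would need a different approach, and the set of fingerprints would have to be stored somewhere durable to survive between runs.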
It seems this is done through indexing
I wonder if there's a roadmap on porting this feature into langchainrb
Thanks, that's really useful. It would be great to have something like this in langchainrb, at least a basic version to start with, as it is a real PITA to do this manually.
I'll be frank -- I'd like to rethink the whole data parsing -> chunking -> embedding pipeline first before adding more functionality on top of what's currently there.