langchain icon indicating copy to clipboard operation
langchain copied to clipboard

FAISS should allow you to specify id when using add_text

Open atisharma opened this issue 1 year ago • 3 comments

System Info

langchain 0.0.173 faiss-cpu 1.7.4

python 3.10.11 Void linux

Who can help?

@hwchase17

Information

  • [ ] The official example notebooks/scripts
  • [X] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [ ] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [ ] Document Loaders
  • [X] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [ ] Tools / Toolkits
  • [ ] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

It's a logic error in langchain.vectorstores.faiss.__add()

https://github.com/hwchase17/langchain/blob/0c3de0a0b32fadb8caf3e6d803287229409f9da9/langchain/vectorstores/faiss.py#L94-L100 https://github.com/hwchase17/langchain/blob/0c3de0a0b32fadb8caf3e6d803287229409f9da9/langchain/vectorstores/faiss.py#L118-L126

The id is not possible to specify as a function argument. This makes it impossible to detect duplicate additions, for instance.

Expected behavior

It should be possible to specify id of inserted documents / texts using the add_documents / add_texts methods, as it is in the Chroma object's methods.

As a side-effect this ability would also fix the inability to remove duplicates (see https://github.com/hwchase17/langchain/issues/2699 and https://github.com/hwchase17/langchain/issues/3896 ) by the approach of using ids unique to the content (I use a hash, for example).

atisharma avatar May 21 '23 16:05 atisharma

Unsure if this would be that great. Up to you to manage your own ids in the metadata.

Xmaster6y avatar May 23 '23 14:05 Xmaster6y

I saw your comment here: https://github.com/hwchase17/langchain/issues/2699#issuecomment-1559487780

As I said there, it would work but it loops over every vector in the store to compare (n) and again for those to remove (m). That would be horribly slow with large vectorstores (n x m).

Using the vectorstore's own ID mechanism would be both simpler and faster, surely?

atisharma avatar May 23 '23 15:05 atisharma

Fundamentally I think you're totally right. It just seemed to me a real pain but why not...

Maybe you could develop what you had in mind.

I see multiple approaches to do that, but the main work will be to consistently add those new features to other vectorstores and also docstores I guess.

Xmaster6y avatar May 23 '23 17:05 Xmaster6y

I should have mentioned in the bug report that langchain's chroma db abstraction does allow specifying ids. It's a pretty simple fix which I could offer a pull request for, but there's quite a big queue including another pull request I made, so was hesitant...

If you think it's worth doing I can produce a pull request.

atisharma avatar May 23 '23 18:05 atisharma

Hmm, in that case, why not. Up to you to create the PR. I'll be happy to help, so you can tag me for review or other.

If the ids are specified, it will be much easier to handle delete and update. I guess it should let the user choose whether or not to use custom ids, but it seems overall worth it.

Xmaster6y avatar May 23 '23 18:05 Xmaster6y

Cool. Thanks. I will try to stay consistent with the Chroma interface and will tag you.

atisharma avatar May 23 '23 19:05 atisharma