Inserting document with same `doc_id`
How would `GPTSimpleVectorIndex` react if I inserted a document whose `doc_id` matches that of a document already in the index?
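```python
from llama_index import Document, GPTSimpleVectorIndex

index = GPTSimpleVectorIndex()

# First insert of a document with a user-provided ID.
document = Document(..., doc_id="id")
index.insert(document)

# Second document that reuses the same ID.
updated_document = Document(..., doc_id="id")
# 👇 What will happen here?
index.insert(updated_document)
```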
Will it just update the document/nodes/vectors? Is it even safe to do so?
Also related: is there a way to check whether a document with a given `doc_id` is present in the index?
I presume that's what `GPTSimpleVectorIndex().docstore.document_exists(doc_id)` is for?
It seems like `GPTSimpleVectorIndex` ignores the user-specified `Document().doc_id`: `docstore.docs` contains only UUID v4 IDs and none of the user-provided ones. The nodes, however, do have `ref_doc_id` set to the user's `doc_id`. Is there a reason for this behaviour? I would very much like to query documents by the IDs that I provide rather than the ones llama_index assigns, or at least query by `ref_doc_id`.
Hi @TmLev, we currently have an `update` function, which deletes the document and then inserts the new one. Is that what you're looking for?
An alternative UX we're considering is to make our `insert` function an upsert instead.
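To make the two options concrete, here is a minimal sketch on a toy in-memory store. `ToyIndex` and all of its methods are hypothetical stand-ins, not llama_index's actual implementation; the sketch only illustrates "update = delete, then insert" and what upsert semantics would mean for a repeated `doc_id`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical stand-in for an index: maps doc_id -> list of text chunks."""

    chunks_by_doc: dict = field(default_factory=dict)

    def insert(self, doc_id: str, text: str) -> None:
        # Naive insert: appends chunks, so a repeated doc_id keeps the stale ones too.
        self.chunks_by_doc.setdefault(doc_id, []).extend(text.split(". "))

    def delete(self, doc_id: str) -> None:
        # Drop every chunk that belongs to the document; unknown IDs are ignored.
        self.chunks_by_doc.pop(doc_id, None)

    def update(self, doc_id: str, text: str) -> None:
        # Update as described above: delete the old document, then insert the new one.
        self.delete(doc_id)
        self.insert(doc_id, text)

    def upsert(self, doc_id: str, text: str) -> None:
        # Upsert: behaves like insert for new IDs and like update for existing ones.
        self.update(doc_id, text)
```

In this toy model, calling `insert` twice with the same ID leaves both the old and the new chunks in the store, while `update` and `upsert` leave only the new ones.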
> Hi @TmLev, we do currently have an `update` function - which deletes the doc then inserts. Is that what you'd be looking for?
My original question is "what would happen if I were to insert a document with an ID of an already present document?"
> My original question is "what would happen if I were to insert a document with an ID of an already present document?"
Yeah, I guess my point is that in those cases the `insert` call should raise an error, and you should really just use the `update` function instead.
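The "insert should error" behaviour suggested here could look roughly like the sketch below. `StrictStore` and `DuplicateDocumentError` are hypothetical names made up for illustration, and `document_exists` only borrows the name mentioned earlier in the thread; none of this is taken from llama_index's code.

```python
class DuplicateDocumentError(ValueError):
    """Hypothetical error raised when a doc_id is inserted twice."""


class StrictStore:
    """Hypothetical doc_id -> text store whose insert refuses duplicate IDs."""

    def __init__(self) -> None:
        self._docs = {}

    def document_exists(self, doc_id: str) -> bool:
        # True if a document with this ID has already been stored.
        return doc_id in self._docs

    def insert(self, doc_id: str, text: str) -> None:
        # Fail loudly instead of silently duplicating or overwriting.
        if self.document_exists(doc_id):
            raise DuplicateDocumentError(
                f"doc_id {doc_id!r} is already present; use update() instead"
            )
        self._docs[doc_id] = text

    def update(self, doc_id: str, text: str) -> None:
        # Replace whatever is stored under doc_id (delete + insert in one step).
        self._docs[doc_id] = text
```

With this design the caller has to choose explicitly between adding a new document and replacing an existing one, at the cost of an extra existence check before each insert.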
@TmLev heads up, going to close this issue for now unless you had additional issues to raise (feel free to reopen)