NearVector Search Returns Non-Existing Objects
How to reproduce this bug?
Unfortunately I cannot tell how to reproduce this in general.
In my setup, however, I noticed that a search with nearVector returns objects that no longer exist. They existed previously but were deleted some time ago.
What is the expected behavior?
When starting a search using nearVector, e.g.
```graphql
{
  Get {
    KnowledgeStoreObject(
      nearVector: {
        vector: [...] # 1536 dimensions, shortened
      }
      limit: 10
    ) {
      text
      cleanser_version
      _additional {
        id
      }
    }
  }
}
```
Each object in the result set must still be present in the store: retrieving it by ID via the REST API must succeed, and searching with a where filter on a unique property of those objects (cleanser_version in my case) must return them.
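That invariant (every vector-search hit must still be retrievable by ID) can be sketched as a small consistency check. This is not part of Weaviate itself: `fetch_by_id` is a hypothetical callable standing in for a by-ID lookup such as a GET against `/v1/objects/<id>`, and the in-memory dict below only illustrates the idea.

```python
from typing import Callable, Iterable, Optional

def split_hits(hit_ids: Iterable[str],
               fetch_by_id: Callable[[str], Optional[dict]]):
    """Partition search-result IDs into (live, stale).

    A hit is 'stale' when the by-ID lookup no longer finds the object,
    i.e. the vector index returned a deleted entry.
    """
    live, stale = [], []
    for obj_id in hit_ids:
        (live if fetch_by_id(obj_id) is not None else stale).append(obj_id)
    return live, stale

# Illustration with an in-memory stand-in for the object store:
store = {"d3a5fb05-ce87-4aff-938f-84513192702d": {"cleanser_version": 2}}
hits = ["d3a5fb05-ce87-4aff-938f-84513192702d",
        "d9391d4d-4520-475a-a43b-9cdfe8745eb3"]  # second one was deleted
live, stale = split_hits(hits, store.get)
# stale == ["d9391d4d-4520-475a-a43b-9cdfe8745eb3"]
```

In a real check, `fetch_by_id` would wrap the REST call and treat a 404 as `None`.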
What is the actual behavior?
In my search I get back several objects that still exist, and others that were deleted some while ago.
I can detect the already-deleted objects via the custom property cleanser_version: all entries should have the value 2, because all entries with value 1 have already been deleted.
Search result (shortened):
```json
{
  "data": {
    "Get": {
      "KnowledgeStoreObject": [
        {
          "_additional": {
            "id": "d3a5fb05-ce87-4aff-938f-84513192702d"
          },
          "cleanser_version": 2,
          "text": "{\"Header 1\":\"Available Plans in the Kyma Environment\"}\nDepending on your global account type, you have access to a different plan that specifies the cluster parameters for the Kyma environment.\n"
        },
        {
          "_additional": {
            "id": "d9391d4d-4520-475a-a43b-9cdfe8745eb3"
          },
          "cleanser_version": 1,
          "text": "{\"Header 1\":\"Available Plans in the Kyma Environment\"}\nDepending on your global account type, you have access to a different plan that specifies the cluster parameters for the Kyma environment.\n"
        }
      ]
    }
  }
}
```
Querying for suspicious result objects via the REST API (e.g. <baseUrl>/v1/objects/d9391d4d-4520-475a-a43b-9cdfe8745eb3) returns

```json
{
  "error": [
    {
      "message": "no object with id 'd9391d4d-4520-475a-a43b-9cdfe8745eb3'"
    }
  ]
}
```
(which is expected, as the object was deleted)
Searching for unique properties via a where filter, e.g.

```graphql
{
  Get {
    KnowledgeStoreObject(
      where: {
        path: ["cleanser_version"]
        operator: Equal
        valueNumber: 1
      }
    ) {
      _additional {
        id
      }
      text
      cleanser_version
    }
  }
}
```
does not return any objects (whereas searching for value 2 returns all objects). Again, this is expected, as those objects were deleted.
What is not expected is that the deleted objects still show up in the nearVector result set.
I suspect the search index is somehow corrupted. How can I rebuild it?
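Taken together, the two queries give a mechanical detector for this inconsistency: any ID that the nearVector search returns but that no where filter can reach is a stale vector-index entry. A minimal sketch of that set comparison, with the ID lists below standing in for the (hypothetical) parsed results of the two queries:

```python
def stale_index_entries(vector_hit_ids, filter_hit_ids):
    """IDs present in the vector-search result but absent from the
    where-filter (inverted index) result: candidates for stale entries
    left behind in the vector index after a deletion."""
    return sorted(set(vector_hit_ids) - set(filter_hit_ids))

# Hypothetical parsed results of the two GraphQL queries above:
vector_hits = ["d3a5fb05-ce87-4aff-938f-84513192702d",
               "d9391d4d-4520-475a-a43b-9cdfe8745eb3"]
filter_hits = ["d3a5fb05-ce87-4aff-938f-84513192702d"]  # where cleanser_version == 2
stale = stale_index_entries(vector_hits, filter_hits)
# stale == ["d9391d4d-4520-475a-a43b-9cdfe8745eb3"]
```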
Supporting information
Weaviate version: 1.21.2
Replica count: 1
Server version: 1.21.2
Code of Conduct
- [X] I have read and agree to the Weaviate Contributor Guide and Code of Conduct
Hi!
Is this a multi-node deployment? Were you able to reproduce this in the latest version, 1.22.6?
Thanks!
No, as written in the description, I run with a replica count of 1, so the inconsistency does not come from a sync issue between different nodes.

Our setup is a nightly job that scans our data sources, checks whether anything changed, and replaces entries in the store when the corresponding source has changed. So every night there are some deletions and some additions; this is probably where the search index is updated and where some inconsistency was introduced.

Our database is not big at all: in total we have around 25,000 entries in the vector store, and each night approximately 20 of them are replaced. Occasionally we have to replace all entries, e.g. when we update a processing step in our data ingestion pipeline. Even then, the entries are replaced sequentially: first delete some entries (all chunks that belong to a certain document), then add some new entries (the updated chunks of that document), then the same for the next document, and so on until all documents have been processed. We do not even issue parallel requests to the instance in this setup...
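The delete-then-add workflow described above can be sketched with an in-memory stand-in for the store. This is only an illustration of the access pattern, not real client code: in the actual setup each dict operation would be a (batch) delete or insert request against the Weaviate instance, and the chunk-ID scheme is made up here.

```python
def replace_document(store, doc_id, new_chunks):
    """Sequentially replace all chunks of one document:
    first delete the old chunks, then add the updated ones."""
    # 1) delete every chunk belonging to this document
    for chunk_id in [cid for cid, c in store.items() if c["doc"] == doc_id]:
        del store[chunk_id]
    # 2) add the updated chunks (all new entries carry cleanser_version 2)
    for i, text in enumerate(new_chunks):
        store[f"{doc_id}-chunk-{i}"] = {"doc": doc_id, "text": text,
                                        "cleanser_version": 2}

# One changed document per iteration of the nightly job:
store = {"kyma-chunk-0": {"doc": "kyma", "text": "old",
                          "cleanser_version": 1}}
replace_document(store, "kyma", ["Available Plans in the Kyma Environment"])
# After the replace, no cleanser_version == 1 entries remain in the store.
```

After every document has been processed this way, a where filter on cleanser_version = 1 should find nothing, which is exactly the check used above to spot the stale vector-index entries.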
Regarding reproducibility: as stability is key for our system, for the moment we have switched to a working mode where, at each run, we rebuild the complete vector database from scratch on a separate instance and then transport a backup of the freshly filled system to our production instance. With that, of course, the issue is no longer reproducible. However, as a full run (with all the necessary preprocessing steps and embedding calculations) takes several hours, we want to switch back to updating only changed entries on the live system. This will be tested next year; I'll update this issue with my findings on if and how I can reproduce the problem.
/bounty $300
💎 $300 bounty • Weaviate
Steps to solve:
- Start working: Comment /attempt #3868 with your implementation plan
- Submit work: Create a pull request including /claim #3868 in the PR body to claim the bounty
- Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts
Thank you for contributing to weaviate/weaviate!
| Attempt | Started (GMT+0) | Solution |
|---|---|---|
| 🔴 @Rutik7066 | May 14, 2024, 4:33:47 PM | WIP |
| 🔴 @kumarvivek1752 | Oct 20, 2024, 5:35:25 AM | WIP |
| 🟢 @zelosleone | Dec 16, 2024, 12:47:22 PM | #6672 |
@philipvollet are we facing this issue only with GraphQL?
Hey bounty hunter 😏, yes, I only used GraphQL. However, I have not yet been able to reproduce the issue, even after we switched back to nightly updates of our database. I'm still missing a concept of how to reproduce the issue, or even how to detect it other than by accident...