NearVector Search Returns Non-Existing Objects

Open the-powerpointer opened this issue 2 years ago • 8 comments

How to reproduce this bug?

Unfortunately, I cannot say how to reproduce this in general. In my setup, however, I noticed that a nearVector search returns objects that no longer exist. They existed previously but were deleted some time ago.

What is the expected behavior?

When starting a search using nearVector, e.g.

{
	Get {
		KnowledgeStoreObject(
			nearVector: {
				vector: [ <- 1536 -> ] -> shortened
			}
			limit: 10
		) {
			text
			cleanser_version
			_additional {
				id
			}
		}
	}
}

Each resulting object must be present in the store. Retrieving an object from the result set by ID via the REST API should return that object. Searching with a where filter on certain unique properties of those objects (like cleanser_version in my case) should also retrieve them.

What is the actual behavior?

My search returns several objects that exist, along with others that were deleted some time ago. I can identify the deleted objects by the custom property cleanser_version: all entries should have the value 2, because all entries with the value 1 have already been deleted. Search result (shortened):

{
	"data": {
		"Get": {
			"KnowledgeStoreObject": [
				{
					"_additional": {
						"id": "d3a5fb05-ce87-4aff-938f-84513192702d"
					},
					"cleanser_version": 2,
					"text": "{\"Header 1\":\"Available Plans in the Kyma Environment\"}\nDepending on your global account type, you have access to a different plan that specifies the cluster parameters for the Kyma environment.\n"
				},
				{
					"_additional": {
						"id": "d9391d4d-4520-475a-a43b-9cdfe8745eb3"
					},
					"cleanser_version": 1,
					"text": "{\"Header 1\":\"Available Plans in the Kyma Environment\"}\nDepending on your global account type, you have access to a different plan that specifies the cluster parameters for the Kyma environment.\n"
				}
			]
		}
	}
}
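The detection step described above can be sketched in Python. This is only an illustration: `find_unexpected_versions` is a hypothetical helper, not part of any Weaviate client, and it assumes the GraphQL response has already been parsed into a dict shaped like the shortened result (the `text` field is left out here for brevity).

```python
def find_unexpected_versions(response, class_name="KnowledgeStoreObject", expected=2):
    """Return the IDs of results whose cleanser_version differs from `expected`."""
    objects = response.get("data", {}).get("Get", {}).get(class_name) or []
    return [
        obj["_additional"]["id"]
        for obj in objects
        if obj.get("cleanser_version") != expected
    ]


# Same structure as the shortened search result, without the text property.
response = {
    "data": {
        "Get": {
            "KnowledgeStoreObject": [
                {
                    "_additional": {"id": "d3a5fb05-ce87-4aff-938f-84513192702d"},
                    "cleanser_version": 2,
                },
                {
                    "_additional": {"id": "d9391d4d-4520-475a-a43b-9cdfe8745eb3"},
                    "cleanser_version": 1,
                },
            ]
        }
    }
}

print(find_unexpected_versions(response))
# ['d9391d4d-4520-475a-a43b-9cdfe8745eb3']
```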

Querying for suspicious result objects via REST API (e.g. <baseUrl>/v1/objects/d9391d4d-4520-475a-a43b-9cdfe8745eb3) returns

{
	"error": [
		{
			"message": "no object with id 'd9391d4d-4520-475a-a43b-9cdfe8745eb3'"
		}
	]
}
(which is expected, as the object was deleted)
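The ID lookup can also be automated for a whole result set. The following is a minimal sketch, not an official client: `object_exists` and `find_missing_ids` are hypothetical names, the base URL is whatever your instance uses, and the code assumes the endpoint answers with HTTP 404 when the object does not exist.

```python
import urllib.error
import urllib.request


def object_exists(base_url, object_id, timeout=10.0):
    """GET <base_url>/v1/objects/<object_id>.

    Assumption: a missing object is reported with HTTP 404.
    Any other HTTP error is re-raised instead of being hidden.
    """
    url = f"{base_url.rstrip('/')}/v1/objects/{object_id}"
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def find_missing_ids(result_ids, exists):
    """Return the IDs from a search result for which `exists` returns False."""
    return [object_id for object_id in result_ids if not exists(object_id)]
```

With the IDs taken from a nearVector result, a call such as `find_missing_ids(ids, lambda i: object_exists(base_url, i))` would list the entries that the search returned but the object store no longer holds.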

Searching for unique properties via where filter, e.g.

{
	Get {
		KnowledgeStoreObject(
			where: {
				path: ["cleanser_version"]
				operator: Equal
				valueNumber: 1
			}
		) {
			_additional {
				id
			}
			text
			cleanser_version
		}
	}
}

does not return any objects, whereas searching for the value 2 returns all objects. Again, this is expected, as the object was deleted.

What is not expected is that the deleted objects show up in the nearVector result set.

I suspect that the search index is corrupted in some way. How can I rebuild it?

Supporting information

Weaviate version: 1.21.2
Replica count: 1

Server Version

1.21.2

the-powerpointer avatar Dec 07 '23 09:12 the-powerpointer

Hi!

Is this a multi-node deployment? Were you able to reproduce this with the latest version, 1.22.6?

Thanks!

dudanogueira avatar Dec 09 '23 14:12 dudanogueira

No, as written in the description, I run with a replica count of 1, so the inconsistency does not come from a sync issue between different nodes. In our setup, a nightly job scans our data sources, checks whether anything has changed, and replaces the entries in the store whose corresponding source has changed. So every night there are some deletions and some additions. This is probably where the search index is updated and where the inconsistency was introduced.

Our database is not big at all. In total we have around 25000 entries in the vector store, and each night approximately 20 of them are replaced. Occasionally we have to replace all entries, e.g. when we update a processing step in our data ingestion pipeline. Even then, the entries are replaced sequentially: first we delete some entries (all chunks that belong to a certain document), then we add some new entries (the updated chunks from that document), then we do the same for the next document, and so on until all documents have been processed. We do not even issue parallel requests to the instance in this setup.
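The sequential replace pattern described above can be sketched roughly as follows. This is an illustration only: `InMemoryStore`, `delete_document`, `add_chunk`, and `replace_documents` are hypothetical names standing in for the real store and ingestion job, and the ID scheme is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the vector store, keyed by object ID."""
    objects: dict = field(default_factory=dict)

    def delete_document(self, doc_id):
        # Remove every chunk that belongs to the given document.
        stale = [oid for oid, obj in self.objects.items() if obj["doc_id"] == doc_id]
        for oid in stale:
            del self.objects[oid]
        return len(stale)

    def add_chunk(self, object_id, doc_id, text):
        self.objects[object_id] = {"doc_id": doc_id, "text": text}


def replace_documents(store, changed_docs):
    """Process one document at a time: delete its old chunks, then add the new ones."""
    for doc_id, chunks in changed_docs.items():
        store.delete_document(doc_id)
        for index, text in enumerate(chunks):
            store.add_chunk(f"{doc_id}-{index}", doc_id, text)


store = InMemoryStore()
store.add_chunk("doc-a-0", "doc-a", "old chunk 1")
store.add_chunk("doc-a-1", "doc-a", "old chunk 2")
store.add_chunk("doc-b-0", "doc-b", "unchanged")

replace_documents(store, {"doc-a": ["new chunk"]})
print(sorted(store.objects))
# ['doc-a-0', 'doc-b-0']
```

After such a run, no chunk of a replaced document should remain reachable, either by ID or through a vector search; the reported bug is that the vector search still returned some of them.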

Regarding reproducibility: as stability is key for our system, we have decided for the moment to switch to a working mode where we rebuild the complete vector database from scratch on a separate instance at each run, and then transport a backup of the freshly filled system to our productive instance. With that, of course, the issue is no longer reproducible. However, as a full run (with all the necessary preprocessing steps and embedding calculations) takes several hours, we want to switch back to updating only the changed entries on the live system. This will be tested next year. I'll update this issue with my findings on whether and how I can reproduce the problem.

the-powerpointer avatar Dec 14 '23 20:12 the-powerpointer

/bounty $300

philipvollet avatar May 13 '24 11:05 philipvollet

💎 $300 bounty • Weaviate

Steps to solve:

  1. Start working: Comment /attempt #3868 with your implementation plan
  2. Submit work: Create a pull request including /claim #3868 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to weaviate/weaviate!

Attempt | Started (GMT+0) | Solution
🔴 @Rutik7066 | May 14, 2024, 4:33:47 PM | WIP
🔴 @kumarvivek1752 | Oct 20, 2024, 5:35:25 AM | WIP
🟢 @zelosleone | Dec 16, 2024, 12:47:22 PM | #6672

algora-pbc[bot] avatar May 13 '24 11:05 algora-pbc[bot]

/attempt

Algora profile | Completed bounties | Tech
@Rutik7066 | 8 bounties from 6 projects | Go, Rust, TypeScript & more

rutikthakre avatar May 14 '24 16:05 rutikthakre

@philipvollet are we facing this issue with only GraphQL?

rutikthakre avatar May 16 '24 19:05 rutikthakre

Hey bounty hunter 😏, yes, I only used GraphQL. However, I have not yet been able to reproduce the issue, even after we switched back to nightly updates of our database. I still lack an idea of how to reproduce the issue, or even how to detect it other than by accident.

the-powerpointer avatar May 17 '24 06:05 the-powerpointer

/attempt #3868

kumarvivek1752 avatar Oct 20 '24 05:10 kumarvivek1752

/attempt #3868

zelosleone avatar Dec 16 '24 12:12 zelosleone