Avoid duplicated entries in VectorStore(s) by allowing generation of Document ID based on the hashed document content.

Open tzolov opened this issue 2 years ago • 1 comments

Currently, a Document that is not given an explicit ID generates a random UUID. A new ID is generated every time, even if the document content and metadata haven't changed, which leads to duplicated document content in the vector store.

To prevent these unnecessary duplicates, we can allow the Document ID to be generated from a hash of the document content and metadata.

The following snippet is inspired by the langchain4j vector store implementations.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

....

public static String generateIdFrom(String contentWithMetadata) {
    try {
        // Hash the combined content and metadata with SHA-256.
        byte[] hashBytes = MessageDigest.getInstance("SHA-256").digest(contentWithMetadata.getBytes(StandardCharsets.UTF_8));
        // Hex-encode the digest.
        StringBuilder sb = new StringBuilder();
        for (byte b : hashBytes) {
            sb.append(String.format("%02x", b));
        }
        // Derive a name-based UUID from the hex string, so equal input always yields the same ID.
        return UUID.nameUUIDFromBytes(sb.toString().getBytes(StandardCharsets.UTF_8)).toString();
    }
    catch (NoSuchAlgorithmException e) {
        throw new IllegalArgumentException(e);
    }
}
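
As a quick illustration of why this avoids duplicates, here is a minimal, self-contained sketch that wraps the method above in a class and calls it twice. The class name `ContentHashIdDemo` and the `canonicalize` helper are hypothetical and not part of spring-ai. The helper assumes one possible way of combining content and metadata (sorting the metadata keys so that map iteration order does not change the hash) and relies on the metadata values having a stable `toString()`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

public class ContentHashIdDemo {

    // Same logic as the snippet above: SHA-256, hex-encode, then a name-based UUID.
    public static String generateIdFrom(String contentWithMetadata) {
        try {
            byte[] hashBytes = MessageDigest.getInstance("SHA-256")
                    .digest(contentWithMetadata.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hashBytes) {
                sb.append(String.format("%02x", b));
            }
            return UUID.nameUUIDFromBytes(sb.toString().getBytes(StandardCharsets.UTF_8)).toString();
        }
        catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Hypothetical helper (an assumption, not spring-ai API): joins content and
    // metadata into one string, with metadata keys sorted via TreeMap so that
    // insertion order of the map does not influence the resulting ID.
    public static String canonicalize(String content, Map<String, Object> metadata) {
        return content + "\n" + new TreeMap<>(metadata);
    }

    public static void main(String[] args) {
        Map<String, Object> first = new HashMap<>();
        first.put("source", "a.txt");
        first.put("page", 1);

        // Same entries, inserted in a different order.
        Map<String, Object> second = new HashMap<>();
        second.put("page", 1);
        second.put("source", "a.txt");

        String id1 = generateIdFrom(canonicalize("Hello", first));
        String id2 = generateIdFrom(canonicalize("Hello", second));
        String id3 = generateIdFrom(canonicalize("Hello!", first));

        System.out.println(id1.equals(id2)); // same content and metadata, same ID
        System.out.println(id1.equals(id3)); // changed content, different ID
    }
}
```

With IDs like these, re-adding an unchanged document produces the same ID, so a store that upserts by ID would overwrite the existing entry rather than add a second copy.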

tzolov, Nov 18 '23 11:11

Currently we have:

	public Document(String content, Map<String, Object> metadata) {
		this(UUID.randomUUID().toString(), content, metadata);
	}

Perhaps we can add a strategy interface that can optionally be passed in, with an implementation based on the snippet listed above.

	public Document(String content, Map<String, Object> metadata, IdGenerator idGenerator) {

...
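
To make the idea concrete, here is a rough sketch of how such a strategy could be wired in. Everything in it is an assumption for illustration: the `IdGenerator` signature, the `RANDOM` and `CONTENT_HASH` implementations, and the `SketchDocument` class, which is only a simplified stand-in for the real Document. The content-hash variant assumes that metadata values have a stable `toString()`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

public class IdGeneratorSketch {

    /** Hypothetical strategy interface; the method signature is an assumption. */
    public interface IdGenerator {
        String generateId(String content, Map<String, Object> metadata);
    }

    /** Keeps the current behavior: a fresh random UUID for every document. */
    public static final IdGenerator RANDOM =
            (content, metadata) -> UUID.randomUUID().toString();

    /** Derives the ID from a SHA-256 hash of the content plus key-sorted metadata. */
    public static final IdGenerator CONTENT_HASH = (content, metadata) -> {
        String canonical = content + "\n" + new TreeMap<>(metadata);
        String hex = sha256Hex(canonical);
        return UUID.nameUUIDFromBytes(hex.getBytes(StandardCharsets.UTF_8)).toString();
    };

    private static String sha256Hex(String input) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }
        catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Simplified stand-in for Document, only to show the constructor wiring. */
    public static final class SketchDocument {

        private final String id;
        private final String content;
        private final Map<String, Object> metadata;

        // Existing two-argument form keeps the random-UUID default.
        public SketchDocument(String content, Map<String, Object> metadata) {
            this(content, metadata, RANDOM);
        }

        // New overload: the caller chooses how the ID is generated.
        public SketchDocument(String content, Map<String, Object> metadata, IdGenerator idGenerator) {
            this.id = idGenerator.generateId(content, metadata);
            this.content = content;
            this.metadata = metadata;
        }

        public String getId() {
            return id;
        }

        public String getContent() {
            return content;
        }

        public Map<String, Object> getMetadata() {
            return metadata;
        }
    }
}
```

Keeping the random generator as the default would leave existing callers unaffected, while callers that want deduplication could opt in by passing the content-hash strategy.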

markpollack, Jan 10 '24 17:01

I have just created a PR for this feature (my first PR here): https://github.com/spring-projects/spring-ai/pull/272

nurlicht, Jan 25 '24 22:01