vault-ai
Multiple file uploads overwriting previous embeddings
```go
vectors := make([]PineconeVector, len(embeddings))
for i, embedding := range embeddings {
	chunk := chunks[i]
	vectors[i] = PineconeVector{
		ID:     fmt.Sprintf("id-%d", i),
		Values: embedding,
		Metadata: map[string]string{
			"file_name": chunk.Title,
			"start":     strconv.Itoa(chunk.Start),
			"end":       strconv.Itoa(chunk.End),
			"title":     chunk.Title,
			"text":      chunk.Text,
		},
	}
}
```
This works well for batched uploads, but because the IDs always restart at `id-0`, a second upload overwrites the embeddings from the first. UUIDs would allow multiple uploads, unless the overwriting is intentional to keep a Pinecone instance from growing unbounded. If that's the case, perhaps you could add a `public` flag and use some other ID scheme for private instances?
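One possibility along these lines (a sketch, not the repo's actual scheme): derive each vector ID from the upload's UUID plus the chunk index, so every upload writes distinct IDs while staying deterministic within one upload. The `makeVectorID` helper here is hypothetical.

```go
package main

import "fmt"

// makeVectorID combines a per-upload identifier with the chunk index,
// so chunks from different uploads never share a Pinecone vector ID.
// (Hypothetical helper; the repo's loop uses only the index.)
func makeVectorID(uploadID string, i int) string {
	return fmt.Sprintf("%s-%d", uploadID, i)
}

func main() {
	fmt.Println(makeVectorID("9b2d", 0)) // prints "9b2d-0"
	fmt.Println(makeVectorID("9b2d", 1)) // prints "9b2d-1"
}
```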
I fixed this by changing the ID to a UUID so it doesn't overwrite the previously stored files when uploading new ones.
Will add a PR later today.
Any updates on this issue? @AitoD can you please elaborate on what you changed to use UUIDs rather than sequential IDs? I am running into the overwriting issue while trying to compile a knowledge base.
@AitoD
I tried changing a few things in `pinecone.go` and the postapi files, but `npm start` kept failing with:

```
postapi\pinecone.go:29:14: uuid.New undefined (type string has no field or method New) ... error
```

and

```
postapi\pinecone.go:97:2: syntax error: non-declaration statement outside function body ... error
```
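The first error is a classic Go shadowing problem: `upsertEmbeddingsToPinecone` takes a parameter named `uuid string`, so inside that function the identifier `uuid` refers to the string parameter rather than the imported `uuid` package, and `uuid.New` cannot resolve. A minimal illustration (the `demo` function is hypothetical, shown without the third-party import):

```go
package main

import "fmt"

// In the real code, a package named uuid is imported. A parameter with the
// same name shadows it: inside demo, the identifier uuid is this string,
// so a call like uuid.New() would fail to compile with
// "type string has no field or method New".
func demo(uuid string) string {
	return fmt.Sprintf("inside demo, uuid is the string %q", uuid)
}

func main() {
	fmt.Println(demo("abc123"))
}
```

Renaming either the parameter or the package import (as the diff below does with `googleid`) resolves the clash.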
Got it. I think this works:
```diff
diff --git a/vault-web-server/postapi/pinecone.go b/vault-web-server/postapi/pinecone.go
index 2d8f1bd..0f9fae3 100644
--- a/vault-web-server/postapi/pinecone.go
+++ b/vault-web-server/postapi/pinecone.go
@@ -11,6 +11,7 @@ import (
 	"math"
 	"net/http"
 	"strconv"
+	googleid "github.com/google/uuid"
 )
 
 type PineconeVector struct {
@@ -27,8 +28,9 @@ func upsertEmbeddingsToPinecone(embeddings [][]float32, chunks []Chunk, uuid str
 	vectors := make([]PineconeVector, len(embeddings))
 	for i, embedding := range embeddings {
 		chunk := chunks[i]
+		myuuid := googleid.NewString()
 		vectors[i] = PineconeVector{
-			ID: fmt.Sprintf("id-%d", i),
+			ID: fmt.Sprintf("id-%s", myuuid),
 			Values: embedding,
 			Metadata: map[string]string{
 				"file_name": chunk.Title,
```
Basically, import the google/uuid library under a different name so it isn't shadowed by the `uuid` function parameter, then use it to generate a fresh UUID for each vector ID.
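Since the diff above is truncated, here is a self-contained sketch of the patched loop. To keep it runnable without the third-party dependency, `newUUID` is a stdlib stand-in for `googleid.NewString()` from github.com/google/uuid, and the `Chunk`/`PineconeVector` types are pared down to what the loop needs:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"strconv"
)

type Chunk struct {
	Title, Text string
	Start, End  int
}

type PineconeVector struct {
	ID       string
	Values   []float32
	Metadata map[string]string
}

// newUUID is a stdlib stand-in for googleid.NewString(): it returns a
// random version-4 UUID string, so repeated uploads never reuse an ID.
func newUUID() string {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err)
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

// buildVectors mirrors the patched loop: each vector gets a unique
// "id-<uuid>" instead of the colliding "id-<index>".
func buildVectors(embeddings [][]float32, chunks []Chunk) []PineconeVector {
	vectors := make([]PineconeVector, len(embeddings))
	for i, embedding := range embeddings {
		chunk := chunks[i]
		vectors[i] = PineconeVector{
			ID:     fmt.Sprintf("id-%s", newUUID()),
			Values: embedding,
			Metadata: map[string]string{
				"file_name": chunk.Title,
				"start":     strconv.Itoa(chunk.Start),
				"end":       strconv.Itoa(chunk.End),
				"title":     chunk.Title,
				"text":      chunk.Text,
			},
		}
	}
	return vectors
}

func main() {
	vs := buildVectors([][]float32{{0.1}, {0.2}}, []Chunk{{Title: "a.txt"}, {Title: "b.txt"}})
	// IDs are random UUIDs, so they differ with overwhelming probability.
	fmt.Println(vs[0].ID != vs[1].ID)
}
```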
I don't have the code handy, but it seems to me that the line that needs to change is:

```go
ID: fmt.Sprintf("id-%d", i)
```

According to ChatGPT, one way to generate a UUID in Go is:

```go
package main

import (
	"fmt"

	uuid "github.com/satori/go.uuid"
)

func main() {
	// NewV4 returns a random (version 4) UUID; some versions of this
	// library also return an error.
	myuuid, err := uuid.NewV4()
	if err != nil {
		panic(err)
	}
	fmt.Println(myuuid)
}
```

So that may do the trick. I'll try it later when I have access to the code.
Would I be putting this in the `pinecone.go` file? I received a syntax error, so I imagine I'm doing something wrong haha.