Seems to forget about earlier documents

Open cs96and opened this issue 1 year ago • 4 comments

If I upload a few documents, then it seems to forget about ones that I uploaded earlier. Is there a limit to the number of documents or tokens it will store per user?

cs96and avatar Apr 18 '23 17:04 cs96and

Having the same issue. In my testing it only seems able to reference the most recently uploaded document, even when running locally with my own Pinecone index.

eabjab avatar Apr 18 '23 17:04 eabjab

Having the same issue: only the most recently uploaded documents are referenced. It seems to overwrite the index whenever you upload a new file.

AitoD avatar Apr 19 '23 16:04 AitoD

Same issue here as well. I suspect it might be related to Pinecone itself: I uploaded multiple files and the vector count stopped increasing at around 4k. Since then, after uploading any new documents, it only references the last 2 documents for me.

I used 1536 for Dimensions and left the rest at the defaults. I uploaded my first document (an XML file with multiple ItemIDs) and asked it for the first and last ItemID in the document, and it got it nearly right. Then I uploaded more files with the same structure, periodically asking for the first and last ItemID, and it started to behave as if it only saw the last 2 documents uploaded.

Ninn0x4F avatar Apr 19 '23 16:04 Ninn0x4F

The issue is in how the upload code assigns IDs here:

	for i, embedding := range embeddings {
		chunk := chunks[i]
		vectors[i] = PineconeVector{
			ID:     fmt.Sprintf("id-%d", i),
			Values: embedding,
			Metadata: map[string]string{
				"file_name": chunk.Title,
				"start":     strconv.Itoa(chunk.Start),
				"end":       strconv.Itoa(chunk.End),
				"title":     chunk.Title,
				"text":      chunk.Text,
			},
		}
	}

The vector ID is just the chunk's index within a single file, so in a multi-file upload the IDs overlap (id-0, id-1, and so on repeat for file 1, file 2, etc.). The upsert operation updates or inserts based on the ID, so each new upload overwrites vectors from previously processed files.
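
To make the overwrite concrete, here is a minimal sketch that mimics upsert semantics with an in-memory map. The map, the upsert helper, and the file names a.xml and b.xml are all assumptions for illustration; none of this is the real Pinecone client or the project's code.

```go
package main

import "fmt"

// upsert mimics upsert semantics with an in-memory map: writing to an ID
// that already exists replaces the stored value instead of adding a new one.
// The map is a hypothetical stand-in for the Pinecone index.
func upsert(index map[string]string, fileName string, chunks []string) {
	for i, text := range chunks {
		// Same ID scheme as the upload loop above: the chunk index only.
		id := fmt.Sprintf("id-%d", i)
		index[id] = fileName + ": " + text
	}
}

// simulateUploads upserts two hypothetical files, one after the other,
// and returns the resulting index contents.
func simulateUploads() map[string]string {
	index := make(map[string]string)
	upsert(index, "a.xml", []string{"a0", "a1", "a2"})
	upsert(index, "b.xml", []string{"b0", "b1"})
	return index
}
```

Five chunks go in, but simulateUploads returns only three entries: id-0 and id-1 now belong to b.xml, and only id-2 still holds a chunk from a.xml. If the real index behaves the same way, that may be why the vector count stops growing and why fragments of an older, longer file can still turn up in answers.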

Switching this to an ID built from a hash of the filename plus the chunk number would keep IDs unique across files with different names.

For example:

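// HashFileName returns the hex-encoded SHA-256 digest of filename.
// Needs the crypto/sha256 and encoding/hex imports.
func HashFileName(filename string) string {
	hash := sha256.Sum256([]byte(filename))
	return hex.EncodeToString(hash[:])
}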

//...

func (p *Pinecone) UploadEmbeddings(embeddings [][]float32, chunks []Chunk) error {
	// Prepare URL
	url := p.APIEndpoint + "/vectors/upsert"

	// Prepare the vectors
	vectors := make([]PineconeVector, len(embeddings))
	for i, embedding := range embeddings {
		vectorID := fmt.Sprintf("id-%s-%d", HashFileName(chunks[i].Title), i)
		vectors[i] = PineconeVector{
			ID:     vectorID,
			Values: embedding,
// ...
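
As a self-contained sketch of just the ID scheme, the version below can be compiled and checked on its own. The cut-down Chunk type and the vectorIDs helper are hypothetical; the real Chunk has more fields and the real code builds PineconeVector values rather than a slice of strings.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Chunk is a cut-down, hypothetical stand-in for the project's chunk type.
type Chunk struct {
	Title string // file name the chunk came from
	Text  string
}

// hashFileName returns the hex-encoded SHA-256 digest of filename.
func hashFileName(filename string) string {
	hash := sha256.Sum256([]byte(filename))
	return hex.EncodeToString(hash[:])
}

// vectorIDs returns one ID per chunk in the form
// id-<sha256 of title>-<index in the slice>.
func vectorIDs(chunks []Chunk) []string {
	ids := make([]string, len(chunks))
	for i, c := range chunks {
		ids[i] = fmt.Sprintf("id-%s-%d", hashFileName(c.Title), i)
	}
	return ids
}
```

With this scheme, chunks from a.xml and b.xml no longer share IDs. Two different files that happen to have the same name would still collide, and re-uploading a file under the same name overwrites its own earlier chunks, which may or may not be the behaviour you want.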

lonelycode avatar May 02 '23 02:05 lonelycode