
Prevent clashing chunk IDs

Open · Harin329 opened this issue · 5 comments

This PR solves the issue raised in #64.

With a large enough dataset, clashing IDs occur relatively often. We can solve this quickly by adding a random bit into the hash function. Alternatively, we could explore using a set for ids instead of a list, but that might silently drop data (i.e. what if you're trying to use the bot to make up its own language syntax and communicate via only 10 words?).
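To make the clash concrete, here is a minimal sketch (the `url` value is illustrative) showing that hashing only chunk content plus source produces the same id every time for identical chunks, which is exactly where the duplicates come from:

```python
import hashlib

url = "https://example.com/doc"  # hypothetical source URL
chunk = 'Leo: "Hey!"'

# Two occurrences of the same chunk from the same source hash identically.
id_a = hashlib.sha256((chunk + url).encode()).hexdigest()
id_b = hashlib.sha256((chunk + url).encode()).hexdigest()
print(id_a == id_b)  # True: the ids clash
```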

Harin329 avatar Jul 06 '23 15:07 Harin329

I haven't looked into this, but can't we handle the error and retry with the random bit? Otherwise it undermines the deduplication function.

cachho avatar Jul 06 '23 15:07 cachho

> I haven't looked into this, but can't we handle the error and retry with the random bit? Otherwise it undermines the deduplication function.

I think in that case you'd need to rebuild the id array. But you're right, there is a deduplication function, so maybe it's better to just use a set from the beginning; then we won't need the deduplication step here. It just doesn't feel right to get rid of data.

Harin329 avatar Jul 06 '23 15:07 Harin329

The existing_ids = set(existing_docs["ids"]) call looks like it's useless; if there ever actually is a duplicate id, the Chroma DB get will fail.

We need to be sure that ids = embeddings_data["ids"] is already unique, and also make sure it stays in sync with documents and metadatas. So in the base chunker, we can do something like this:

import hashlib
import random

# Hash the chunk together with its source URL to get a stable id.
chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
# On a clash, salt the hash with 128 random bits to force a unique id.
if chunk_id in ids:
    chunk_id = hashlib.sha256((chunk + url + str(random.getrandbits(128))).encode()).hexdigest()
ids.append(chunk_id)
documents.append(chunk)
metadatas.append(meta_data)

But the chunk_id in ids check is a linear scan over a list and will hurt performance. So my question is: why not include a random bit in every chunk's hash to keep them unique?
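One way to keep the clash check without the linear scan is to track seen ids in a set, where membership tests are O(1). A minimal sketch (the helper name `make_chunk_id` is illustrative, not from the codebase):

```python
import hashlib
import random

def make_chunk_id(chunk: str, url: str, seen: set) -> str:
    """Hash the chunk; on a clash, retry with a random 128-bit salt.
    `seen` is a set, so the membership test is O(1) rather than the
    O(n) scan a plain list would require."""
    chunk_id = hashlib.sha256((chunk + url).encode()).hexdigest()
    if chunk_id in seen:
        chunk_id = hashlib.sha256(
            (chunk + url + str(random.getrandbits(128))).encode()
        ).hexdigest()
    seen.add(chunk_id)
    return chunk_id

seen = set()
a = make_chunk_id("hello", "u", seen)
b = make_chunk_id("hello", "u", seen)  # duplicate chunk -> salted id
print(a != b, len(seen))  # True 2
```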

Harin329 avatar Jul 06 '23 16:07 Harin329

Also running into this problem when trying to embed a large dataset. I think the random id would fix the error, but won't it affect retrieval? Two identical chunks should not both be embedded; the random id circumvents the root problem rather than solving it. Maybe we could check whether the id already exists and, if it does, skip to the next chunk? I'm happy to contribute, though I'm new to the project and this tech.

TrentConley avatar Jul 06 '23 17:07 TrentConley

I opened PR #165, which solves the issue.

TrentConley avatar Jul 06 '23 17:07 TrentConley

I don't think we should be ignoring the chunks that are duplicated. Ultimately, they get stored in the db and queried with 1 result:

result = self.collection.query(
    query_texts=[input_query],
    n_results=1,
)

but there's an open PR that is looking to change that to an arbitrary n_results.

It's a question of prompt engineering as that's what these chunks ultimately become. To me there's no harm in having more data, especially if it's present naturally in the dataset.

If your prompt was: "Given the following context, reply to me: Leo: "Hey!" Leo: "Hey!" Leo: "Hey!" "

Due to the repeated messages, I'd expect my bot to ultimately behave differently than with a prompt like: "Given the following context, reply to me: Leo: "Hey!" "

This would not be captured if we drop that data, since Chroma DB will only ever return the single stored copy of that context, even when n_results is set high. If you think my reasoning is flawed, please let me know :)

Harin329 avatar Jul 06 '23 18:07 Harin329

I'd like to distinguish conversation history from database context here. In a conversation, yes, you would want to keep repeats. However, in a vector db that stores things for you to look up, you'd ideally want distinct chunks of information so the LLM can pick whichever best provides context for the question. Identical chunks would take up tokens and could make you miss out on important context that gets edged out by the repeated chunks.

So, when chunking to store in vector db, you want to ignore repeats.

Also, I think the issue here was about chunking large files, like a book, to embed in Chroma DB? Where do conversations come in?

Feel free to poke holes in my reasoning, or if I mischaracterized what you're saying :)

TrentConley avatar Jul 06 '23 18:07 TrentConley

I'm not referring to a conversation; I'm referring to natural repetitions within a dataset. Repeated chunks in the dataset can suggest to the model that a piece of information is more relevant. The repeated Leo: "Hey!" phrasing would have been present, and highly repeated, in the original dataset.

Harin329 avatar Jul 06 '23 19:07 Harin329

Gotcha. How would a model discern that identical chunks are distinct and thus infer relevance? In practice, I'd imagine the chunks would be larger than small repeated phrases, and those repeated phrases would be contained within a chunk that the LLM could then use.

TrentConley avatar Jul 06 '23 19:07 TrentConley

Think of it this way. Here's my sample dataset, it's a PDF of a conversation:

Me: "I have something important to tell you!"
Leo: "Tell me!"
Leo: "Hello??"
Leo: "Hello??"
Leo: "Hello??"
Me: "Hi, sorry I was busy."

Assume for simplicity we end up chunking each line. With the salt, we'll end up with 6 chunks. With an in-array check, we end up with 4 chunks.
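The two outcomes can be sketched directly; a minimal simulation of the sample conversation above (the `url` value is illustrative):

```python
import hashlib
import random

lines = [
    'Me: "I have something important to tell you!"',
    'Leo: "Tell me!"',
    'Leo: "Hello??"',
    'Leo: "Hello??"',
    'Leo: "Hello??"',
    'Me: "Hi, sorry I was busy."',
]
url = "conversation.pdf"  # hypothetical source

# In-array check: skip chunks whose id already exists -> 4 chunks survive.
seen, deduped = set(), []
for chunk in lines:
    cid = hashlib.sha256((chunk + url).encode()).hexdigest()
    if cid in seen:
        continue
    seen.add(cid)
    deduped.append(chunk)

# Salting every id: all 6 chunks survive with distinct ids.
salted_ids = [
    hashlib.sha256((chunk + url + str(random.getrandbits(128))).encode()).hexdigest()
    for chunk in lines
]

print(len(deduped), len(salted_ids))  # 4 6
```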

These get stored in the db. Now I come in with an input query to my bot: "I have something important to tell you!" If we used the array check, the db will only know about a single Leo: "Hello??" and the prompt context will not convey that the bot can be repetitive in nature.

The prompt context (with n=4) will look like this:

Me: "I have something important to tell you!"
Leo: "Tell me!"
Leo: "Hello??"
Me: "Hi, sorry I was busy."

instead of

Me: "I have something important to tell you!"
Leo: "Tell me!"
Leo: "Hello??"
Leo: "Hello??"

The inference model will likely react differently in the two scenarios. And if for any reason you're trying to capture repetitive behavior, it will fail to do so, since repetitions simply won't exist in the Chroma context db. At the end of the day, the whole purpose of this chunking is to help create the prompt, in which case I think we should keep as much of the data in its original format as possible from the dataset provided.

Harin329 avatar Jul 06 '23 19:07 Harin329

> Gotcha. How would a model discern that identical chunks are distinct and thus infer relevance? In practice, I'd imagine the chunks would be larger than small repeated phrases, and those repeated phrases would be contained within a chunk that the LLM could then use.

From what it looks like, Chroma DB just queries given an input:

result = self.collection.query(
    query_texts=[input_query],
    n_results=1,
)

I imagine this means it will return the n_results most similar chunks in the embedding space, which should include repetitions if they are a characteristic of the original dataset. The model doesn't need to discern uniqueness in the dataset; the dataset should provide the most relevant embeddings as prompt context for the model. But yes, you're probably right that we're oversimplifying what the chunks will look like. I think the idea still stands, though: since the purpose is prompt creation, repeated chunks will create repeated context, which could be intended by the user.

Harin329 avatar Jul 06 '23 19:07 Harin329

I understand your concern that in some cases you'd want to retain the identical chunks. Perhaps an ideal solution would be to create an option for the user to embed identical chunks, using your method of adding a random bit.
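Such an option could look like the sketch below. Note that `chunk_ids` and the `allow_duplicates` flag are hypothetical names for illustration, not an existing option in the project:

```python
import hashlib
import random

def chunk_ids(chunks, url, allow_duplicates=False):
    """Hypothetical sketch of a user-facing option:
    allow_duplicates=True  -> salt repeated chunks so every one is embedded,
    allow_duplicates=False -> skip chunks whose id was already seen."""
    ids, seen = [], set()
    for chunk in chunks:
        cid = hashlib.sha256((chunk + url).encode()).hexdigest()
        if cid in seen:
            if not allow_duplicates:
                continue  # drop the repeat entirely
            cid = hashlib.sha256(
                (chunk + url + str(random.getrandbits(128))).encode()
            ).hexdigest()
        seen.add(cid)
        ids.append(cid)
    return ids

chunks = ["a", "b", "a"]
print(len(chunk_ids(chunks, "u")))        # 2: the repeat is skipped
print(len(chunk_ids(chunks, "u", True)))  # 3: the repeat gets a salted id
```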

TrentConley avatar Jul 06 '23 20:07 TrentConley

@cachho mentioned a good point about the "I just dump new stuff in there and click my train button" workflow. With the random bits, it would re-embed the whole database every time, right? So it's not just inserting what's missing. In that case, random bits would cause issues.

Made modifications to make the duplicate-ID check more time efficient, and also added a type cast without which that method would otherwise fail.

Harin329 avatar Jul 06 '23 20:07 Harin329

Removed the list-metadata fix that this change exposed, as that's solved in #110.

Harin329 avatar Jul 07 '23 15:07 Harin329

This isn't as scary as it looks. @taranjeet, green light from me if they fix the last complaint I posted. Thanks.

cachho avatar Jul 07 '23 16:07 cachho

Tried your PR. It prevents inserting a chunk twice, which is a good way to handle the error. The only concern I have is that a user may want to refer to some chunks above and some chunks below a particular chunk.

In that case, if we insert a chunk only once, the pre/post context for all the other occurrences of the chunk will be missing. To support this we will have to introduce a concept of a sequence id, so that we can get, for example, 3 chunks above the chunk and 3 below. We will also have to figure out how to insert repeating chunks into the db; right now we skip repeats and insert only one occurrence. Merging this PR now, but opening an issue to track this request.
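The sequence-id idea could be sketched like this, independent of any vector store. The `neighbors` helper and the `(sequence_id, chunk)` layout are hypothetical illustrations, not existing code:

```python
def neighbors(store, seq, window=3):
    """store: list of (sequence_id, chunk) pairs in document order.
    Return the chunks within `window` positions of `seq`, so a repeated
    chunk's surrounding context can still be reconstructed."""
    return [chunk for s, chunk in store if abs(s - seq) <= window and s != seq]

store = [(i, f"chunk-{i}") for i in range(10)]
print(neighbors(store, 5))
# ['chunk-2', 'chunk-3', 'chunk-4', 'chunk-6', 'chunk-7', 'chunk-8']
```

In a real store, the sequence id would live in each chunk's metadata so the neighbor lookup can be done with a metadata filter at query time.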

taranjeet avatar Jul 08 '23 04:07 taranjeet

@Harin329 @cachho: good work on this. Merged and closed.

taranjeet avatar Jul 08 '23 05:07 taranjeet