Hao (Harin) Wu

10 comments by Hao (Harin) Wu

I opened a PR for this; setting a random bit in the ID hashing function seems like a harmless and quick fix. Not entirely sure of the implications of turning...
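The fix described above could look roughly like this. This is a sketch, not embedchain's actual code: the function name `hash_id`, the SHA-256 choice, and the 4-byte salt length are all assumptions.

```python
import hashlib
import secrets

def hash_id(content: str, randomize: bool = False) -> str:
    """Derive a document ID from a content hash (hypothetical sketch).

    When randomize is True, a random suffix is appended so that
    otherwise-identical chunks get distinct IDs -- which avoids ID
    collisions at insert time, but undermines content-based dedup.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if randomize:
        # Random bits make the ID unique even for duplicate content.
        digest += "-" + secrets.token_hex(4)
    return digest
```

The trade-off debated in the thread follows directly: with `randomize=False`, identical content always maps to the same ID and is deduplicated; with `randomize=True`, every insert succeeds but duplicates are no longer detectable by ID.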

Improved the runtime by using a map: https://github.com/embedchain/embedchain/pull/160. Also, as is, this will cause `if not data_dict:` to never evaluate to true, exposing a bug with the zip over data_dict....
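The runtime improvement amounts to replacing a linear scan with a dict (map) lookup. A minimal sketch under assumed names (`filter_new_chunks`, the `ids`/`documents`/`metadatas` triple) rather than the PR's actual code:

```python
def filter_new_chunks(ids, documents, metadatas, existing_ids):
    """Keep only chunks whose IDs are not already in the store.

    Building data_dict via a dict comprehension gives O(1) membership
    checks against existing_ids, versus a quadratic list-based scan.
    """
    data_dict = {
        id_: (doc, meta)
        for id_, doc, meta in zip(ids, documents, metadatas)
        if id_ not in existing_ids
    }
    if not data_dict:
        # Every chunk was a duplicate; without this guard, the
        # zip(*...) below would raise on an empty dict -- the bug
        # exposed when `if not data_dict:` never fires.
        return [], [], []
    new_ids = list(data_dict.keys())
    docs, metas = zip(*data_dict.values())
    return new_ids, list(docs), list(metas)
```

Note how the all-duplicates case is exactly where the `if not data_dict:` guard matters: skip it and `zip(*data_dict.values())` blows up on empty input.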

> Haven't looked into this, can't we handle the error and retry with the random bit? otherwise it undermines the deduplication function. I think if that's the case, you'll need...

The `existing_ids = set(existing_docs["ids"])` call looks useless: if there ever actually is a common ID, the Chroma DB get will fail. We need to be sure that `ids...
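For the `existing_ids` check to be meaningful, the duplicate filtering has to happen before the candidate IDs are handed off for insertion. A hedged sketch: `collection.get(ids=...)` returning a dict with an `"ids"` key follows chromadb's collection API, but the helper itself (`get_new_ids`) is assumed, not embedchain's code:

```python
def get_new_ids(collection, candidate_ids):
    """Return only the candidate IDs not already in the collection.

    collection.get(ids=...) returns the subset of requested IDs that
    exist; anything left over is genuinely new and safe to insert.
    """
    existing_docs = collection.get(ids=candidate_ids)
    existing_ids = set(existing_docs["ids"])
    return [i for i in candidate_ids if i not in existing_ids]
```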

I don't think we should be ignoring the chunks that are duplicated. Ultimately, they get stored in the db and queried with 1 result:

```
result = self.collection.query(
    query_texts=[input_query,],
    n_results=1,...
```

I'm not referring to a conversation, I'm referring to natural repetitions within a dataset. Having repeated chunks in the dataset can suggest to the model that a piece of information...

Think of it this way. Here's my sample dataset; it's a PDF of a conversation:

```
Me: "I have something important to tell you!"
Leo: "Tell me!"
Leo: "Hello??"
Leo:...
```
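Concretely, exact-content deduplication would silently collapse any genuinely repeated line in a dataset like that. A toy sketch (the chunk list and per-message chunking are assumptions for illustration, not embedchain behavior):

```python
# Toy conversation chunks, one per message (assumed chunking).
chunks = [
    'Me: "I have something important to tell you!"',
    'Leo: "Tell me!"',
    'Leo: "Hello??"',
    'Leo: "Hello??"',  # a genuine repetition in the source data
]

# Dedup by exact content: the second "Hello??" disappears.
seen = set()
deduped = []
for chunk in chunks:
    if chunk not in seen:
        seen.add(chunk)
        deduped.append(chunk)
```

After deduplication the insistent follow-up is gone, so retrieval can no longer see that the message was repeated, which is the repetition signal being argued over above.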

> Gotcha. How would a model be able to discern that the chunks are distinct if they are identical and thus infer relevance? In actuality, I'd imagine the chunks would...

@cachho made a good point: `i just dump new stuff in there and click my train button. With the random bits, it would train the whole database every time,...

Removed the list-metadata issue fix exposed by this change, as that's solved in #110