Hao (Harin) Wu

10 comments by Hao (Harin) Wu

I opened a PR for this; setting a random bit in the ID hashing function seems like a harmless and quick fix. Not entirely sure of the implications of turning...
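The fix described above could look roughly like this. This is a sketch, not embedchain's actual code: the function name `hash_id`, the SHA-256 choice, and the 4-byte salt length are all assumptions.

```python
import hashlib
import secrets

def hash_id(content: str, randomize: bool = False) -> str:
    """Derive a document ID from a content hash (hypothetical sketch).

    When randomize is True, a random suffix is appended so that
    otherwise-identical chunks get distinct IDs -- which avoids ID
    collisions at insert time, but undermines content-based dedup.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if randomize:
        # Random bits make the ID unique even for duplicate content.
        digest += "-" + secrets.token_hex(4)
    return digest
```

The trade-off debated in the thread follows directly: with `randomize=False`, identical content always maps to the same ID and is deduplicated; with `randomize=True`, every insert succeeds but duplicates are no longer detectable by ID.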

Improved the runtime by using a map: https://github.com/embedchain/embedchain/pull/160. Also, as is, this will cause `if not data_dict:` to never evaluate to true, exposing a bug with the zip over data_dict....
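The runtime improvement amounts to replacing a linear scan with a dict (map) lookup. A minimal sketch under assumed names (`filter_new_chunks`, the `ids`/`documents`/`metadatas` triple) rather than the PR's actual code:

```python
def filter_new_chunks(ids, documents, metadatas, existing_ids):
    """Keep only chunks whose IDs are not already in the store.

    Building data_dict via a dict comprehension gives O(1) membership
    checks against existing_ids, versus a quadratic list-based scan.
    """
    data_dict = {
        id_: (doc, meta)
        for id_, doc, meta in zip(ids, documents, metadatas)
        if id_ not in existing_ids
    }
    if not data_dict:
        # Every chunk was a duplicate; without this guard, the
        # zip(*...) below would raise on an empty dict -- the bug
        # exposed when `if not data_dict:` never fires.
        return [], [], []
    new_ids = list(data_dict.keys())
    docs, metas = zip(*data_dict.values())
    return new_ids, list(docs), list(metas)
```

Note how the all-duplicates case is exactly where the `if not data_dict:` guard matters: skip it and `zip(*data_dict.values())` blows up on empty input.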

> Haven't looked into this, can't we handle the error and retry with the random bit? otherwise it undermines the deduplication function. I think if that's the case, you'll need...

The `existing_ids = set(existing_docs["ids"])` call looks useless: if there ever actually is a common ID, the Chroma DB get will fail. We need to be sure that `ids...
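For the `existing_ids` check to be meaningful, the duplicate filtering has to happen before the candidate IDs are handed off for insertion. A hedged sketch: `collection.get(ids=...)` returning a dict with an `"ids"` key follows chromadb's collection API, but the helper itself (`get_new_ids`) is assumed, not embedchain's code:

```python
def get_new_ids(collection, candidate_ids):
    """Return only the candidate IDs not already in the collection.

    collection.get(ids=...) returns the subset of requested IDs that
    exist; anything left over is genuinely new and safe to insert.
    """
    existing_docs = collection.get(ids=candidate_ids)
    existing_ids = set(existing_docs["ids"])
    return [i for i in candidate_ids if i not in existing_ids]
```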

I don't think we should be ignoring the chunks that are duplicated. Ultimately, they get stored in the db and queried with 1 result:

```
result = self.collection.query(
    query_texts=[input_query,],
    n_results=1,...
```

I'm not referring to a conversation, I'm referring to natural repetitions within a dataset. Having repeated chunks in the dataset can suggest to the model that a piece of information...

Think of it this way. Here's my sample dataset; it's a PDF of a conversation:

```
Me: "I have something important to tell you!"
Leo: "Tell me!"
Leo: "Hello??"
Leo:...
```
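Concretely, exact-content deduplication would silently collapse any genuinely repeated line in a dataset like that. A toy sketch (the chunk list and per-message chunking are assumptions for illustration, not embedchain behavior):

```python
# Toy conversation chunks, one per message (assumed chunking).
chunks = [
    'Me: "I have something important to tell you!"',
    'Leo: "Tell me!"',
    'Leo: "Hello??"',
    'Leo: "Hello??"',  # a genuine repetition in the source data
]

# Dedup by exact content: the second "Hello??" disappears.
seen = set()
deduped = []
for chunk in chunks:
    if chunk not in seen:
        seen.add(chunk)
        deduped.append(chunk)
```

After deduplication the insistent follow-up is gone, so retrieval can no longer see that the message was repeated, which is the repetition signal being argued over above.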

> Gotcha. How would a model be able to discern that the chunks are distinct if they are identical and thus infer relevance? In actuality, I'd imagine the chunks would...

@cachho made a good point: `i just dump new stuff in there and click my train button. With the random bits, it would train the whole database every time,...

Removed the list-metadata issue fix exposed by this change, as that's solved in #110