lshensemble
Can we add new domains into existing LSH indexers?
Hi, I've read the original great paper :thumbsup: and the repo's readme.md. Now I have a question: Can I add new domain records into an existing indexer?
For example if I create an indexer with 1 billion records using
index_eqd, err := lshensemble.BootstrapLshEnsembleEquiDepth(numPart, numHash, maxK,
len(domainRecords), lshensemble.Recs2Chan(domainRecords))
After the index is created, I receive 1 million new records. Can I add them to the existing index_eqd, or can I only create a new indexer with all 1 billion + 1 million records?
I am facing a similar issue. With the MinHash LSH one can query and then add more hashes like this:
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("m2", m2)
result = lsh.query(m1)
#Pickle lsh
#Unpickle lsh later
lsh.insert("m3", m3) #I can add more MinHash(es) later and then query
result = lsh.query(m1)
But with MinHash LSH Ensemble, you can only run .index() once, as explained in the code.
I have a setup where I want to:
- Create an LSHEnsemble with the data I have; call it LSH.
- Query for duplicates with LSH.
- Add more MinHash(es) to LSH, giving me LSHnew.
- Query with LSHnew.
How can I do this please? @ekzhu
You will need to create another index for your new records. The created index itself is frozen and can't be updated.
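One way to work with several frozen indexes is to query each of them and take the union of the results. The sketch below only illustrates that pattern: `Querier`, `frozenIndex`, `buildFrozen`, and `QueryAll` are hypothetical names, and the toy index does an exact value lookup rather than LSH, so it is not this library's API. With the real library you would presumably wrap each bootstrapped index behind a similar interface.

```go
package main

import (
	"fmt"
	"sort"
)

// Querier is a hypothetical stand-in for a frozen index:
// it can be queried but never updated after construction.
type Querier interface {
	Query(value string) []string
}

// frozenIndex is a toy exact-match index (NOT LSH), used only
// so that this sketch runs without external dependencies.
type frozenIndex struct {
	byValue map[string][]string // value -> keys of domains containing it
}

// buildFrozen builds the toy index once from domain records (key -> values).
func buildFrozen(records map[string][]string) frozenIndex {
	inv := make(map[string][]string)
	for key, values := range records {
		for _, v := range values {
			inv[v] = append(inv[v], key)
		}
	}
	return frozenIndex{byValue: inv}
}

// Query returns the keys of all domains that contain value.
func (f frozenIndex) Query(value string) []string {
	return f.byValue[value]
}

// QueryAll queries every index and returns the sorted,
// de-duplicated union of the matching domain keys.
func QueryAll(indexes []Querier, value string) []string {
	seen := make(map[string]struct{})
	for _, idx := range indexes {
		for _, key := range idx.Query(value) {
			seen[key] = struct{}{}
		}
	}
	out := make([]string, 0, len(seen))
	for key := range seen {
		out = append(out, key)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Index built from the original records.
	original := buildFrozen(map[string][]string{
		"d1": {"a", "b"},
		"d2": {"b", "c"},
	})
	// Separate index built later from the new records only.
	added := buildFrozen(map[string][]string{
		"d3": {"b"},
		"d4": {"z"},
	})
	fmt.Println(QueryAll([]Querier{original, added}, "b")) // prints [d1 d2 d3]
}
```

The trade-off is that every query touches all indexes, so after many small batches it may be worth rebuilding a single index from all records.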
The Python code snippet is from the datasketch library. This Go library has no update option.