Can we add new domains into existing LSH indexers?

Open QthCN opened this issue 4 years ago • 3 comments

Hi, I've read the original great paper 👍 and the repo's readme.md. Now I have a question: Can I add new domain records into an existing indexer?

For example, suppose I create an indexer with 1 billion records using:

index_eqd, err := lshensemble.BootstrapLshEnsembleEquiDepth(numPart, numHash, maxK, 
    len(domainRecords), lshensemble.Recs2Chan(domainRecords))

After creating it, I receive another 1 million records. Can I add them to the existing index_eqd, or do I have to create a new indexer over all 1 billion + 1 million records?

QthCN avatar Jan 15 '21 10:01 QthCN

I am facing a similar issue. With the MinHash LSH one can query and then add more hashes like this:

from datasketch import MinHashLSH

# m1, m2 and m3 are MinHash objects created with num_perm=128
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("m2", m2)
result = lsh.query(m1)
# Pickle lsh
# Unpickle lsh later
lsh.insert("m3", m3)  # I can add more MinHash(es) later and then query
result = lsh.query(m1)

But with MinHash LSH Ensemble, you can only run .index() once as explained in the code.

I have a setup where I want to:

  1. Create an LSHEnsemble with the data I have (call it LSH).
  2. Query for duplicates with LSH.
  3. Add more MinHash(es) to LSH, giving me LSHnew.
  4. Query with LSHnew.

How can I do this please? @ekzhu

chrisemezue avatar Jan 14 '22 14:01 chrisemezue

You will need to create another index for your new records. The created index itself is frozen and can't be updated.

ekzhu avatar Jan 24 '22 20:01 ekzhu

The code snippet is from the datasketch Python library. For this Go library, there isn't an update option.

ekzhu avatar Jan 24 '22 20:01 ekzhu
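
For readers landing here, the "create another index for your new records" advice can be sketched as follows. This is only an illustration in plain Python: `FrozenIndex` and `IndexGroup` are hypothetical names, not part of lshensemble or datasketch, and the stand-in index does an exact containment scan rather than LSH so that the example runs without any library. The point is the pattern: each batch of records gets its own build-once index, and a query fans out to every index and unions the results.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Record = Tuple[str, Iterable[str]]


class FrozenIndex:
    """Hypothetical stand-in for a build-once index (not a real library class).

    It scans every record exactly instead of using LSH; only the
    build-once behaviour matters for this sketch.
    """

    def __init__(self, records: Iterable[Record]) -> None:
        # Built once from all records; there is deliberately no insert().
        self._records: Dict[str, FrozenSet[str]] = {
            key: frozenset(values) for key, values in records
        }

    def query(self, values: Iterable[str], threshold: float) -> Set[str]:
        """Return keys whose set contains at least `threshold` of the query."""
        q = frozenset(values)
        if not q:
            return set()
        return {
            key
            for key, domain in self._records.items()
            if len(q & domain) / len(q) >= threshold
        }


class IndexGroup:
    """Queries several frozen indexes and unions their results."""

    def __init__(self) -> None:
        self._indexes: List[FrozenIndex] = []

    def add_batch(self, records: Iterable[Record]) -> None:
        # New records never touch an existing index; they get their own.
        self._indexes.append(FrozenIndex(records))

    def query(self, values: Iterable[str], threshold: float) -> Set[str]:
        found: Set[str] = set()
        for index in self._indexes:
            found |= index.query(values, threshold)
        return found


group = IndexGroup()
group.add_batch([("a", ["x", "y", "z"]), ("b", ["x", "q"])])
before = group.query(["x", "y"], threshold=1.0)

# Later, new records arrive: build a second index instead of updating the first.
group.add_batch([("c", ["x", "y", "w"])])
after = group.query(["x", "y"], threshold=1.0)
```

This assumes keys are unique across batches. Every extra index adds one more lookup per query, so if batches keep arriving it may be worth rebuilding a single index over all records from time to time.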