
Inserting millions of records into a MinHash LSH index over Redis

variux opened this issue 4 years ago • 5 comments

Hi, I've been working on inserting millions of records (17,821,775, to be specific). The problem is that the average insertion rate is 162.8 records per second, so a full insertion takes 17 to 34 hours. Is there any way to speed up this insertion, or any recommendation?

[EDIT]

Using an insertion session is slower than calling insert directly on the MinHashLSH object.

variux avatar Apr 19 '20 17:04 variux
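For context, here is a minimal sketch of the two insertion paths being compared, assuming a local Redis on the default port and hypothetical record data:

```python
from datasketch import MinHash, MinHashLSH

# Redis-backed LSH index; host/port are assumptions for a local setup.
lsh = MinHashLSH(
    threshold=0.5,
    num_perm=64,
    storage_config={"type": "redis", "redis": {"host": "localhost", "port": 6379}},
)

# Hypothetical record: a key and its byte-encoded tokens.
key, tokens = "doc1", [b"hello", b"world"]
m = MinHash(num_perm=64)
for t in tokens:
    m.update(t)

# Path 1: insert directly on the MinHashLSH object.
lsh.insert(key, m)

# Path 2: buffer inserts in a session, flushed when the context exits.
with lsh.insertion_session() as session:
    session.insert("doc2", m)
```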

Thanks for the issue. I am not an expert in Redis. Is there any way to build the index in parallel, in a batch-only mode with transactions turned off?

ekzhu avatar Apr 20 '20 06:04 ekzhu

I checked, and the insertion itself is not actually that slow; what takes time is generating and updating the MinHash objects: 17 hours for 32 permutations, 34 hours for 64 permutations, 68 hours for 128 permutations...

Is there any way to improve the performance of MinHash generation?

variux avatar Apr 20 '20 17:04 variux
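As a side note on the generation cost: recent datasketch releases add batched update paths that cut the per-token Python overhead. A minimal sketch, with hypothetical sample records:

```python
from datasketch import MinHash

# Hypothetical records, each a list of byte-encoded tokens.
records = [
    [b"hello", b"world"],
    [b"foo", b"bar", b"baz"],
]

# update_batch hashes all tokens in one call instead of a
# Python-level loop of per-token update() calls.
m = MinHash(num_perm=64)
m.update_batch(records[0])

# MinHash.bulk builds many sketches at once, amortizing setup
# cost across records.
minhashes = MinHash.bulk(records, num_perm=64)
```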

MinHash generation is CPU-bound, so I think distributing the tasks to parallel workers is a good strategy. I use Celery for this.

ekzhu avatar Apr 20 '20 18:04 ekzhu
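A minimal sketch of a Celery worker task for this, assuming a hypothetical broker URL and task name (any Celery-supported broker would do):

```python
from celery import Celery
from datasketch import MinHash

# Hypothetical broker URL; adjust to your deployment.
app = Celery("minhash_tasks", broker="redis://localhost:6379/1")

@app.task
def build_minhash(key, tokens, num_perm=64):
    """Compute a MinHash for one record; tokens is a list of strings."""
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    # Return plain ints so the result serializes; the consumer can
    # rebuild the sketch with MinHash(num_perm=num_perm, hashvalues=...).
    return key, [int(v) for v in m.hashvalues]
```

This keeps the workers purely CPU-bound; a single consumer can drain the results and perform the Redis inserts sequentially.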

After some research I determined that generating the hashes doesn't take much time, but the Redis insertion does.

Any ideas for improving the performance? More permutations mean more time spent inserting them. Could we get better performance using Cassandra or MongoDB?

variux avatar Apr 21 '20 03:04 variux

How about having parallel workers insert MinHashes into separate Redis indexes, and then merging the .rdb files into one using redis-rdb-tools?

For a Celery + MinHash example, maybe you can refer to the findopendata project (https://github.com/findopendata/findopendata/blob/master/findopendata/indexing.py).

ekzhu avatar Apr 21 '20 06:04 ekzhu
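A rough sketch of that suggestion, assuming one Redis instance per worker on consecutive ports and a shared basename so the bucket keys line up across instances (NUM_WORKERS, index_shard, and the sample records are hypothetical):

```python
from multiprocessing import Pool
from datasketch import MinHash, MinHashLSH

NUM_WORKERS = 4

def index_shard(args):
    worker_id, shard = args  # shard: list of (key, [token, ...]) pairs
    # One Redis instance per worker (ports 6379, 6380, ...) so writers
    # never contend; a fixed basename keeps the bucket keys identical
    # across instances so the dumps can later be merged into one index.
    lsh = MinHashLSH(
        threshold=0.5,
        num_perm=64,
        storage_config={
            "type": "redis",
            "basename": b"lsh_index",
            "redis": {"host": "localhost", "port": 6379 + worker_id},
        },
    )
    with lsh.insertion_session() as session:
        for key, tokens in shard:
            m = MinHash(num_perm=64)
            for t in tokens:
                m.update(t.encode("utf8"))
            session.insert(key, m)

if __name__ == "__main__":
    # Hypothetical sharding of the input records across workers.
    records = [("doc1", ["a", "b"]), ("doc2", ["b", "c"]), ("doc3", ["c", "d"])]
    shards = [(i, records[i::NUM_WORKERS]) for i in range(NUM_WORKERS)]
    with Pool(NUM_WORKERS) as pool:
        pool.map(index_shard, shards)
```

Once the workers finish, redis-rdb-tools' documented `rdb --command protocol dump.rdb | redis-cli --pipe` pattern can replay each worker's dump into a single final instance.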