Inserting millions of records into a MinHash LSH index over Redis
Hi, I've been working on inserting millions of records, 17,821,775 to be specific. The problem is that the average insertion rate is 162.8 records per second, so a full insertion takes from 17 to 34 hours. Is there any way to speed up this insertion, or any recommendation?
[EDIT]
Using an insertion session is slower than calling insert directly on the MinHashLSH object.
Thanks for the issue. I am not an expert in Redis. Is there any way to build the index in parallel in a batch-only mode with transactions turned off?
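For context, datasketch's `insertion_session` is its batched-insert path: it buffers inserts and flushes them to Redis in groups. A minimal sketch, assuming a local Redis instance and a datasketch version whose `insertion_session` accepts a `buffer_size` argument (the records here are made-up placeholders):

```python
from datasketch import MinHash, MinHashLSH

# Placeholder records: (key, tokens) pairs standing in for the real dataset.
records = [
    ("doc1", ["hello", "world"]),
    ("doc2", ["hello", "there"]),
]

lsh = MinHashLSH(
    threshold=0.5,
    num_perm=64,
    storage_config={
        "type": "redis",
        "redis": {"host": "localhost", "port": 6379},
    },
)

# The session buffers inserts and writes them to Redis in batches,
# cutting network round trips; a larger buffer_size means fewer flushes.
with lsh.insertion_session(buffer_size=20000) as session:
    for key, tokens in records:
        m = MinHash(num_perm=64)
        for t in tokens:
            m.update(t.encode("utf8"))
        session.insert(key, m)
```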
I checked, and the insertion itself is not actually that slow; what takes time is generating the MinHash objects and updating them: 17 hours for 32 permutations, 34 hours for 64 permutations, 68 hours for 128 permutations...
Is there any way to improve the performance of MinHash generation?
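If MinHash generation is the bottleneck, newer datasketch releases ship vectorized helpers that hash many tokens per call instead of one `update()` per token. A minimal sketch, assuming your installed version already has `MinHash.update_batch` and `MinHash.bulk` (they are relatively recent additions, so check first):

```python
from datasketch import MinHash

# Toy token lists standing in for real records.
docs = [
    [b"deep", b"learning", b"models"],
    [b"gradient", b"boosted", b"trees"],
]

# update_batch hashes all tokens of one record in a single vectorized
# call instead of one Python-level update() per token.
m = MinHash(num_perm=64)
m.update_batch(docs[0])

# MinHash.bulk builds many signatures at once, amortizing the per-object
# setup cost across the whole batch.
minhashes = MinHash.bulk(docs, num_perm=64)
```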
MinHash generation is CPU-bound, so I think distributing the tasks to parallel workers is a good strategy. I use Celery for this.
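A rough sketch of what that can look like; the module name, broker URL, and task are hypothetical placeholders for illustration:

```python
from celery import Celery
from datasketch import MinHash

# Hypothetical broker URL; point this at your own Redis or RabbitMQ broker.
app = Celery("minhash_workers", broker="redis://localhost:6379/1")

@app.task
def minhash_signature(key, tokens, num_perm=64):
    """Compute one record's MinHash signature on a worker process."""
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    # Return plain lists so the result is JSON-serializable; the caller
    # can rebuild the object with MinHash(num_perm=..., hashvalues=...).
    return key, m.hashvalues.tolist()
```

Workers then fan out via `minhash_signature.delay(key, tokens)`, and a single consumer can rebuild each signature with `MinHash(num_perm=64, hashvalues=values)` and do the LSH inserts.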
After some research I determined that generating the hashes doesn't take much time, but the Redis insertion does.
Any ideas for improving the performance? More permutations mean more time spent inserting them. Could we get better performance using Cassandra or MongoDB?
How about having parallel workers insert MinHashes into separate Redis indexes, and then merging the .rdb files into one using redis-rdb-tools (rough sketch below)?
For a Celery + MinHash example, maybe you can refer to the findopendata project (https://github.com/findopendata/findopendata/blob/master/findopendata/indexing.py).
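A rough sketch of the partition-building half, assuming one Redis instance per worker (hypothetical ports 6379 and 6380) and a shared `basename` so the key layouts line up; whether the merged .rdb yields a directly queryable index depends on datasketch's internal key scheme, so treat the merge step as untested:

```python
from multiprocessing import Pool
from datasketch import MinHash, MinHashLSH

NUM_PERM = 64

def build_partition(args):
    """Build one LSH partition in its own dedicated Redis instance."""
    port, records = args
    lsh = MinHashLSH(
        threshold=0.5,
        num_perm=NUM_PERM,
        storage_config={
            "type": "redis",
            "basename": b"lsh_index",  # same basename in every partition
            "redis": {"host": "localhost", "port": port},
        },
    )
    with lsh.insertion_session(buffer_size=10000) as session:
        for key, tokens in records:
            m = MinHash(num_perm=NUM_PERM)
            for t in tokens:
                m.update(t.encode("utf8"))
            session.insert(key, m)

if __name__ == "__main__":
    # Hypothetical split: two Redis instances, each taking half the records.
    partitions = [
        (6379, [("doc1", ["alpha", "beta"])]),
        (6380, [("doc2", ["gamma", "delta"])]),
    ]
    with Pool(len(partitions)) as pool:
        pool.map(build_partition, partitions)
```

After the workers finish, dump each instance (e.g. with BGSAVE) and merge the resulting .rdb files with redis-rdb-tools as suggested above.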