
Storing MinHashLSH in Redis: is the query operation taking too long?

Open · MrRace opened this issue 4 years ago · 1 comment

Hi, I build the MinHashLSH index like this:

from datasketch import MinHashLSH

self.lsh = MinHashLSH(
    threshold=0.7,
    num_perm=128,
    storage_config={
        'type': 'redis',
        'basename': b'test_',
        'redis': {'host': host_ip, 'port': host_port, 'password': host_password, 'db': db_num},
    },
)

Then I run a query like this:

import time
from datasketch import MinHash

new_task_text = "mytext"
new_text_hash = MinHash(num_perm=128)
# Each character of the text is hashed as one element of the set
new_text_hash.update_batch([s.encode('utf-8') for s in new_task_text])
newminhash_end_time = time.time()
query_start_time = time.time()
similar_text_ids = self.lsh.query(new_text_hash)
query_end_time = time.time()
print("query_cost_time=", query_end_time - query_start_time)  # 28ms

The query operation costs 20ms. Does that seem too long? Is there any way to improve it? Thanks a lot!

MrRace, Nov 24 '21 12:11
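
A single `time.time()` sample like the one above can be noisy, so one way to check the latency is to repeat the call and look at the median. The following is a minimal sketch under that assumption; `time_call` is a hypothetical helper (not part of datasketch), and the stand-in workload would be swapped for `lambda: self.lsh.query(new_text_hash)`.

```python
import statistics
import time


def time_call(fn, repeats=100, warmup=5):
    """Call fn() repeatedly and return (median_ms, worst_ms)."""
    # Warm-up calls are not measured (connection setup, caches, etc.).
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), max(samples)


# Stand-in workload; replace with: lambda: self.lsh.query(new_text_hash)
median_ms, worst_ms = time_call(lambda: sum(range(1000)))
print(f"median={median_ms:.3f}ms worst={worst_ms:.3f}ms")
```

Running the same helper against the Redis-backed index and against an in-memory index should show roughly how much of the query time is storage overhead.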

It uses Redis as an external storage layer, so there is certainly overhead, and how much depends on where your Redis instance is running. How about using the simple Python in-memory storage (i.e., without specifying any storage config)?

ekzhu, Dec 04 '21 06:12