datasketch
MinHashLSH stored in Redis: why does the query operation take so long?
Hi, I build the MinHashLSH index like this:
self.lsh = MinHashLSH(
    threshold=0.7,
    num_perm=128,
    storage_config={
        'type': 'redis',
        'basename': b'test_',
        'redis': {'host': host_ip, 'port': host_port, 'password': host_password, 'db': db_num},
    },
)
Then I run a query like this:
new_task_text = "mytext"
# Build the MinHash of the query text (each character is hashed as one token)
new_text_hash = MinHash(num_perm=128)
new_text_hash.update_batch([s.encode('utf-8') for s in new_task_text])
newminhash_end_time = time.time()
# Time only the LSH lookup against the Redis-backed index
query_start_time = time.time()
similar_text_ids = self.lsh.query(new_text_hash)
query_end_time = time.time()
print("query_cost_time=", query_end_time - query_start_time)  # 28ms
The query operation takes about 20 ms (28 ms in the run above). Is that too long? Is there any way to speed it up? Thanks a lot!
The index uses Redis as an external storage layer, so every query carries some overhead, and how much depends on where your Redis instance is running relative to your application. How about trying the default in-memory Python storage (i.e., not specifying any storage_config)?