lopq

The parameters on large dataset

Open kr11 opened this issue 6 years ago • 4 comments

I find there are many parameters in the training phase. Have you run this project on large datasets, like SIFT1M (or even SIFT1B) and GIST1M? And how should I choose appropriate parameters? Thanks a lot!

kr11 • Mar 14 '18 11:03

Yes, we have run this on large datasets. The Spark version is useful if the dataset will not fit in machine memory. I recall having run it on a dataset of a few hundred million points. You should be able to run SIFT1M locally, but it may take an hour or so, if I recall correctly.

There are a number of parameters to decide on. I would refer you to this guide (particularly section 2). You may also be interested in these Python functions, which can be used to compute computational cost as a function of parameter values.
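To make the cost-versus-parameters idea concrete, here is a rough back-of-envelope sketch. It is not the set of functions linked above. It assumes the parameter names `V` (coarse clusters per split), `M` (total number of subquantizers) and `subquantizer_clusters` as I understand them from `LOPQModel`, and it assumes the model stores one rotation matrix and one mean per coarse cluster in each of the two splits. The function name `estimate_lopq_cost` is hypothetical.

```python
import math


def estimate_lopq_cost(D, V, M, subquantizer_clusters=256):
    """Rough size estimate for an LOPQ-style model (assumed layout, see above).

    D: dimension of the input vectors (split into two halves)
    V: coarse clusters per split (assumed parameter name)
    M: total subquantizers, M / 2 per split (assumed parameter name)
    """
    if D % 2 or M % 2 or (D // 2) % (M // 2):
        raise ValueError("D and M must be even, and D/2 divisible by M/2")

    half = D // 2                    # dimension of each split
    sub_dim = half // (M // 2)       # dimension seen by each subquantizer

    coarse = 2 * V * half            # coarse centroids, both splits
    rotations = 2 * V * half * half  # one half x half rotation per cluster
    means = 2 * V * half             # one residual mean per cluster
    subquantizers = 2 * (M // 2) * subquantizer_clusters * sub_dim

    # Bits needed to store one item's code: two coarse ids plus M fine ids.
    coarse_bits = 2 * math.ceil(math.log2(V))
    fine_bits = M * math.ceil(math.log2(subquantizer_clusters))

    return {
        "num_cells": V * V,
        "code_bits": coarse_bits + fine_bits,
        "model_floats": coarse + rotations + means + subquantizers,
        "rotation_floats": rotations,
    }


if __name__ == "__main__":
    # Illustrative values only: 128-d SIFT-like vectors.
    print(estimate_lopq_cost(D=128, V=1024, M=8))
```

Under these assumptions, D=128, V=1024, M=8 gives 84 bits per code and about 8.7 million model floats (roughly 35 MB as float32), almost all of it rotation matrices. That is why V tends to dominate the model size while M mostly affects the per-item code length.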

I'm happy to try to answer specific questions about parameter settings if you have them.

pumpikano • Mar 14 '18 14:03

BTW, it is not well documented, but the library contains a function, lopq.utils.load_xvecs, that can convert the binary format of SIFT1M to a NumPy array.
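For readers who want to see what such a conversion involves, here is a minimal standalone sketch of reading the `.fvecs` layout used by SIFT1M: each vector is stored as a little-endian int32 dimension followed by that many float32 values. This is not the code of `lopq.utils.load_xvecs`; the helpers `read_fvecs` and `write_fvecs` are hypothetical names used only for illustration.

```python
import numpy as np


def read_fvecs(path, max_num=None):
    """Read an .fvecs file into an (n, dim) float32 array.

    Assumes every record is [int32 dim][dim x float32], little-endian,
    and that all vectors in the file share the same dimension.
    """
    raw = np.fromfile(path, dtype="<i4")
    if raw.size == 0:
        return np.empty((0, 0), dtype=np.float32)

    dim = int(raw[0])
    rows = raw.reshape(-1, dim + 1)          # one row per record
    if not np.all(rows[:, 0] == dim):
        raise ValueError("inconsistent vector dimensions in file")
    if max_num is not None:
        rows = rows[:max_num]

    # Drop the leading dimension field and reinterpret the bits as float32.
    return rows[:, 1:].copy().view("<f4")


def write_fvecs(path, vectors):
    """Write an (n, dim) array in the same layout (handy for small tests)."""
    vectors = np.ascontiguousarray(vectors, dtype="<f4")
    n, dim = vectors.shape
    out = np.empty((n, dim + 1), dtype="<i4")
    out[:, 0] = dim
    out[:, 1:] = vectors.view("<i4")         # copy raw bits, no conversion
    out.tofile(path)
```

A round trip through `write_fvecs` and `read_fvecs` on a small array is a quick way to check the layout before pointing the reader at the real dataset files.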

pumpikano • Mar 14 '18 14:03

@pumpikano Did you run LOPQ on Spark? Could you please share your experience here? I have a large dataset that my single server can't handle.

xhappy • Mar 01 '19 00:03

Sorry, I haven't worked on this in years. At the time, we ran a Java implementation of LOPQ search (which was never part of this open-source project) and simply sharded the index on multiple machines. There is a branch of this repo that has an implementation that uses Spark to accomplish the sharding and serving (https://github.com/yahoo/lopq/blob/spark-search-cluster/spark/spark_lopq_cluster.py). I would strongly recommend against this for any production use case though — it was only intended to help test a large index within a Spark workflow, and we had a separate, battle-hardened index based on https://vespa.ai/.
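As a rough illustration of the "shard the index across machines" idea, here is a toy single-process sketch: items are split into shards, each shard answers a top-k query on its own, and the partial results are merged. It is not the Java implementation or the Spark branch mentioned above, and it uses exact brute-force distances in place of LOPQ codes. All function names are hypothetical, and item ids are assumed to be integers.

```python
import heapq


def build_shards(items, num_shards):
    """Assign (item_id, vector) pairs to shards by id modulo num_shards."""
    shards = [[] for _ in range(num_shards)]
    for item_id, vec in items:
        shards[item_id % num_shards].append((item_id, vec))
    return shards


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_shard(shard, query, k):
    """Top-k within one shard; in a real system this runs on its own machine."""
    scored = ((squared_distance(vec, query), item_id) for item_id, vec in shard)
    return heapq.nsmallest(k, scored)


def search_all(shards, query, k):
    """Query every shard, then merge the per-shard top-k lists."""
    partial = [search_shard(shard, query, k) for shard in shards]
    merged = (hit for hits in partial for hit in hits)
    return heapq.nsmallest(k, merged)


if __name__ == "__main__":
    items = [(i, (float(i), 0.0)) for i in range(10)]
    shards = build_shards(items, num_shards=3)
    print(search_all(shards, query=(4.2, 0.0), k=3))
```

Because each shard returns its own best k candidates, merging those lists gives the same top k as searching the whole collection at once, which is what makes this kind of sharding straightforward.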

pumpikano • Mar 03 '19 19:03