autofaiss issues

Results 56 autofaiss issues

try to use memory mapping instead of holding all training vectors in memory

This might also be done for the index itself. It would unlock training with more points and building larger indices with lower memory usage.

rom1504
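As a rough illustration of the memory-mapping idea in the issue above, the sketch below samples training rows from a file on disk without loading the whole file. It is not autofaiss code: the raw float32 row-major file layout and the helper names `write_vectors` and `sample_rows_mmap` are assumptions made for the example.

```python
import mmap
import os
import random
import tempfile
from array import array


def write_vectors(path, vectors):
    """Write rows of floats as raw float32, row-major (assumed layout)."""
    with open(path, "wb") as f:
        for row in vectors:
            array("f", row).tofile(f)


def sample_rows_mmap(path, dim, n_samples, seed=0):
    """Return up to n_samples random rows, copying only the sampled rows."""
    row_bytes = 4 * dim  # float32 is 4 bytes per component
    n_rows = os.path.getsize(path) // row_bytes
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(n_rows), min(n_samples, n_rows)))
    rows = []
    with open(path, "rb") as f, mmap.mmap(
        f.fileno(), 0, access=mmap.ACCESS_READ
    ) as mm:
        for i in picked:
            row = array("f")
            # Slicing the map copies just this row out of the mapped file.
            row.frombytes(mm[i * row_bytes:(i + 1) * row_bytes])
            rows.append(row.tolist())
    return rows
```

With this layout, only the sampled rows are copied into Python objects; the rest of the file stays on disk and is paged in by the operating system on demand.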

consider using distributed kmeans in distributed mode for a better training

https://github.com/facebookresearch/faiss/blob/b8fe92dfee9ea6f9c8cae27e4fc3ffeb12b5c4d2/benchs/distributed_ondisk/README.md#distributed-k-means

rom1504
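To make the distributed k-means idea above concrete, here is a minimal pure-Python sketch of the usual map/reduce formulation: each shard computes per-centroid sums and counts, and a driver merges them into new centroids. It does not reproduce the faiss implementation linked in the issue; the function names and the in-process loop standing in for workers are assumptions.

```python
def assign_partial(shard, centroids):
    """Map step: per-centroid vector sums and counts for one shard."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for x in shard:
        # Nearest centroid by squared L2 distance.
        j = min(
            range(len(centroids)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    return sums, counts


def merge_and_update(partials, centroids):
    """Reduce step: combine shard statistics into new centroids."""
    new_centroids = []
    for j, old in enumerate(centroids):
        n = sum(counts[j] for _, counts in partials)
        if n == 0:
            new_centroids.append(list(old))  # empty cluster: keep old centroid
            continue
        new_centroids.append(
            [sum(sums[j][d] for sums, _ in partials) / n for d in range(len(old))]
        )
    return new_centroids


def distributed_kmeans(shards, centroids, n_iter=10):
    """Run n_iter Lloyd iterations; the map step would run on workers."""
    for _ in range(n_iter):
        partials = [assign_partial(shard, centroids) for shard in shards]
        centroids = merge_and_update(partials, centroids)
    return centroids
```

Only the small per-centroid statistics travel between shards and the driver, so no single machine has to hold all training vectors.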

decrease memory used by merging

Currently, merging in distributed mode requires storing the whole index in memory. Possible strategies:
* improve faiss merge into to avoid putting everything in memory
* producing N index...

rom1504

add a test using hdfs as a file system to validate file system support

using https://github.com/beyondstorage/setup-hdfs

rom1504

use embedding reader start:end feature to get a proper sampled training set

The same applies to the evaluation set. Currently we use the first N vectors for both training and evaluation, which is not ideal, especially if the embedding set is not randomly shuffled.

rom1504
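One possible way to turn a `start:end` reading feature into a spread-out sample is to draw several short ranges, one from each equal slice of the dataset, rather than taking the first N vectors. The sketch below only computes the ranges; `sampled_ranges` and its parameters are hypothetical names, and it makes no assumption about the embedding reader's actual API.

```python
import random


def sampled_ranges(total, sample_size, n_chunks, seed=0):
    """Pick n_chunks non-overlapping [start, end) ranges, about sample_size rows."""
    if sample_size >= total:
        return [(0, total)]
    n_chunks = max(1, min(n_chunks, sample_size))
    chunk = sample_size // n_chunks  # rows read per range
    stride = total // n_chunks       # each range is drawn inside its own slice
    rng = random.Random(seed)
    ranges = []
    for i in range(n_chunks):
        lo = i * stride
        start = rng.randint(lo, lo + stride - chunk)
        ranges.append((start, start + chunk))
    return ranges
```

Each `(start, end)` pair could then be passed to the reader in turn. Ranges drawn with a different seed for the evaluation set may overlap the training ranges, so a real implementation would need to exclude them.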

make autofaiss not use TemporaryDirectory

`TemporaryDirectory` is a local folder which may not have any room. The user should specify the temporary folder (in fact, we already have an option for this).

rom1504
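One way to honor a user-chosen location while keeping automatic cleanup is the standard-library `dir` argument of `tempfile.TemporaryDirectory`. This is only a sketch: `make_work_dir` and `temp_root` are hypothetical names, not the existing autofaiss option.

```python
import os
import tempfile


def make_work_dir(temp_root=None):
    """Create a scratch directory under temp_root (system default when None)."""
    if temp_root is not None:
        os.makedirs(temp_root, exist_ok=True)
    # dir= places the directory under the chosen root instead of the system default.
    return tempfile.TemporaryDirectory(dir=temp_root, prefix="autofaiss_")
```

Used as `with make_work_dir("/path/with/space") as work_dir: ...`, the scratch directory lives on the volume the user picked and is removed when the block exits.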

use merging strategy in non-pyspark mode as well

The strategy of creating a few small indices reduces the memory usage during adding and (if using the special merge-on-disk function) can completely cap the memory used by autofaiss in...

rom1504

fix estimation of training memory used by autofaiss

Just tried it, and the new estimation at https://github.com/criteo/autofaiss/pull/81/files doesn't fully capture the memory needed for training. When training an index such as `OPQ32_224,IVF131072_HNSW32,PQ32x8`, faiss trains the index in 2...

rom1504

make test_get_optimal_hyperparameters less slow

It takes many minutes to run.

rom1504

consider implementing the embedding id as an array of byte instead of long in faiss

It would significantly decrease the 8-byte overhead of each item. Storing 2^63 items in an index is not possible anyway, so a full 64-bit id is wider than needed.

rom1504
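The arithmetic behind the issue above can be sketched in a few lines: compute the smallest whole number of bytes that can address a given item count, then pack ids at that width. This is plain Python for illustration, not a faiss API; the helper names are assumptions.

```python
def id_width_bytes(n_items):
    """Smallest whole number of bytes that can hold ids 0 .. n_items - 1."""
    max_id = max(n_items - 1, 0)
    return max(1, (max_id.bit_length() + 7) // 8)


def pack_ids(ids, width):
    """Store each id as a fixed-width little-endian byte string."""
    return b"".join(i.to_bytes(width, "little") for i in ids)


def unpack_ids(buf, width):
    """Inverse of pack_ids."""
    return [
        int.from_bytes(buf[o:o + width], "little")
        for o in range(0, len(buf), width)
    ]
```

For example, an index of 10^9 items needs 4 bytes per id under this scheme instead of 8, which would save about 4 GB of id storage.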
