manticoresearch
manticoresearch copied to clipboard
KNN search (prototype)
This issue is to collect ideas about possible ways of implementation of KNN search in Manticore Search.
It's a no brainer, that the first thing to look into is HNSW - https://arxiv.org/abs/1603.09320 . It's used by Lucene, Weaviate, Milvus, Vespa and others.
Nick also did a quick research on HNSW etc. about a year ago (while it wasn't yet fully implemented in Lucene) and the results were as follows (they may be outdated or wrong):
As this feature is still in the process of being implemented, there aren't any official publications on the subject, like benchmarks or something, from Lucene.
I suppose they're aware of the potential problem there, but as they're now facing a few other challenges with the HNSW's implementation ( i.e., the handling of deleted docs or the presence of additional conditions/constraints in the original query which also were mentioned by the guy from Vespa, data structures optimization, considering index hyperparameters, etc.) they haven't considered this issue closely until the most recent time.
Besides, they're going to provide another algorithm for ann(a nearest neighbour)-search, a coarse quantization, along with the HNSW. It operates with linear structures under the hood so the indexing is faster than when using HNSW with its graphes. Thus, users seem to be able to decide what is more important for them, performance of indexing or performance of searching, and to choose the appropriate approach.
Just last week, they indeed compared the search performance of the C++ hnswlib that is the most widely used ann-lib with their HNSW implementation on standard ann-benchmarking datasets and found the gap in performance to be quite notable and “bothersome”.
As for the indexing stage, I haven't managed to find the info on its benchmarking.
In fact, they made a few steps before to prevent the possible performance problems. Firstly, they came to use just one-layered HNSW graphs to reduce the time of graphs' rebuilds when merging/updating happens and they also applied a more advanced JDK API to execute all the low-level operations on vectors more efficiently, but the result is obviously not satisfactory enough so they'll still be working on it.
As for Vespa, they don't use the concept of batch indexing at all so that's not an issue for them. They just use a single graph for the operations on the whole index. Probably it can cause problems with RAM resources when handling large indexes, but i haven’t really looked into details there.
I suggest to consider s2geometry library https://github.com/google/s2geometry It's on Apache-2.0 license
Perhaps it worth taking a step back and looking at what trying to implement here. I guess comes about as this is a feature in ElasticSearch?
Seems like a generalized KNN function could be used to implement a 'geographic' index, where the vectors are simply coordinate axis. But the KNN vectors can represent arbitrary values it seems, so couldn't be restricted to 2d spherical geometry.
If wanting spatial indexing, seems it would be best implemented as an attribute index. ie could assign two attributes (lat+long) to make a spatial index, what would integrate with the GEODIST function. Seems like that could use s2geometry, or a more traditional 2D index, quad/rtree etc. (particlly when saying SELECT *,GEODIST(lat,long,0.4,0.4) AS dist FROM index WHERE dist<20 for example, would use the spatial index) ... that's a highly specialized (and common!) application of kNN.
... which then once you have that might then expand into implementing generalized kNN, using some sort of multi-dimensional data structure.
It might even lead to something like columnar which appears (in part) to be a pluggable library that allows implementing an attribute index.
A knn library could be a 'engine' and a index bunch of vectors in an attribute (to manticore, could perhaps appear as a JSON attribute!) - that stores and indexes high-dimensional vectors. HNSW might turn out the best indexing system, but there is lots of research in the space. suspect there are indexing solutions that would be more applicable to manticore.
I suggest to consider s2geometry library https://github.com/google/s2geometry
Perhaps it worth taking a step back and looking at what trying to implement here. I guess comes about as this is a feature in ElasticSearch?
I'm sorry the issue wasn't clear from the beginning. This task doesn't have anything to do with geo search, it's only about NLP in the first place and numerical KNN search in the second place. If the result can be also applicable to geo search - great, but if it won't - no problem. The typical use cases of what's meant in this issue are:
- higher search relevance by KNN reranking of top N documents after BM25 ranking.
- Input: query + search option
ranker=bm25_knn
or smth like that - Output: result set ranked better than with the existing rankers
- Input: query + search option
- similar documents search:
- Input: document id, field(attr) name(s)
- Output: result set with similar documents
https://github.com/nmslib/nmslib seems to be a good starting point.
If wanting spatial indexing, seems it would be best implemented as an attribute index. ie could assign two attributes (lat+long) to make a spatial index, what would integrate with the GEODIST function. Seems like that could use s2geometry, or a more traditional 2D index, quad/rtree etc.
Good idea and I think we'll come to it eventually to make geo search more performant, but it's out of the scope of this issue (sorry again it wasn't clear from the beginning).
An integration with https://github.com/facebookresearch/faiss is worth researching.
@sanikolaev is anyone actively researching this for integrations?
@jacobfriedman
is anyone actively researching this for integrations?
No, the team is currently focused on other things. Feel free to do it. We'll be glad to assist.
Is it possible to support Percolate mode for KNN search?
Is it possible to support Percolate mode for KNN search?
In theory yes, it should be possible as soon as Manticore supports KNN in general.
@sanikolaev
https://opensearch.org/docs/latest/search-plugins/knn/index/
Just want to know the progress of KNN support, or are there any other solutions based on Manticore can do in production.
I think we can close this issue as a duplicate of https://github.com/manticoresoftware/manticoresearch/issues/1415.
That is in progress and in the roadmap (https://roadmap.mnt.cr/), we've been working actively on that lately.
The KNN functionality has been implemented and is available in the dev packages.
Hello, this is great! Is it possible to create and configure plain tables for KNN too? From the documentation I see "create table test ( title text, image_vector float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='l2' );" I have been fiddling with the syntax for source and index definition in the config file but couldnt get it to work. Thanks.
here is an example test_276 of knn setup for RT index in plain mode, ie you could setup and use RT index with that data type. That option is the same for the plain index.
However there is no float_vector
attribute at the plain index as there is no source that could support that data type for now. If you need that for plain index it could be better to create separate feature request.
Hi, thanks for your response. Having a hard time trying KNN both in plain or rt mode in the searchd,
With plain: index test_vec { type = rt path = /var/lib/manticore/test_vec rt_field = title rt_attr_float_vector = vec knn = {"attrs":[{"name":"vec","type":"hnsw","dims":3,"hnsw_similarity":"L2","hnsw_m":16,"hnsw_ef_construction":200}]} }
and no data_dir definition in searchd.
it wont restart manticore unless I comment out the rt_attr_float_vector and knn lines. Nothing on the log or systemctl status or journal hinting at the problem.
With RT: Removing any table definition in config and adding data_dir to searchd, the daemon starts and then:
mysql -P9306 -h127.0.0.1 Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 6 Server version: 6.2.12 dc5144d35@230822 (columnar 2.2.4 5aec342@230822) (secondary 2.2.4 5aec342@230822) git branch manticore-6.2.12...origin/manticore-6.2.12
Copyright (c) 2000, 2023, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> create table test ( title text, image_vector float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='l2' ); ERROR 1064 (42000): P03: expected 'id', got 'image_vector' near 'float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='l2' )'
for plain mode you need to stop the daemon and delete RT index every time you change the index definition in the config. As after RT index was created daemon does not use config definition but the runtime info and data index stored into the .meta file
I also think that version of daemon 6.2.12 dc5144d@230822 do not support knn and float _vector You need to use package from the dev repository or in case you build daemon on your own to use master branch along with extra package that contains knn library
It was the version! ill try the plain mode tomorrow but surely it works too. thanks buddy.
mysql -P9306 -h127.0.0.1 Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 1 Server version: 6.2.13 86ffc28df@240104 dev (columnar 2.2.5 1d1e432@231204) (secondary 2.2.5 1d1e432@231204) (knn 2.2.5 1d1e432@231204) git branch master...origin/master
Copyright (c) 2000, 2023, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> create table test ( title text, image_vector float_vector knn_type='hnsw' knn_dims='4' hnsw_similarity='l2' ); Query OK, 0 rows affected (0.01 sec)