k-NN
[FEATURE] Reuse KNNVectorFieldData to reduce disk usage
Description
In some scenarios, we want to reduce the disk usage and I/O throughput of the _source field. To do so, we can exclude the k-NN fields from _source in the mapping, as shown below. The drawback is that the excluded k-NN fields can no longer be retrieved or rebuilt.
"mappings": {
  "_source": {
    "excludes": [
      "target_field1",
      "target_field2"
    ]
  }
}
So I propose using doc values for the vector fields, like:
POST some_index/_search
{
  "docvalue_fields": [
    "vector_field1",
    "vector_field2"
  ],
  "_source": false
}
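For clients that build this request programmatically, a minimal sketch of the body is shown here. The helper name `build_docvalue_search` is hypothetical and not part of any client library, and whether the vectors are actually returned depends on the k-NN field supporting docvalue_fields as proposed in this issue.

```python
import json

def build_docvalue_search(vector_fields):
    """Build a search body that asks for vectors via doc values and skips _source."""
    return {
        "docvalue_fields": list(vector_fields),
        "_source": False,
    }

# Field names are the placeholders used in the request above.
body = build_docvalue_search(["vector_field1", "vector_field2"])
print(json.dumps(body, indent=2))
```

Serializing with `json.dumps` also avoids hand-written JSON mistakes such as trailing commas, which the search API rejects.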
Proposal
- Rewrite KNNVectorDVLeafFieldData to get data from doc values
I rewrote KNNVectorDVLeafFieldData and made KNN80BinaryDocValues able to return the requested k-NN docvalue_fields, for example (vector_field1 is a k-NN field type):
"hits":[{"_index":"test","_id":"1","_score":1.0,"fields":{"vector_field1":["1.5","2.5"]}},{"_index":"test","_id":"2","_score":1.0,"fields":{"vector_field1":["2.5","1.5"]}}]
Optimization result on the 1M SIFT dataset with 1 shard: 1389 MB with source stored, 1055 MB without source stored (-24%).
As a further step for the k-NN doc values fields, I think that when the faiss engine is used, we could call its reconstruct_n interface to retrieve the specific doc values, which would save the disk space used by the BinaryDocValuesFormat. Alternatively, as in the comments on this issue, we could redesign it as a KnnVectorsFormat.
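In the example response in this item, the vector components come back as strings ("1.5", "2.5"). As a rough client-side sketch, they can be converted back to floats as follows. The helper `vectors_by_id` is hypothetical, and the response fragment is wrapped in an outer object here only so that it parses as JSON.

```python
import json

# Response fragment from the example above, wrapped in {} so it is valid JSON.
response_text = """
{"hits":[{"_index":"test","_id":"1","_score":1.0,"fields":{"vector_field1":["1.5","2.5"]}},
         {"_index":"test","_id":"2","_score":1.0,"fields":{"vector_field1":["2.5","1.5"]}}]}
"""

def vectors_by_id(hits, field):
    """Map each hit's _id to its vector, converting string components to float."""
    out = {}
    for hit in hits:
        values = hit.get("fields", {}).get(field)
        if values is not None:  # skip hits that do not carry the field
            out[hit["_id"]] = [float(v) for v in values]
    return out

hits = json.loads(response_text)["hits"]
vectors = vectors_by_id(hits, "vector_field1")
print(vectors)  # {'1': [1.5, 2.5], '2': [2.5, 1.5]}
```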
- Compose the vector field into _source
I added KNNFetchSubPhase with a processor, similar to FetchSourcePhase#FetchSubPhaseProcessor, that combines the docvalue_fields into _source, similar to synthetic source logic.
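The merge step can be illustrated on plain dictionaries. This is only a sketch of the idea in Python, not the Java code in KNNFetchSubPhase. The function name `merge_docvalues_into_source` and the `title` field are hypothetical, and leaving an existing _source value untouched is an assumption rather than confirmed behavior of the PR.

```python
import copy

def merge_docvalues_into_source(hit, vector_fields):
    """Return a copy of `hit` whose _source also contains the listed doc value fields."""
    merged = copy.deepcopy(hit)          # do not mutate the caller's hit
    source = merged.get("_source") or {}
    for name in vector_fields:
        values = merged.get("fields", {}).get(name)
        # Assumption: a value already present in _source wins over the doc value.
        if values is not None and name not in source:
            source[name] = [float(v) for v in values]
    merged["_source"] = source
    return merged

hit = {
    "_id": "1",
    "_source": {"title": "doc one"},                 # hypothetical non-vector field
    "fields": {"vector_field1": ["1.5", "2.5"]},     # vector read from doc values
}
result = merge_docvalues_into_source(hit, ["vector_field1"])
print(result["_source"])  # {'title': 'doc one', 'vector_field1': [1.5, 2.5]}
```

With this shape, a consumer such as reindex would see the vector in _source even though it was excluded from the stored source.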
Do you have any additional context?
This was discussed in issue #1087, where there are some other ideas as well. My PR is #1571.
As noted above, with the faiss engine we could use the reconstruct_n interface to retrieve the specific doc values and save the disk space used by the BinaryDocValuesFormat, or, like #1087, we could use a KnnVectorsFormat.
But the idea I want to show here is simply to reduce the disk usage (there is an issue that discusses this: https://github.com/opensearch-project/OpenSearch/issues/6356) while keeping, as far as possible, the source that reindex needs. I think PR #1571 does just that: it reduces the disk usage and keeps the source in a synthetic way.
I think we are going to need to push this to 2.15.