[FEATURE] Reuse KNNVectorFieldData to reduce disk usage

Open luyuncheng opened this issue 11 months ago • 1 comments

Description

In some scenarios we want to reduce disk usage and I/O throughput for the _source field. To do that, we would exclude the knn fields from _source in the mapping, as shown here. (The drawback is that an excluded knn field can no longer be retrieved or rebuilt from _source.)

"mappings": {
  "_source": {
    "excludes": [
      "target_field1",
      "target_field2"
    ]
  }
}

So I propose reading the vector fields through doc_values instead, like:

POST some_index/_search
{
  "docvalue_fields": [
    "vector_field1",
    "vector_field2"
  ],
  "_source": false
}
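The two request bodies above can also be generated from a list of field names, which sidesteps hand-written JSON mistakes. This is a minimal Python sketch, not part of the plugin; the helper names `source_excludes_mapping` and `docvalue_search_body` are hypothetical, and only the keys and field names come from the snippets above.

```python
import json


def source_excludes_mapping(vector_fields):
    """Index mapping that keeps the given fields out of the stored _source."""
    return {"mappings": {"_source": {"excludes": list(vector_fields)}}}


def docvalue_search_body(vector_fields):
    """Search body that reads the given fields from doc values, not _source."""
    return {"docvalue_fields": list(vector_fields), "_source": False}


# Field names taken from the examples in this issue.
print(json.dumps(source_excludes_mapping(["target_field1", "target_field2"]), indent=2))
print(json.dumps(docvalue_search_body(["vector_field1", "vector_field2"]), indent=2))
```

`json.dumps` always emits valid JSON, so the output can be pasted directly as the body of an index-creation or `_search` request.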

Proposal

  1. Rewrite KNNVectorDVLeafFieldData to get data from docvalues

I rewrote KNNVectorDVLeafFieldData so that KNN80BinaryDocValues can return the requested knn docvalue_fields, like this (vector_field1 is a knn field type):

"hits":[{"_index":"test","_id":"1","_score":1.0,"fields":{"vector_field1":["1.5","2.5"]}},{"_index":"test","_id":"2","_score":1.0,"fields":{"vector_field1":["2.5","1.5"]}}]
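Note that the vector components in the response above come back as strings. On the client side they can be converted back to floats; the following is a small Python sketch using the sample hits from this issue. The helper name `vectors_by_id` is hypothetical.

```python
import json

# The "hits" array copied from the sample response in this issue.
SAMPLE_HITS = (
    '[{"_index":"test","_id":"1","_score":1.0,"fields":{"vector_field1":["1.5","2.5"]}},'
    '{"_index":"test","_id":"2","_score":1.0,"fields":{"vector_field1":["2.5","1.5"]}}]'
)


def vectors_by_id(hits, field):
    """Map each hit's _id to its vector, converting string components to floats."""
    return {
        hit["_id"]: [float(x) for x in hit["fields"][field]]
        for hit in hits
        if field in hit.get("fields", {})
    }


vectors = vectors_by_id(json.loads(SAMPLE_HITS), "vector_field1")
print(vectors)  # {'1': [1.5, 2.5], '2': [2.5, 1.5]}
```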

Optimization result (1M SIFT dataset, 1 shard): 1389 MB with _source stored, 1055 MB without _source stored (-24%).
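The -24% figure follows directly from the two index sizes reported above, as this short Python check shows (the function name `relative_change` is just for illustration):

```python
def relative_change(before_mb, after_mb):
    """Relative size change when going from before_mb to after_mb."""
    return (after_mb - before_mb) / before_mb


# Sizes reported in this issue: with and without stored _source.
change = relative_change(1389, 1055)
print(f"{change:.1%}")  # -24.0%
```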

As a further step for the knn docvalues fields: when the faiss engine is used, I think we could use the reconstruct_n interface to retrieve specific doc values and save the disk space used by BinaryDocValuesFormat, or, as discussed in this issue's comments, redesign a KnnVectorsFormat.

  2. Composite the vector field into _source

I added a KNNFetchSubPhase with a processor, similar to FetchSourcePhase#FetchSubPhaseProcessor, that combines the docvalue_fields into _source, something like synthetic source logic.
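The idea of folding doc-value fields back into _source can be illustrated outside the plugin. The sketch below is plain Python operating on a search hit as a dict; it is not the Java code from the PR, and the function name `merge_docvalues_into_source` and the sample hit are assumptions made for illustration.

```python
import copy


def merge_docvalues_into_source(hit, vector_fields):
    """Return a copy of the hit whose _source also contains the given
    vector fields, taken from the hit's doc-value "fields" section."""
    merged = copy.deepcopy(hit)  # leave the caller's hit untouched
    source = merged.get("_source") or {}
    for name in vector_fields:
        values = merged.get("fields", {}).get(name)
        if values is not None:
            # Doc-value components arrive as strings in the sample response.
            source[name] = [float(v) for v in values]
    merged["_source"] = source
    return merged


# Hypothetical hit: the vector was excluded from _source but is in doc values.
sample_hit = {
    "_index": "test",
    "_id": "1",
    "_source": {"title": "doc one"},
    "fields": {"vector_field1": ["1.5", "2.5"]},
}
print(merge_docvalues_into_source(sample_hit, ["vector_field1"])["_source"])
```

A client or a reindex job reading the merged hit sees the vector under _source even though it was never stored there, which is the behavior the synthetic approach is after.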

Do you have any additional context?

This was discussed in issue #1087, where there are some other ideas. My PR is #1571.

As a further step for the knn docvalues fields: when the faiss engine is used, I think we could use the reconstruct_n interface to retrieve specific doc values and save the disk space used by BinaryDocValuesFormat. Or, as in #1087, we could use KnnVectorsFormat.

BUT the idea I want to show here is only about reducing disk usage, which https://github.com/opensearch-project/OpenSearch/issues/6356 also discusses, while keeping, as far as possible, the source that reindex needs. I think PR #1571 does just that: it reduces disk usage and keeps the source in a synthetic way.

luyuncheng avatar Mar 20 '24 16:03 luyuncheng

I think we are going to need to push this to 2.15.

jmazanec15 avatar Apr 30 '24 17:04 jmazanec15