k-NN
[FEATURE] Inbuilt Byte quantizer to convert float 32 bits to 8 bit
Is your feature request related to a problem? Inbuilt byte quantizer to convert 32-bit floats to 8-bit. Related to https://forum.opensearch.org/t/byte-size-vector-with-neural-search-on-the-fly/16416/2
What solution would you like? An inbuilt byte quantizer that converts 32-bit floats to 8-bit.
What alternatives have you considered? Another approach could be an ingest processor ("byte quantizer") that takes 32-bit float vectors and scalar-quantizes them to byte vectors.
Lucene recently added scalar quantization support inside codec: https://github.com/apache/lucene/pull/12582. Would this solve the use case?
@jmazanec15 thanks for pointing this out. We should expose this through k-NN mappings/index settings or some other way. Given Lucene already has support, we could prioritize it for the 2.11 launch.
@vamshin it's a newer feature and would require some testing on our end. Pretty interesting though - they found that they could avoid recomputing scalar quantization parameters for every segment while preserving recall.
Looking forward to this one; hopefully it makes it into v2.14.0 without being pushed to further releases.
Does this feature work with the neural-search plugin out of the box, hassle-free? If I understood correctly, this feature will enable the _predict API to return quantized embeddings in, say, int8, such that neural-search will automatically understand them and there is no need for manual quantization.
Does this feature work with the neural-search plugin out of the box, hassle-free?
Yes, this feature will work out of the box with the Neural Search plugin when you are using an ingest processor to convert text to embeddings during ingestion. It works the same way for a neural query as well.
If I understood correctly, this feature will enable the _predict API to return quantized embeddings in, say, int8, such that neural-search will automatically understand them and there is no need for manual quantization.
No, that is not how this feature will work. At the lowest level (i.e., segments), we will quantize the 32-bit floats to int8. All you need to do is create the kNN vector field with the right quantizer during index creation; the rest is taken care of. No changes to the _predict API are required.
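To make "create the kNN vector field with the right quantizer" concrete, here is a minimal sketch using the opensearch-py client. The sq encoder name and its placement under the Lucene method parameters are assumptions taken from the proposal later in this thread, not a finalized API:
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="my-knn-index",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "my_vector": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "space_type": "l2",
                        "parameters": {
                            # hypothetical encoder: quantize fp32 -> int8 inside segments
                            "encoder": {"name": "sq"},
                            "ef_construction": 256,
                            "m": 8,
                        },
                    },
                }
            }
        },
    },
)

# Documents are still ingested as plain fp32 vectors; quantization happens at the
# segment level, so neither the ingest pipeline nor the _predict API changes.
client.index(index="my-knn-index", body={"my_vector": [0.01] * 384})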
Thank you @naveentatikonda for the clarification. It is crucial to have neural query work with it seamlessly, otherwise it won't be of much use. Also, for indices that don't use neural query, the _predict API will have to produce quantized vectors to avoid manual intermediate quantization by end users.
In my current use case, I am using a kNN index, calling the _predict API to generate vectors, and configuring a default search pipeline with the same model id used when calling the _predict API. After that, users use neural query to search the index. If neural query does not understand quantized kNN indices and _predict does not produce quantized vectors, then there is no way to know how the vectors are quantized and the kNN index won't be easy to search.
You may ask why I am not using an ingest pipeline: because it does not support synonym resolution, which is crucial in my use case. I had to do the ingestion from outside to reflect synonyms in the generated vectors.
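For context, the manual intermediate quantization described above is roughly the following: take the fp32 embedding returned by the _predict API and scale/round it client-side before indexing it into a byte field. A minimal sketch using a simple per-vector max-magnitude scheme (an illustration, not any particular library's method):
import numpy as np

def quantize_to_int8(embedding):
    # Scale the fp32 embedding into the int8 range [-128, 127] using the
    # vector's own maximum magnitude; a simplistic per-vector scheme.
    vec = np.asarray(embedding, dtype=np.float32)
    scale = 127.0 / max(float(np.max(np.abs(vec))), 1e-12)
    return np.clip(np.round(vec * scale), -128, 127).astype(np.int8)

# embedding: the fp32 vector returned by the _predict API for a piece of text
embedding = [0.12, -0.53, 0.08]
int8_vector = quantize_to_int8(embedding).tolist()   # values ready for a "byte" kNN field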
This is a great feature and I am looking forward to using it. Is it similar to the binary quantization technique mentioned in https://huggingface.co/blog/embedding-quantization? It can produce 32x compression while maintaining accuracy above 90%.
Below is an example Java code snippet.
public static int[] binarize(float[] vec) {
    // Pack one sign bit per dimension: 8 dimensions per output element.
    int[] bvec = new int[(int) Math.ceil(((float) vec.length) / 8)];
    int byteIndex = 0;
    int bitIndex = 7;
    byte byteValue = 0;
    for (int i = 0; i < vec.length; i++) {
        // Positive components map to 1, everything else to 0.
        int bitValue = vec[i] > 0 ? 1 : 0;
        byteValue |= bitValue << bitIndex;
        if (bitIndex == 0) {
            // Shift the packed byte from [0, 255] into the signed range [-128, 127].
            bvec[byteIndex] = (byteValue & 0xff) - 128;
            byteIndex++;
            bitIndex = 7;
            byteValue = 0;
        } else {
            bitIndex--;
        }
    }
    return bvec;
}
The above is equivalent to the sentence_transformers.quantization.quantize_embeddings function in the Python code below.
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."])
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
@asfoorial binary quantization will come in https://github.com/opensearch-project/k-NN/issues/1779
It seems like the most straightforward way to expose Lucene's built-in scalar quantization within OpenSearch would be to allow an encoder to be configured for the knn_vector field, similar to how sq is exposed for the faiss engine:
"method": {
"name":"hnsw",
"engine":"lucene",
"space_type": "l2",
"parameters":{
"encoder": {
"name": "sq"
},
"ef_construction": 256,
"m": 8
}
}
The default encoder would be flat, which would be the current behaviour.
There are three additional configuration options made available by Lucene which could possibly also be exposed (a rough numeric sketch of what they control follows the list):
- confidenceInterval (float) - controls the confidence interval used during quantization. It allows two special values:
  - null - the confidence interval is derived from the number of dimensions, the interval increasing with the number of dimensions. This is the default.
  - 0 - the interval is determined dynamically based on sampling.
- bits (int) - the number of bits to use for the quantization (between 1 and 8 inclusive). Defaults to 7.
- compress (boolean) - controls whether to compress values to a single byte when bits <= 4. Defaults to true.
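For intuition, here is a rough numeric sketch of what the confidence interval and bit width control during scalar quantization. This is an illustration of the general technique with made-up helper names, not Lucene's actual implementation:
import numpy as np

def scalar_quantize(vec, bits=7, confidence_interval=0.9):
    # Keep the central `confidence_interval` mass of the values and clip the
    # rest, so a few outliers do not stretch the quantization range.
    tail = (1.0 - confidence_interval) / 2.0
    lower, upper = np.quantile(vec, [tail, 1.0 - tail])
    levels = (1 << bits) - 1                      # bits=7 -> 127 quantization levels
    scale = (upper - lower) / levels if upper > lower else 1.0
    quantized = np.clip(np.round((vec - lower) / scale), 0, levels).astype(np.int8)
    return quantized, float(lower), float(scale)

vec = np.random.default_rng(0).normal(size=768).astype(np.float32)
q, offset, scale = scalar_quantize(vec)
approx = q.astype(np.float32) * scale + offset    # roughly reconstructs the original floats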
I think if you were to expose additional options, control of the confidenceInterval and bits makes the most sense. compress seems like it could just always be kept as true. The defaults for confidenceInterval and bits seem reasonable and could be used as the defaults for OpenSearch as well when sq is enabled. Below would be my suggested way of exposing it:
"encoder": {
"name": "sq",
"parameters": {
"confidence_interval": "dimension"
}
}
"encoder": {
"name": "sq",
"parameters": {
"confidence_interval": "dynamic"
}
}
"encoder": {
"name": "sq",
"parameters": {
"confidence_interval": 0.3
}
}
"encoder": {
"name": "sq",
"parameters": {
"bits": 7
}
}
I also think though that this feature could be exposed without exposing the additional configuration as well and just using the Lucene defaults.
It seems like the most straightforward way to expose Lucene's built-in scalar quantization within OpenSearch would be to allow an encoder to be configured for the knn_vector field, similar to how sq is exposed for the faiss engine
Yes @jhinch, planning to do something similar to keep it consistent with the Faiss UX.
"encoder": { "name": "sq", "parameters": { "confidence_interval": 0.3 } }
nit: accepted values for confidenceInterval are null, 0, or >= 0.9 && <= 1.0
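A tiny sketch of that accepted-values constraint as a client-side validation check (illustrative only):
def is_valid_confidence_interval(ci):
    # Accepted values per the nit above: null/unset, 0 (dynamic), or 0.9..1.0 inclusive.
    return ci is None or ci == 0 or 0.9 <= ci <= 1.0

assert is_valid_confidence_interval(None)
assert is_valid_confidence_interval(0)
assert is_valid_confidence_interval(0.95)
assert not is_valid_confidence_interval(0.3)   # the 0.3 in the quoted example would be rejected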
Currently, if I configure a knn-vector index to have a type of "byte" instead of "float" then do I have to supply byte-quantized vectors or can I supply float32 to OpenSearch and expect it to perform the quantization itself?
Currently, if I configure a knn-vector index to have a type of "byte" instead of "float" then do I have to supply byte-quantized vectors or can I supply float32 to OpenSearch and expect it to perform the quantization itself?
@Garth-brick if you specify data_type as byte then you need to provide byte-quantized vectors as input; see the documentation for reference.
But after this feature is added, you can provide fp32 vectors and it takes care of the quantization.
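To make the current behaviour concrete, here is a minimal sketch of the existing byte-vector path, assuming the documented data_type: byte mapping for the Lucene engine; the field expects integer values in [-128, 127] that you have quantized yourself:
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Today: a byte field expects already-quantized integers in [-128, 127].
client.indices.create(
    index="byte-index",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "my_vector": {
                    "type": "knn_vector",
                    "dimension": 3,
                    "data_type": "byte",
                    "method": {"name": "hnsw", "engine": "lucene", "space_type": "l2"},
                }
            }
        },
    },
)
client.index(index="byte-index", body={"my_vector": [12, -53, 8]})  # ints, not fp32

# Once this feature lands, the idea is that the field can accept fp32 vectors instead
# and the engine quantizes them internally (see the encoder discussion above).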
@naveentatikonda can we close this issue?
@naveentatikonda can we close this issue?
yes, closing it