k-NN
[FEATURE] Inbuilt Byte quantizer to convert float 32 bits to 8 bit
Is your feature request related to a problem? Inbuilt byte quantizer to convert 32-bit floats to 8-bit. Related to https://forum.opensearch.org/t/byte-size-vector-with-neural-search-on-the-fly/16416/2
What solution would you like? An inbuilt byte quantizer that converts 32-bit floats to 8-bit.
What alternatives have you considered? Another approach could be an ingest processor ("byte quantizer") that takes 32-bit float vectors and scalar-quantizes them to byte vectors.
Lucene recently added scalar quantization support inside codec: https://github.com/apache/lucene/pull/12582. Would this solve the use case?
@jmazanec15 thanks for pointing this out. We should expose this through k-NN mappings/index settings or some other way. Given Lucene already has support, we could prioritize it for the 2.11 launch.
@vamshin it's a newer feature and would require some testing on our end. Pretty interesting though - they found that they could avoid recomputing scalar quantization parameters for every segment while preserving recall.
Looking forward to this one; hopefully it makes it into v2.14.0 without being pushed to further releases.
Does this feature work with the neural-search plugin out of the box, hassle-free? If I understood correctly, this feature will enable the _predict API to return quantized embeddings in, say, int8, such that neural-search will automatically understand them and there is no need for manual quantization.
Does this feature work with the neural-search plugin out of the box, hassle-free?
Yes, this feature will work out of the box with the Neural Search plugin when you are using an ingest processor to convert text to embeddings during ingestion. It works the same way for a neural query as well.
If I understood correctly, this feature will enable the _predict API to return quantized embeddings in, say, int8, such that neural-search will automatically understand them and there is no need for manual quantization.
No, that is not how this feature will work. At the lowest level (i.e., segments), we will quantize the 32-bit floats to int8. All you need to do is create the kNN vector field with the right quantizer during index creation; the rest is taken care of. No changes to the _predict API are required.
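To make "create the kNN vector field with the right quantizer" concrete, here is a minimal sketch using the opensearch-py client. The sq encoder name and its placement under the Lucene method parameters are assumptions taken from the proposal later in this thread, not a finalized API:
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="my-knn-index",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "my_vector": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "space_type": "l2",
                        "parameters": {
                            # hypothetical encoder: quantize fp32 -> int8 inside segments
                            "encoder": {"name": "sq"},
                            "ef_construction": 256,
                            "m": 8,
                        },
                    },
                }
            }
        },
    },
)

# Documents are still ingested as plain fp32 vectors; quantization happens at the
# segment level, so neither the ingest pipeline nor the _predict API changes.
client.index(index="my-knn-index", body={"my_vector": [0.01] * 384})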
Thank you @naveentatikonda for the clarification. It is crucial to have neural query work with it seamlessly, otherwise it won't be of much use. Also, for indices that don't use neural query, the _predict API will have to produce quantized vectors to avoid manual intermediate quantization by end users.
In my current use case, I am using a kNN index, calling the _predict API to generate vectors, and configuring a default search pipeline with the same model id used when calling the _predict API. After that, users use neural query to search the index. If neural query does not understand quantized kNN indices and _predict does not produce quantized vectors, then there is no way to know how the vectors are quantized and the kNN index won't be easy to search.
You may ask why I am not using an ingest pipeline: because it does not support synonym resolution, which is crucial in my use case. I had to do the ingestion from outside to reflect synonyms in the generated vectors.
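For context, the manual intermediate quantization described above is roughly the following: take the fp32 embedding returned by the _predict API and scale/round it client-side before indexing it into a byte field. A minimal sketch using a simple per-vector max-magnitude scheme (an illustration, not any particular library's method):
import numpy as np

def quantize_to_int8(embedding):
    # Scale the fp32 embedding into the int8 range [-128, 127] using the
    # vector's own maximum magnitude; a simplistic per-vector scheme.
    vec = np.asarray(embedding, dtype=np.float32)
    scale = 127.0 / max(float(np.max(np.abs(vec))), 1e-12)
    return np.clip(np.round(vec * scale), -128, 127).astype(np.int8)

# embedding: the fp32 vector returned by the _predict API for a piece of text
embedding = [0.12, -0.53, 0.08]
int8_vector = quantize_to_int8(embedding).tolist()   # values ready for a "byte" kNN field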
This is a great feature and I am looking forward to using it. Is it similar to the binary quantization technique mentioned in https://huggingface.co/blog/embedding-quantization? It can produce 32x compression while maintaining accuracy above 90%.
Below is an example Java code snippet.
public static int[] binarize(float[] vec) {
    // Pack one sign bit per dimension: 8 dimensions per output element.
    int[] bvec = new int[(int) Math.ceil(((float) vec.length) / 8)];
    int byteIndex = 0;
    int bitIndex = 7;
    byte byteValue = 0;
    for (int i = 0; i < vec.length; i++) {
        // Positive components map to 1, everything else to 0.
        int bitValue = vec[i] > 0 ? 1 : 0;
        byteValue |= bitValue << bitIndex;
        if (bitIndex == 0) {
            // Shift the packed byte from [0, 255] into the signed range [-128, 127].
            bvec[byteIndex] = (byteValue & 0xff) - 128;
            byteIndex++;
            bitIndex = 7;
            byteValue = 0;
        } else {
            bitIndex--;
        }
    }
    return bvec;
}
The above is equivalent to the sentence_transformers.quantization.quantize_embeddings function in the Python code below.
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."])
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
@asfoorial binary quantization will come in https://github.com/opensearch-project/k-NN/issues/1779
It seems like the most straightforward way to expose Lucene's built-in scalar quantization within OpenSearch would be to allow an encoder to be configured for the knn_vector field, similar to how sq is exposed for the faiss engine:
"method": {
"name":"hnsw",
"engine":"lucene",
"space_type": "l2",
"parameters":{
"encoder": {
"name": "sq"
},
"ef_construction": 256,
"m": 8
}
}
The default encoder would be flat, which would be the current behaviour.
There are three additional configuration options made available by Lucene which could possibly also be exposed (a rough numeric sketch of what they control follows the list):
- confidenceInterval (float) - controls the confidence interval used during quantization. It allows two special values:
  - null - the confidence interval is derived from the number of dimensions, the interval increasing with the number of dimensions. This is the default.
  - 0 - the interval is determined dynamically based on sampling.
- bits (int) - the number of bits to use for the quantization (between 1 and 8 inclusive). Defaults to 7.
- compress (boolean) - controls whether to compress values to a single byte when bits <= 4. Defaults to true.
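For intuition, here is a rough numeric sketch of what the confidence interval and bit width control during scalar quantization. This is an illustration of the general technique with made-up helper names, not Lucene's actual implementation:
import numpy as np

def scalar_quantize(vec, bits=7, confidence_interval=0.9):
    # Keep the central `confidence_interval` mass of the values and clip the
    # rest, so a few outliers do not stretch the quantization range.
    tail = (1.0 - confidence_interval) / 2.0
    lower, upper = np.quantile(vec, [tail, 1.0 - tail])
    levels = (1 << bits) - 1                      # bits=7 -> 127 quantization levels
    scale = (upper - lower) / levels if upper > lower else 1.0
    quantized = np.clip(np.round((vec - lower) / scale), 0, levels).astype(np.int8)
    return quantized, float(lower), float(scale)

vec = np.random.default_rng(0).normal(size=768).astype(np.float32)
q, offset, scale = scalar_quantize(vec)
approx = q.astype(np.float32) * scale + offset    # roughly reconstructs the original floats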
I think if you were to expose additional options, control of the confidenceInterval and bits makes the most sense. compress seems like it could just always be kept as true. The defaults for confidenceInterval and bits seem reasonable and could be used as the defaults for OpenSearch as well when sq is enabled. Below would be my suggested way of exposing it:
"encoder": {
"name": "sq",
"parameters": {
"confidence_interval": "dimension"
}
}
"encoder": {
"name": "sq",
"parameters": {
"confidence_interval": "dynamic"
}
}
"encoder": {
"name": "sq",
"parameters": {
"confidence_interval": 0.3
}
}
"encoder": {
"name": "sq",
"parameters": {
"bits": 7
}
}
I also think though that this feature could be exposed without exposing the additional configuration as well and just using the Lucene defaults.
It seems like the most straightforward way to expose Lucene's built-in scalar quantization within OpenSearch would be to allow an encoder to be configured for the knn_vector field, similar to how sq is exposed for the faiss engine
Yes @jhinch, planning to do something similar to keep it consistent with the Faiss UX.
"encoder": { "name": "sq", "parameters": { "confidence_interval": 0.3 } }
nit: accepted values for confidenceInterval are null, 0, or >= 0.9 && <= 1.0
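A tiny sketch of that accepted-values constraint as a client-side validation check (illustrative only):
def is_valid_confidence_interval(ci):
    # Accepted values per the nit above: null/unset, 0 (dynamic), or 0.9..1.0 inclusive.
    return ci is None or ci == 0 or 0.9 <= ci <= 1.0

assert is_valid_confidence_interval(None)
assert is_valid_confidence_interval(0)
assert is_valid_confidence_interval(0.95)
assert not is_valid_confidence_interval(0.3)   # the 0.3 in the quoted example would be rejected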
Currently, if I configure a knn-vector index to have a type of "byte" instead of "float" then do I have to supply byte-quantized vectors or can I supply float32 to OpenSearch and expect it to perform the quantization itself?
Currently, if I configure a knn-vector index to have a type of "byte" instead of "float" then do I have to supply byte-quantized vectors or can I supply float32 to OpenSearch and expect it to perform the quantization itself?
@Garth-brick if you specify data_type as byte then you need to provide byte-quantized vectors as input; see the documentation for reference.
But after this feature is added, you can provide fp32 vectors and it takes care of the quantization.
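To make the current behaviour concrete, here is a minimal sketch of the existing byte-vector path, assuming the documented data_type: byte mapping for the Lucene engine; the field expects integer values in [-128, 127] that you have quantized yourself:
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Today: a byte field expects already-quantized integers in [-128, 127].
client.indices.create(
    index="byte-index",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "my_vector": {
                    "type": "knn_vector",
                    "dimension": 3,
                    "data_type": "byte",
                    "method": {"name": "hnsw", "engine": "lucene", "space_type": "l2"},
                }
            }
        },
    },
)
client.index(index="byte-index", body={"my_vector": [12, -53, 8]})  # ints, not fp32

# Once this feature lands, the idea is that the field can accept fp32 vectors instead
# and the engine quantizes them internally (see the encoder discussion above).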
@naveentatikonda can we close this issue?
@naveentatikonda can we close this issue?
yes, closing it