elasticsearch
Adding support for hex-encoded byte vectors on knn-search
This PR updates the parsing of the query_vector param in both knn-search & knn-query to support hex-encoded byte vectors. Each pair of hex digits encodes one signed byte, so 40, 0a and e2 correspond to 64, 10 and -30. This means that the following two requests are now equivalent and yield the same results (the same goes for the knn query).
POST my_index/_search
{
  "knn": {
    "query_vector": [64, 10, -30],
    "field": "my_vector_byte",
    "k": 10,
    "num_candidates": 100
  },
  "size": 10
}

POST my_index/_search
{
  "knn": {
    "query_vector": "400ae2",
    "field": "my_vector_byte",
    "k": 10,
    "num_candidates": 100
  },
  "size": 10
}
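To make the equivalence of the two requests concrete, here is a minimal sketch of how a hex string maps onto signed bytes. It uses the JDK's `java.util.HexFormat` (Java 17+) rather than whatever parsing code this PR actually adds, and the class name `HexQueryVector` is made up for illustration.

```java
import java.util.Arrays;
import java.util.HexFormat;

// Hypothetical helper for illustration only; not part of Elasticsearch.
final class HexQueryVector {

    // Decodes a hex string into signed bytes: every two hex digits form one dimension.
    static byte[] decode(String hex) {
        // parseHex throws IllegalArgumentException for odd lengths or non-hex characters.
        return HexFormat.of().parseHex(hex);
    }

    public static void main(String[] args) {
        // 0x40 = 64, 0x0a = 10, 0xe2 = 226, which is -30 as a signed byte.
        System.out.println(Arrays.toString(decode("400ae2"))); // prints [64, 10, -30]
    }
}
```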
The same parsing also applies during indexing, so both of the following (equivalent) formats are now supported:
POST my_index/_doc
{
  "my_vector_byte": [64, -10, -30]
}

POST my_index/_doc
{
  "my_vector_byte": "40f6e2"
}
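For clients that already hold a `byte[]`, producing the hex form is the inverse operation. The sketch below again relies only on the JDK's `java.util.HexFormat`; the class name `HexDocVector` is hypothetical and nothing here is taken from the Elasticsearch codebase.

```java
import java.util.HexFormat;

// Hypothetical client-side helper for illustration only.
final class HexDocVector {

    // Encodes signed bytes as lowercase hex, two digits per dimension.
    static String encode(byte[] vector) {
        return HexFormat.of().formatHex(vector);
    }

    public static void main(String[] args) {
        // -10 is 0xf6 and -30 is 0xe2 in two's complement.
        System.out.println(encode(new byte[] {64, -10, -30})); // prints 40f6e2
    }
}
```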
Pinging @elastic/es-search (Team:Search)
Hi @pmpailis, I've created a changelog YAML for you.
@elasticmachine update branch
merge conflict between base and head
@elasticmachine update branch
Posting a small recap after the latest commit, as things have changed somewhat and there were discussions in different comments:
- A new VectorData record has been introduced, which:
  * defines its own toXContent and parseXContent methods
  * implements Writeable and specifies serialization & de-serialization
  * parses all incoming vectors as float vectors, except for hex-encoded values, which are loaded directly as byte[]
  * has asFloatVector and asByteVector methods that both try to convert the underlying vector to the desired type; the float -> byte conversion could throw based on ElementType.BYTE.checkVectorBounds
  * when serializing to older nodes, always writes float arrays
  * when reading from older nodes, always loads from floats
- createExactKnnQuery and createKnnQuery have been updated to expect a VectorData and act accordingly
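The conversion behaviour described in the recap can be sketched roughly as below. This is not the actual VectorData record: `VectorDataSketch`, `fromFloats` and `fromHex` are hypothetical names, XContent and Writeable handling are left out, and the bounds check is an assumption (whole numbers in [-128, 127]) standing in for ElementType.BYTE.checkVectorBounds.

```java
import java.util.Arrays;
import java.util.HexFormat;

// Hypothetical sketch; exactly one of the two arrays is expected to be non-null.
record VectorDataSketch(float[] floatVector, byte[] byteVector) {

    // Regular numeric arrays are kept as floats.
    static VectorDataSketch fromFloats(float[] vector) {
        return new VectorDataSketch(vector, null);
    }

    // Hex-encoded values are loaded directly as bytes.
    static VectorDataSketch fromHex(String hex) {
        return new VectorDataSketch(null, HexFormat.of().parseHex(hex));
    }

    // Widening byte -> float never loses information.
    float[] asFloatVector() {
        if (floatVector != null) {
            return floatVector;
        }
        float[] out = new float[byteVector.length];
        for (int i = 0; i < byteVector.length; i++) {
            out[i] = byteVector[i];
        }
        return out;
    }

    // Narrowing float -> byte can fail; the check here is an assumed stand-in
    // for the real bounds check and rejects fractions, NaN and out-of-range values.
    byte[] asByteVector() {
        if (byteVector != null) {
            return byteVector;
        }
        byte[] out = new byte[floatVector.length];
        for (int i = 0; i < floatVector.length; i++) {
            float f = floatVector[i];
            if (f % 1.0f != 0.0f || f < Byte.MIN_VALUE || f > Byte.MAX_VALUE) {
                throw new IllegalArgumentException("value [" + f + "] at dimension " + i + " is not a valid byte");
            }
            out[i] = (byte) f;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fromHex("40f6e2").asByteVector()));
        System.out.println(Arrays.toString(fromFloats(new float[] {64f, -10f, -30f}).asByteVector()));
    }
}
```

Both calls in `main` end up with the same byte values, which is the equivalence the PR description relies on.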
Things under discussion:
- whether we want to support converting to float when a user has provided a hex vector (we also have to consider the desired bwc for this)
- serialization to older nodes (this has been addressed with the latest commit, but very much depends on the decision on the above)
@elasticmachine update branch
run elasticsearch-ci/part-1
@elasticmachine update branch
@elasticmachine update branch
run elasticsearch-ci/part-1
The following tests are currently failing, most likely as a side-effect of another test (LoggerTests) updating the log-level for the root logger.
Tests with failures:
- org.elasticsearch.snapshots.SnapshotResiliencyTests.testIndexNotFoundExceptionLogging
- org.elasticsearch.snapshots.SnapshotResiliencyTests.testFullSnapshotUnassignedShards
- org.elasticsearch.snapshots.SnapshotResiliencyTests.testIllegalArgumentExceptionLogging
- org.elasticsearch.snapshots.SnapshotResiliencyTests.testSnapshotNameAlreadyInUseExceptionLogging
Once this PR is merged, we can proceed with merging this one as well.
Thanks everyone for the thorough reviews and the discussions ❤️