
Adding support for hex-encoded byte vectors on knn-search

Open pmpailis opened this issue 1 year ago • 3 comments

This PR updates the parsing of the query_vector param in both knn-search and knn-query to accept hex-encoded byte vectors, where each pair of hex digits encodes one signed byte. As a result, the following two requests are equivalent and yield the same results (the same goes for the knn query).

POST my_index/_search
{
    "knn":{
        "query_vector": [64, 10, -30],
        "field": "my_vector_byte",
        "k": 10,
        "num_candidates": 100
    },
    "size": 10
}

POST my_index/_search
{
    "knn":{
        "query_vector": "400ae2",
        "field": "my_vector_byte",
        "k": 10,
        "num_candidates": 100
    },
    "size": 10
}
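
To see why the two query vectors above match, the hex string can be decoded by hand: every two hex digits form one byte, read as a signed 8-bit value. The following is a minimal Python sketch of that decoding, not the Elasticsearch implementation; the helper name hex_to_byte_vector is hypothetical.

```python
def hex_to_byte_vector(hex_string: str) -> list:
    """Decode a hex string into a list of signed 8-bit integers.

    Hypothetical helper for illustration only; it is not part of
    Elasticsearch or of this PR.
    """
    # bytes.fromhex raises ValueError on odd length or non-hex characters.
    raw = bytes.fromhex(hex_string)
    # Reinterpret each unsigned byte (0..255) as a signed byte (-128..127).
    return [b - 256 if b > 127 else b for b in raw]


if __name__ == "__main__":
    # 0x40 = 64, 0x0a = 10, 0xe2 = 226, which is -30 as a signed byte
    print(hex_to_byte_vector("400ae2"))
```

A hex string therefore carries exactly one byte per dimension, so its length is twice the vector's dimension count.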

The same parsing also takes place during indexing, so both of the following (equivalent) formats are now supported:

POST my_index/_doc
{
    "my_vector_byte": [64, -10, -30]
}

POST my_index/_doc
{
    "my_vector_byte": "40f6e2"
}
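
For clients that already hold a signed byte array, the hex form used when indexing can be produced by going the other way. This is a rough sketch under the same reading (two hex digits per signed byte); byte_vector_to_hex is a hypothetical client-side helper, not an Elasticsearch API.

```python
def byte_vector_to_hex(vector) -> str:
    """Encode signed 8-bit integers as a lowercase hex string.

    Hypothetical client-side helper for illustration only.
    """
    for value in vector:
        if not -128 <= value <= 127:
            raise ValueError(f"value {value} does not fit in a signed byte")
    # Masking with 0xFF maps a signed byte to its unsigned two's-complement form.
    return bytes(value & 0xFF for value in vector).hex()


if __name__ == "__main__":
    print(byte_vector_to_hex([64, -10, -30]))
```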

pmpailis avatar Feb 12 '24 12:02 pmpailis

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine avatar Feb 12 '24 12:02 elasticsearchmachine

Hi @pmpailis, I've created a changelog YAML for you.

elasticsearchmachine avatar Feb 12 '24 12:02 elasticsearchmachine

@elasticmachine update branch

pmpailis avatar Feb 20 '24 11:02 pmpailis

merge conflict between base and head

elasticmachine avatar Feb 20 '24 11:02 elasticmachine

@elasticmachine update branch

pmpailis avatar Feb 21 '24 22:02 pmpailis

Posting a small recap after the latest commit, as things have changed somewhat and there were discussions in different comments:

  • A new VectorData record has been introduced, which:
      * defines its own toXContent and parseXContent methods
      * implements Writeable and specifies serialization & de-serialization
      * parses all incoming vectors as float vectors, except for hex-encoded values, which are loaded directly as byte[]
      * has asFloatVector and asByteVector methods that each try to convert the underlying vector to the desired type; the float -> byte conversion could throw based on ElementType.BYTE.checkVectorBounds
      * always writes float arrays when serializing to older nodes
      * always loads from floats when reading from older nodes
  • createExactKnnQuery and createKnnQuery have been updated to expect a VectorData and act accordingly
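
To make the conversion rules in the recap concrete, here is an illustrative Python sketch of a VectorData-like holder. It is not the Java record from this PR: the class name VectorDataSketch, its fields, and the exact bounds check (whole numbers within -128..127) are assumptions made for the example. Whether a hex-provided vector should convert to float at all is still under discussion; the sketch simply allows it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VectorDataSketch:
    """Holds either a float vector or a byte vector (illustration only)."""
    float_vector: Optional[List[float]] = None
    byte_vector: Optional[List[int]] = None

    @classmethod
    def parse(cls, value):
        # Hex strings are loaded directly as signed bytes ...
        if isinstance(value, str):
            raw = bytes.fromhex(value)
            return cls(byte_vector=[b - 256 if b > 127 else b for b in raw])
        # ... everything else is parsed as a float vector.
        return cls(float_vector=[float(v) for v in value])

    def as_float_vector(self) -> List[float]:
        if self.float_vector is not None:
            return list(self.float_vector)
        return [float(b) for b in self.byte_vector]

    def as_byte_vector(self) -> List[int]:
        if self.byte_vector is not None:
            return list(self.byte_vector)
        result = []
        for v in self.float_vector:
            # Assumed bounds check: must be a whole number in -128..127.
            if not -128 <= v <= 127 or v != int(v):
                raise ValueError(f"value {v} is not a valid byte element")
            result.append(int(v))
        return result


if __name__ == "__main__":
    print(VectorDataSketch.parse("400ae2").as_byte_vector())
    print(VectorDataSketch.parse([64, 10, -30]).as_byte_vector())
```

Keeping a single holder type lets callers such as createKnnQuery ask for whichever representation the target field needs, with the failure surfacing only when a float vector genuinely cannot be expressed as bytes.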

Things under discussion:

  • whether we want to support converting to float when a user has provided a hex vector (the desired backwards compatibility for this also has to be considered)
  • serialization to older nodes (this has been addressed in the latest commit, but very much depends on the decision on the point above)

pmpailis avatar Feb 23 '24 09:02 pmpailis

@elasticmachine update branch

pmpailis avatar Mar 11 '24 08:03 pmpailis

run elasticsearch-ci/part-1

pmpailis avatar Mar 11 '24 09:03 pmpailis

@elasticmachine update branch

pmpailis avatar Mar 11 '24 10:03 pmpailis

@elasticmachine update branch

pmpailis avatar Mar 12 '24 09:03 pmpailis

run elasticsearch-ci/part-1

pmpailis avatar Mar 12 '24 10:03 pmpailis

The following tests are currently failing, most likely as a side-effect of another test (LoggerTests) updating the log-level for the root logger.

Tests with failures:
 - org.elasticsearch.snapshots.SnapshotResiliencyTests.testIndexNotFoundExceptionLogging
 - org.elasticsearch.snapshots.SnapshotResiliencyTests.testFullSnapshotUnassignedShards
 - org.elasticsearch.snapshots.SnapshotResiliencyTests.testIllegalArgumentExceptionLogging
 - org.elasticsearch.snapshots.SnapshotResiliencyTests.testSnapshotNameAlreadyInUseExceptionLogging

Once this PR is merged, we can proceed with merging this one as well.

pmpailis avatar Mar 12 '24 16:03 pmpailis

Thanks everyone for the thorough reviews and the discussions ❤️

pmpailis avatar Mar 13 '24 07:03 pmpailis