
Track segment information in (mixed) dense vector search

Open tteofili opened this issue 10 months ago • 4 comments

Related to #106591, a good point was raised: if there are bugs or concerns about a given KNN query running against a "mixed" set of segments (e.g. partly flat and partly hnsw), it would be hard to debug where the problem comes from. To this end, it would be useful to have some way to track segment info in this context and, for example, be able to relate failures / warnings / slowness to specific segments.

tteofili avatar Mar 29 '24 14:03 tteofili

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine avatar Mar 29 '24 14:03 elasticsearchmachine

One thing we could do is start by adding information from Lucene's SegmentInfo#codec within the ES Engine class, so that the Index Segments API exposes which kinds of underlying data structures (including KnnVectorFormat) are used within each segment.

tteofili avatar Apr 05 '24 10:04 tteofili

Another option is to enable tracking vector formats in AbstractKnnVectorQuery#explain so that the Explanation also contains the per-doc vector format. This would help in situations where mappings have been updated (e.g. from hnsw to int8_hnsw) but most of the knn query results still come from segments with pre-update formats.
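To make the debugging scenario concrete, here is a minimal, self-contained sketch of the kind of breakdown such tracking would enable: counting how many top-k hits come from each vector format. It does not use Lucene or Elasticsearch APIs; the `Hit` record, the `countHitsByFormat` helper, and the segment names are all hypothetical and only stand in for whatever the Explanation would eventually carry.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HitFormatBreakdown {
    /** Hypothetical stand-in for a kNN hit plus the name of the segment it came from. */
    record Hit(int docId, String segmentName) {}

    /**
     * Counts hits per vector format. Segments missing from the map are
     * reported under "unknown" rather than silently dropped.
     */
    static Map<String, Integer> countHitsByFormat(List<Hit> hits, Map<String, String> formatBySegment) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys for stable output
        for (Hit hit : hits) {
            String format = formatBySegment.getOrDefault(hit.segmentName(), "unknown");
            counts.merge(format, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed state: two older segments still use hnsw, one newer segment uses int8_hnsw.
        Map<String, String> formatBySegment = Map.of("_0", "hnsw", "_1", "hnsw", "_2", "int8_hnsw");
        List<Hit> hits = List.of(new Hit(1, "_0"), new Hit(7, "_1"), new Hit(9, "_2"));
        System.out.println(countHitsByFormat(hits, formatBySegment));
    }
}
```

If most hits land under the pre-update format, that points at the old segments rather than at the new mapping.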

tteofili avatar Apr 22 '24 11:04 tteofili

In addition to the per-field KnnVectorFormat information recorded on the ES side (from mappings), Lucene can provide the actual per-segment, per-field KnnVectorFormat (read from the segments), see PR.

Update: this PR supersedes the Lucene one, as what we need is already available in FieldInfo.
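As a rough illustration of how per-segment field attributes could be compared against the format expected from the mappings, the sketch below models each segment's field attributes as a plain map. It does not call Lucene; the `FORMAT_KEY` name, the `segmentsNotUsing` helper, and the format values are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class StaleFormatSegments {
    // Assumed attribute key under which the per-field vector format name is stored.
    static final String FORMAT_KEY = "PerFieldKnnVectorsFormat.format";

    /**
     * Returns, in sorted order, the names of segments whose recorded format for
     * the field differs from the expected one (or is missing altogether).
     */
    static List<String> segmentsNotUsing(String expectedFormat,
                                         Map<String, Map<String, String>> fieldAttributesBySegment) {
        List<String> stale = new ArrayList<>();
        // TreeMap copy gives a deterministic, name-ordered iteration.
        for (Map.Entry<String, Map<String, String>> e : new TreeMap<>(fieldAttributesBySegment).entrySet()) {
            String actual = e.getValue().get(FORMAT_KEY);
            if (!expectedFormat.equals(actual)) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // Assumed state after a mapping update from hnsw to int8_hnsw.
        Map<String, Map<String, String>> attrs = Map.of(
            "_0", Map.of(FORMAT_KEY, "hnsw"),
            "_1", Map.of(FORMAT_KEY, "int8_hnsw"));
        System.out.println(segmentsNotUsing("int8_hnsw", attrs));
    }
}
```

Segments returned by such a check are the ones still holding pre-update data, which is where a mixed-format query would be worth inspecting first.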

tteofili avatar May 14 '24 14:05 tteofili