
Low precision numbers reported for filtering dataset with Opensearch

Open igniting opened this issue 7 months ago • 3 comments

I was investigating why the precision reported for OpenSearch on the arxiv-titles-384-angular-filters dataset is still low, even after efficient filtering was introduced in https://github.com/qdrant/vector-db-benchmark/pull/167. I found two issues:

  1. We were not ingesting all vectors into OpenSearch because of an id type mismatch in the dataset: the payload's id field appears both as a number and as a string. OpenSearch first inferred id to be numeric and then rejected the documents that have string ids. Sample payloads:
589331:{"id":1501.02582,"submitter":"Stefano Mancini","authors":"Evgeny V. Shchukin and Stefano Mancini","title":"Quantum tomography and nonlocality","comments":"23 pages, 2 figures, contribution to the special issue of Physica\n  Scripta celebrating 150 years of Margarita and Vladimir I. Man'ko","journal-ref":"Physica Scripta 90 (2015) 074019","doi":null,"report-no":null,"categories":"quant-ph math-ph math.MP","license":"http:\/\/arxiv.org\/licenses\/nonexclusive-distrib\/1.0\/","abstract":"  We present a tomographic approach to the study of quantum nonlocality in\nmultipartite systems. Bell inequalities for tomograms belonging to a generic\ntomographic scheme are derived by exploiting tools from convex geometry. Then,\npossible violations of these inequalities are discussed in specific tomographic\nrealizations providing some explicit examples.\n","versions":[{"version":"v1","created":"Mon, 12 Jan 2015 09:44:37 GMT"}],"update_date":"2015-06-23","authors_parsed":[["Shchukin","Evgeny V.",""],["Mancini","Stefano",""]],"update_date_ts":1435017600,"labels":["quant-ph","math-ph","math.MP"]}
1889775:{"id":"cs\/0607036","submitter":"Stefano Mancini","authors":"Roland Hildebrand, Stefano Mancini, and Simone Severini","title":"Combinatorial laplacians and positivity under partial transpose","comments":"19 pages, 7 eps figures, final version accepted for publication in\n  Math. Struct. in Comp. Sci","journal-ref":"Math. Struct. in Comp. Sci. Vol.18, pp.205-219 (2008)","doi":null,"report-no":null,"categories":"cs.CC quant-ph","license":null,"abstract":"  Density matrices of graphs are combinatorial laplacians normalized to have\ntrace one (Braunstein \\emph{et al.} \\emph{Phys. Rev. A,} \\textbf{73}:1, 012320\n(2006)). If the vertices of a graph are arranged as an array, then its density\nmatrix carries a block structure with respect to which properties such as\nseparability can be considered. We prove that the so-called degree-criterion,\nwhich was conjectured to be necessary and sufficient for separability of\ndensity matrices of graphs, is equivalent to the PPT-criterion. As such it is\nnot sufficient for testing the separability of density matrices of graphs (we\nprovide an explicit example). Nonetheless, we prove the sufficiency when one of\nthe array dimensions has length two (for an alternative proof see Wu,\n\\emph{Phys. Lett. A}\\textbf{351} (2006), no. 1-2, 18--22).\n  Finally we derive a rational upper bound on the concurrence of density\nmatrices of graphs and show that this bound is exact for graphs on four\nvertices.\n","versions":[{"version":"v1","created":"Mon, 10 Jul 2006 10:38:38 GMT"},{"version":"v2","created":"Mon, 13 Nov 2006 15:47:38 GMT"},{"version":"v3","created":"Sat, 23 Jun 2007 07:11:25 GMT"}],"update_date":"2008-07-03","authors_parsed":[["Hildebrand","Roland",""],["Mancini","Stefano",""],["Severini","Simone",""]],"update_date_ts":1215043200,"labels":["cs.CC","quant-ph"]}
  2. For range queries we add gte: null and lte: null to the filters. This causes more documents to match than intended, as the two counts below show (first with the null bounds, then without):
curl -H "Content-Type: application/json" "http://localhost:9200/bench/_count" --data '{"query": {"bool": {"must": [{"range": {"update_date_ts": {"lt": 1453451777, "gt": 1336998172, "lte": null, "gte": null}}}]}}}'
{"count":2138591,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}

curl -H "Content-Type: application/json" "http://localhost:9200/bench/_count" --data '{"query": {"bool": {"must": [{"range": {"update_date_ts": {"lt": 1453451777, "gt": 1336998172}}}]}}}'
{"count":420379,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}
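
One way to avoid sending the null bounds shown in the range queries above is to drop every unset bound before building the clause. The following is only a sketch: `build_range_filter` is a hypothetical helper name, not a function from vector-db-benchmark, and the actual fix may look different.

```python
from typing import Any, Dict, Optional


def build_range_filter(field: str, bounds: Dict[str, Optional[float]]) -> Dict[str, Any]:
    """Build a range clause, omitting every bound whose value is None.

    Hypothetical helper for illustration; not the benchmark's actual code.
    """
    clean = {op: value for op, value in bounds.items() if value is not None}
    return {"range": {field: clean}}


# Same bounds as the first curl request above, with lte/gte left unset.
clause = build_range_filter(
    "update_date_ts",
    {"lt": 1453451777, "gt": 1336998172, "lte": None, "gte": None},
)
query = {"query": {"bool": {"must": [clause]}}}
```

Here `query` has the same body as the second curl request, the one that returned the smaller count.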

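For the id type mismatch, one possible workaround is to coerce every id to a string before indexing, so that all documents agree on a single type. This is a minimal sketch under that assumption: `normalize_payload_id` is a hypothetical helper, and defining an explicit string mapping for id on the index would be an alternative.

```python
from typing import Any, Dict


def normalize_payload_id(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload whose id, if present, is a string.

    Hypothetical helper for illustration; not the benchmark's actual code.
    """
    fixed = dict(payload)
    if fixed.get("id") is not None:
        fixed["id"] = str(fixed["id"])
    return fixed


# The two id shapes from the sample payloads above.
numeric = normalize_payload_id({"id": 1501.02582, "labels": ["quant-ph"]})
textual = normalize_payload_id({"id": "cs/0607036", "labels": ["cs.CC"]})
```

This only makes the type consistent. An id that was already parsed as a number has lost any trailing zeros, so the conversion cannot restore the original identifier text.
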
After I fixed these issues for OpenSearch, the reported precision jumped from 61% to 98%, which makes much more sense.

igniting avatar Jul 23 '24 11:07 igniting