vespa
Sorting giving incorrect results
Describe the bug
Query1: The below gives 4000 lexid sorted by id field as expected
{ "hits" : 0, "model.searchPath" : "/0", "yql" : "select '[docid]' from sources * where !(range( myDate_t, -Infinity, Infinity )) AND (range(date,960892397000,1085589357000)) order by '[docid]' asc limit 4000 offset 0", "timeout" : "120s" }
Query2: The below gives 3000 lexid sorted by id field
{ "hits" : 0, "model.searchPath" : "/0", "yql" : "select '[docid]' from sources * where !(range( myDate_t, -Infinity, Infinity )) AND (range(date,960892397000,1085589357000)) order by '[docid]' asc limit 3000 offset 0", "timeout" : "120s" }
Below are the observations that we don't expect with the default top-k-probability:
- IDs ranked 1-2855 in the output of Query1 have the same rank in Query2
- IDs ranked 2896-3027 in the output of Query1 are ranked 2856-2887 in Query2
- The ID ranked 3082 in the output of Query1 is ranked 2888 in Query2
- IDs ranked 3096-3107 in the output of Query1 are ranked 2889-3000 in Query2
Expected behavior
Ranking should be nearly the same for both queries.
Environment (please complete the following information):
- Rhel8
- Podman
Vespa version 8.221.29
Is this repeatable? Is coverage 100% in both cases? Could you try with top-k-probability set to 1.0?
How do I set it to 1.0 without setting any value for max-hits-per-partition?
<tuning><dispatch><top-k-probability>1.0</top-k-probability></dispatch><searchnode>....
Invalid XML according to XML schema, error in services.xml: element "top-k-probability" not allowed here; expected the element end-tag or element "max-hits-per-partition" [98:40]
<tuning><dispatch><max-hits-per-partition /><top-k-probability>1.0</top-k-probability></dispatch><searchnode>....
character content of element "max-hits-per-partition" invalid; must be an integer
@bratseth Coverage is 100%. I have shared the response with trace level with you over a secure channel. Also, I am unable to set top-k-probability without max-hits-per-partition, see the comment above.
This works just fine:
<tuning>
<dispatch>
<top-k-probability>1.0</top-k-probability>
</dispatch>
</tuning>
@nehajatav Could you provide the output of the following command? The utility must be executed on a container node.
vespa-get-config -n vespa.config.search.dispatch -i feed/component/dispatcher.<insert content cluster name here> | grep topKProbability
There is also a slightly different total document count in the two dumps: Total documents "3k": 31056000 Total documents "4k": 31056022
The dumps provided indicate that the top-k setting has not been correctly propagated. There is a slight skew in the distribution of hits, with node 2 reporting more hits than nodes 0 and 1. The slight change in ordering was caused by additional hits from node 2 that sort lexically lower than the highest hit in the 3k dump.
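To illustrate the effect described above, here is a toy model in Python. It is only a sketch and not Vespa's actual dispatch logic: `per_node_limit` is a hypothetical stand-in for the number of hits each node is asked for when top-k-probability is below 1.0, and the three-node layout with node 2 holding half of the hits is an assumption made for the example.

```python
def merged_top_k(node_hits, k, per_node_limit):
    """Take the first per_node_limit sorted hits from each node, merge, keep the global top k."""
    candidates = []
    for hits in node_hits:
        candidates.extend(sorted(hits)[:per_node_limit])
    return sorted(candidates)[:k]


def first_divergence(a, b):
    """Return the first index where the two result lists differ, or None if none is found."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


# Hypothetical skewed distribution: node 2 holds twice as many hits as node 0 or node 1.
all_ids = range(6000)
nodes = [
    [i for i in all_ids if i % 4 == 0],       # node 0
    [i for i in all_ids if i % 4 == 1],       # node 1
    [i for i in all_ids if i % 4 in (2, 3)],  # node 2
]

# Every node returns k hits (the analogue of top-k-probability 1.0): exact result.
exact_3k = merged_top_k(nodes, k=3000, per_node_limit=3000)

# Each node returns only slightly more than k/3 hits; the limits are made up for the example.
approx_3k = merged_top_k(nodes, k=3000, per_node_limit=1100)
approx_4k = merged_top_k(nodes, k=4000, per_node_limit=1450)

print(first_divergence(exact_3k, approx_3k))
print(first_divergence(approx_4k[:3000], approx_3k))
```

In this model the larger query asks each node for more hits, so node 2 contributes hits that the smaller query never sees. The two result lists share a common prefix and then diverge, which is the same pattern as in the reported rankings.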
@bjorncs the total count may be due to increasing docs in the cluster
@bratseth was able to push top-k 1.0 but still the same result
This is the result even after convergence across all nodes with top k set to 1.0
[vespa@vespa-container-03 /]$ vespa-get-config -n vespa.config.search.dispatch -i feed/component/dispatcher.
@nehajatav The command you listed does not include the content cluster name as suffix to config id.
$ vespa-get-config -n vespa.config.search.dispatch -i feed/component/dispatcher. |grep topKProbability
You can use vespa-configproxy-cmd to determine the available config instances at a node:
$ vespa-configproxy-cmd | grep "feed/component/dispatcher"
Use the output to determine the exact arguments to vespa-get-config.
If the config still contains 0.9999, the change to services.xml has not been applied.