NDCG@10 CUT scorer inconsistent when only 1 high-scoring result
Describe the bug
Using the NDCG@10 CUT scorer, for queries where we have only 1 result and that result is rated 3, we sometimes get an NDCG@10 CUT score of 0.06, sometimes 0.22, and sometimes 1.00.
Expected behavior
Not sure what NDCG@10 should actually return when fewer than 10 results come back (recall < 10), so not sure whether we should return 1 or 0.22!
@david-fisher could you take a look?
We think this is expected, because the scorer takes into account other judged results that aren't being shown: the ideal ranking is built from every rated document for the query, so the same single result can score differently depending on how many other rated documents exist. On David's advice I created a scorer that calculates NDCG@n where n = min(10, numFound()), and that returns more consistent results for n<10. Of course, neither NDCG nor this variant is a great metric in this case, as a search returning only 1 result does 'better' than one returning 10 results.
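To make the effect concrete, here is a minimal Python sketch, not the actual scorer code. It assumes the common exponential gain `2^grade - 1` with a `log2(rank + 1)` discount, and it assumes the ideal ranking is built from the whole pool of judged documents. The names `dcg`, `ndcg_at_k`, `ndcg_at_min_k`, and `judged_pool` are hypothetical, and the ratings in the usage lines are made up for illustration.

```python
import math


def dcg(grades, k):
    """Discounted cumulative gain over the first k grades (assumed gain: 2^g - 1)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def ndcg_at_k(returned, judged_pool, k=10):
    """NDCG@k where the ideal list comes from ALL judged docs, shown or not."""
    ideal = sorted(judged_pool, reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(returned, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def ndcg_at_min_k(returned, judged_pool, num_found, k=10):
    """Variant from the thread: cut both lists at n = min(k, num_found)."""
    n = min(k, num_found)
    return ndcg_at_k(returned, judged_pool, n)


# One returned result rated 3; the pool of judged docs differs per query.
only_result = [3]
small_pool = [3]              # nothing else has been rated
large_pool = [3, 3, 3, 2, 1]  # other rated docs exist but were not returned

score_small = ndcg_at_k(only_result, small_pool)
score_large = ndcg_at_k(only_result, large_pool)
score_variant = ndcg_at_min_k(only_result, large_pool, num_found=1)
```

Under these assumptions `score_small` is 1.0, `score_large` falls well below 1.0 because the ideal DCG includes the unreturned rated documents, and `score_variant` returns to 1.0 because both lists are cut at rank 1.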
I again note, nDCG is Satan's lollipop, do not lick it... While it is an excellent aggregate metric for comparing two competing systems, and it models cumulative information gathering, it has flaws, as Ellen noted in the Slack convo.
For most commerce-type searches, the first good result is a good model of user need. Once you have one pair of Jimmy Choos, you rarely want another (at the same time). Now there might be five different Jimmy Choos that would all satisfy, but any one of them at rank one is likely sufficient.
ERR (Expected Reciprocal Rank) provides a better metric for those cases, as it captures the first-past-the-post notion.
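For reference, a minimal Python sketch of ERR using the usual cascade formulation: the chance that a result satisfies the user is taken as `(2^grade - 1) / 2^max_grade`, and each rank is credited `1/rank` weighted by the chance the user got that far unsatisfied. The function name `err`, the `max_grade=3` default, and the example grades are assumptions for illustration, not the scorer discussed in this thread.

```python
def err(grades, max_grade=3, k=10):
    """Expected Reciprocal Rank over the first k graded results."""
    p_reach = 1.0  # probability the user reaches this rank still unsatisfied
    score = 0.0
    for rank, grade in enumerate(grades[:k], start=1):
        p_satisfied = (2 ** grade - 1) / 2 ** max_grade
        score += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return score


one_good = err([3])          # a single top-rated result at rank 1
two_good = err([3, 3])       # a second top-rated result adds little
good_at_two = err([0, 3])    # the first good result only shows up at rank 2
```

With these grades, `one_good` is 0.875, `two_good` is only slightly higher at about 0.93, and `good_at_two` drops to 0.4375, which matches the intuition that the first satisfying result carries most of the value.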