luceneutil Search results not identical after rearranging

Main part of my localrun.py:

  index1 = comp.newIndex("lucene_baseline", sourceData,
                         analyzer='StandardAnalyzerNoStopWords',
                         postingsFormat='Lucene90',
                         idFieldPostingsFormat='Lucene90',
                         mergePolicy='TieredMergePolicy',
                         facets = (('taxonomy:Date', 'Date'),
                                   ('taxonomy:Month', 'Month'),
                                   ('taxonomy:DayOfYear', 'DayOfYear'),
                                   ('sortedset:Month', 'Month'),
                                   ('sortedset:DayOfYear', 'DayOfYear')),
                         useCMS=True,
                         numThreads=4,
                         maxConcurrentMerges=12,
                         rearrange=555,
                         addDVFields=True)

  index2 = comp.newIndex("lucene_candidate", sourceData,
                         analyzer='StandardAnalyzerNoStopWords',
                         postingsFormat='Lucene90',
                         idFieldPostingsFormat='Lucene90',
                         mergePolicy='TieredMergePolicy',
                         facets = (('taxonomy:Date', 'Date'),
                                   ('taxonomy:Month', 'Month'),
                                   ('taxonomy:DayOfYear', 'DayOfYear'),
                                   ('sortedset:Month', 'Month'),
                                   ('sortedset:DayOfYear', 'DayOfYear')),
                         useCMS=True,
                         numThreads=4,
                         maxConcurrentMerges=12,
                         rearrange=555,
                         addDVFields=True)

  comp.competitor('baseline', 'lucene_baseline',
                  index = index1, concurrentSearches = False)

  comp.competitor('my_modified_version', 'lucene_candidate',
                  index = index2, concurrentSearches = False)

The Lucene code is identical for both competitors (the two checkouts are just symlinks), but I get the errors below when running python src/python/localrun.py -source wikimedium10k:

WARNING: cat=OrHighHigh: hit counts differ: 1387+ vs 1379+
Traceback (most recent call last):
  File "src/python/localrun.py", line 95, in <module>
    comp.benchmark("baseline_vs_patch")
  File "/Users/haoyzhai/Documents/lucene-home/luceneutil/src/python/competition.py", line 457, in benchmark
    searchBench.run(id, base, challenger,
  File "/Users/haoyzhai/Documents/lucene-home/luceneutil/src/python/searchBench.py", line 196, in run
    raise RuntimeError('errors occurred: %s' % str(cmpDiffs))
RuntimeError: errors occurred: ([], ["query=body:act filter=None sort=None groupField=None hitCount=136: hit 1 has wrong id/s ([909, 1226], '3.3034647') vs ([909, 1227], '3.3034647')", 'query=body:"jpg thumb"~4 filter=None sort=None groupField=None hitCount=299: hit 4 has wrong id/s ([1095, 3859, 4759], \'3.3647363\') vs ([1093, 3859, 4759], \'3.3647363\')', "query=+body:can +body:companies filter=None sort=None groupField=None hitCount=6: hit 0 has wrong id/s ([2412], '3.627793') vs ([2417], '3.627793')", 'query=body:"1 ref"~4 filter=None sort=None groupField=None hitCount=377: hit 5 has wrong id/s ([2513, 7085], \'1.2669312\') vs ([2513, 7083], \'1.2669312\')', "query=body:1890s~1 filter=None sort=None groupField=None hitCount=203: hit 4 has wrong id/s ([1203, 4008, 4952], '3.5986476') vs ([1203, 4007, 4950], '3.5986476')", 'query=body:"needed date" filter=None sort=None groupField=None hitCount=317: hit 1 has wrong id/s ([9297], \'3.883874\') vs ([9299], \'3.883874\')', "query=body:MAXWIDTH/10(ORDERED(2011,ref)) filter=None sort=None groupField=None hitCount=624: hit 0 has wrong id/s ([5867], '0.8888889') vs ([5868], '0.8888889')", "query=+body:after +body:may filter=None sort=None groupField=None hitCount=139: hit 6 has wrong id/s ([8672], '2.43158') vs ([8673], '2.43158')", "query=body:2009 body:listened filter=None sort=None groupField=None hitCount=1014+: hit 4 has wrong id/s ([1572], '1.9808738') vs ([1570], '1.9808738')", 'query=body:year body:since filter=None sort=None groupField=None hitCount=1387+: wrong hitCount: 1387+ vs 1379+', 'query=body:"make up"~4 filter=None sort=None groupField=None hitCount=37: hit 1 has wrong id/s ([3421, 9840], \'2.9576194\') vs ([3421, 9842], \'2.9576194\')', "query=spanNear([body:book_result, body:ct], 10, true) filter=None sort=None groupField=None hitCount=39: hit 0 has wrong id/s ([8729], '7.6768456') vs ([8730], '7.6768456')", "query=body:bad filter=None sort=None groupField=None hitCount=48: hit 0 has wrong id/s 
([9342], '4.06048') vs ([9344], '4.06048')", "query=body:2006 filter=None sort=DayOfYear groupField=None hitCount=769: hit 0 has wrong id/s ([9812], '52') vs ([9814], '52')", 'query=body:"left thumb" filter=None sort=None groupField=None hitCount=35: hit 0 has wrong id/s ([3636], \'3.0522845\') vs ([3637], \'3.0522845\')', "query=spanNear([body:soon, body:after], 10, true) filter=None sort=None groupField=None hitCount=22: hit 0 has wrong id/s ([8275], '3.235054') vs ([8274], '3.235054')", "query=body:8 body:different filter=None sort=None groupField=None hitCount=1022+: hit 6 has wrong id/s ([5364, 8200], '2.7650805') vs ([5318, 8199], '2.7650805')", "query=body:marek~2 filter=None sort=None groupField=None hitCount=1004+: hit 3 has wrong id/s ([8037], '2.1000693') vs ([8038], '2.1000693')", 'query=body:"per year" filter=None sort=None groupField=None hitCount=40: hit 1 has wrong id/s ([3736], \'3.3056784\') vs ([3737], \'3.3056784\')', "query=spanNear([body:new, body:york], 10, true) filter=None sort=None groupField=None hitCount=574: hit 0 has wrong id/s ([1214], '3.8572712') vs ([1215], '3.8572712')"], 1.0)

Notice that most of the errors have ids that are off by only 1 or 2 while the scores match exactly, for example: "query=spanNear([body:new, body:york], 10, true) filter=None sort=None groupField=None hitCount=574: hit 0 has wrong id/s ([1214], '3.8572712') vs ([1215], '3.8572712')"

I tried to debug a bit by making rearrange use only 1 thread (so that the segment order is fixed as well) and printing stats for the id field (min, max, and number of unique ids), and did not find anything abnormal. In brief, the segment doc counts are 1810, 1800 * 4, 180 * 5, and 18 * 5, and each segment contains the correct ids.
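As a quick sanity check on those segment sizes, the arithmetic can be written out as below. This is only a sketch: it assumes wikimedium10k holds 10,000 docs (as its name suggests) and reads rearrange=555 as 5 large, 5 medium, and 5 small segments, which is an interpretation and not something confirmed in this thread.

```python
# Segment sizes reported above, as (docs per segment, number of such segments).
segments = [(1810, 1), (1800, 4), (180, 5), (18, 5)]

total_docs = sum(size * n for size, n in segments)
num_segments = sum(n for _, n in segments)

print(total_docs, num_segments)  # prints: 10000 15
```

So the segment layout itself accounts for every doc, which fits the observation that nothing looked abnormal at the segment level.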

Apr 19 '21 22:04 zhaih

Oh, I just realized we're allocating ids dynamically, so it is normal for them to differ from run to run.

Apr 19 '21 22:04 zhaih

OK, you are right! We use an AtomicInteger to pull the next ID, so when we index with multiple threads, different IDs are assigned.

But I'm baffled why this is not normally a problem for benchmarks that must reindex. Maybe such benchmarks must always use a single thread? (Which we are trying to fix here, yay!)

OK, I think we must indeed make the ID assignment deterministic as well: the first doc pulled from the source is docid=0, the next is docid=1, etc.
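To illustrate the difference, here is a minimal Python sketch, not luceneutil's actual indexing code. It imitates the shared AtomicInteger with a lock-protected counter and contrasts it with assigning the ID in source order before a doc is handed to an indexing thread. The function names and the dict standing in for an index are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import count


def index_with_shared_counter(docs, num_threads=4):
    """Each thread pulls the next ID when it indexes a doc, so the
    doc -> ID mapping depends on thread scheduling."""
    counter = count()
    lock = threading.Lock()
    assigned = {}

    def index_one(doc):
        with lock:  # stands in for AtomicInteger.getAndIncrement()
            assigned[doc] = next(counter)

    with ThreadPoolExecutor(num_threads) as pool:
        list(pool.map(index_one, docs))
    return assigned


def index_with_source_order_ids(docs, num_threads=4):
    """The ID is fixed when the doc is read from the source, so the
    mapping is the same no matter which thread indexes the doc."""
    lock = threading.Lock()
    assigned = {}

    def index_one(item):
        doc_id, doc = item
        with lock:  # protects the dict that stands in for the index
            assigned[doc] = doc_id

    with ThreadPoolExecutor(num_threads) as pool:
        # enumerate() pairs each doc with its position in the source.
        list(pool.map(index_one, enumerate(docs)))
    return assigned


docs = [f"doc{i}" for i in range(100)]
shared = index_with_shared_counter(docs)
ordered = index_with_source_order_ids(docs)
```

With the second variant, `ordered` always maps the i-th source doc to ID i. With the first, the set of IDs is the same but which doc gets which ID may vary between runs, which would explain identical scores paired with slightly different ids.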

Apr 20 '21 15:04 mikemccand

Here's a PR: https://github.com/mikemccand/luceneutil/pull/122

Apr 20 '21 16:04 mikemccand
