luceneutil icon indicating copy to clipboard operation
luceneutil copied to clipboard

PKLookup should sometimes mix in keys that don't exist

Open dsmiley opened this issue 6 years ago • 6 comments
trafficstars

The PKLookupTask appears to be an unsound benchmark. It chooses some number of IDs purely at random. Much of the time, the IDs will be purely before or after the actual IDs in the indexed field, and will not likely match anything at all. This is not realistic!

Furthermore, there is some bug that insists PKLookup is run even when the Competition is configured with task patterns that match other tasks. The expectation is that only the tasks matching the configured patters is run.

dsmiley avatar Mar 11 '19 20:03 dsmiley

Hmm the IDs that it chooses to look up should all exist in the index. If anything, we should be doing the opposite -- mixing in some IDs that do not exist, for a more balanced test.

There is a pk boolean (default True) that you pass to Competitor that you can use to turn on/off the PKLookup task.

mikemccand avatar Mar 12 '19 16:03 mikemccand

IDs are taken at random: https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/PKLookupTask.java#L60 I agree that a realistic tests should have some IDs not match; some before & after the range of actual IDs.

Thanks for pointing out the 'pk' boolean! I wish it didn't exist so that PKLookup needn't be special-cased from how other tasks are chosen.

dsmiley avatar Mar 13 '19 17:03 dsmiley

+1 to make PKLookup be a normal task so you can easily include/exclude it like normal.

And +1 to sometimes mix in IDs that don't exist.

mikemccand avatar Mar 17 '19 20:03 mikemccand

Am I incorrect that it chooses IDs to query at random and thus is not likely to choose any IDs that are actually indexed? Up above, I referenced a line from the file.

dsmiley avatar Apr 03 '19 20:04 dsmiley

Yeah so that line is randomly picking an ID that should always exist in the index. Here's where we index the IDs: https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/LineFileDocs.java#L521-L533

We atomically increment a counter, up to maxDoc, and convert that int to a string form for the ID.

So then at search time we pick a random int up to maxDoc and add it to the set of IDs to lookup.

So every one of them should exist in the index -- were you seeing evidence otherwise?

mikemccand avatar Apr 03 '19 20:04 mikemccand

Doh! Thanks for clarifying what's going on; I was very confused by this. That line of code looked different to me. I temporarily added a bit more logging in printResults to confirm all IDs match.

Earlier I was twiddling with an optimization in UniformSplit's seekExact in which it could return false early if the input ID is beyond the last term -- a bit of metadata known in advance. The benchmarks will tilt massively back & forth (+42% with this check, -6% without this check) based on wether this is here or not, which led me to think perhaps most IDs were not actually found. I know this theory is wrong now. https://github.com/apache/lucene-solr/pull/633/files#diff-0c002cbd943b5ddc5048533f4b1b3e34R177

Perhaps the explanation is that this benchmark indexes documents with increasing IDs (sorted) and so this optimization will early-terminate segments easily?

dsmiley avatar Apr 25 '19 09:04 dsmiley