Add an option to disable BMW optimization for benchmarks

Open shubhamvishu opened this issue 1 year ago • 6 comments

Description

  • Currently, there is no straightforward way (that I know of) to disable BMW in the Lucene benchmarks, which could be required when testing some optimizations like #258. It'd be great if we could add an option/argument to disable BMW while benchmarking.

  • One idea could be to increase TOTAL_HITS_THRESHOLD in IndexSearcher.java to Integer.MAX_VALUE. Maybe we could add a setter for it?

Looking for more ideas on this!

shubhamvishu avatar Apr 29 '24 08:04 shubhamvishu
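
To make the threshold idea concrete, here is a toy Python model of how a total-hits threshold can gate pruning. It is not Lucene code: `collect_top_k` and its parameters are hypothetical names, and the real collector skips whole blocks of postings using block-max scores rather than inspecting each hit. The sketch only shows the general effect that, once the threshold is effectively unbounded, no hit is ever skipped.

```python
import heapq
import sys


def collect_top_k(scores, k, total_hits_threshold):
    """Toy model (not Lucene): return (top_k_scores, collected_count).

    Hits are counted exactly until `total_hits_threshold` is reached. After
    that, a hit that cannot beat the current k-th best score is skipped.
    """
    heap = []       # min-heap holding the k best scores seen so far
    collected = 0   # hits that were fully collected (not skipped)
    for score in scores:
        pruning_active = collected >= total_hits_threshold and len(heap) == k
        if pruning_active and score <= heap[0]:
            continue  # non-competitive hit: skipped
        collected += 1
        if len(heap) < k:
            heapq.heappush(heap, score)
        elif score > heap[0]:
            heapq.heapreplace(heap, score)
    return sorted(heap, reverse=True), collected


if __name__ == "__main__":
    scores = [5, 1, 4, 2, 3, 9, 1, 1, 8]
    print(collect_top_k(scores, k=2, total_hits_threshold=3))
    print(collect_top_k(scores, k=2, total_hits_threshold=sys.maxsize))
```

With a small threshold the model collects only some of the hits, while `sys.maxsize` forces every hit to be collected; both runs return the same top-k scores.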

Could you use tasks where dynamic pruning doesn't apply instead of disabling it? E.g. use counting tasks?

jpountz avatar Apr 29 '24 08:04 jpountz

Could you use tasks where dynamic pruning doesn't apply instead of disabling it? E.g. use counting tasks?

+1, that's a nice approach. Though even Lucene's count() API has some nice optimizations to bypass visiting all postings / sub-linear implementations I think?

mikemccand avatar Apr 29 '24 13:04 mikemccand

Indeed IndexSearcher#count has some optimizations to bypass postings. But it was mostly an example, some cheap faceting should work too?

jpountz avatar Apr 29 '24 13:04 jpountz

Could you use tasks where dynamic pruning doesn't apply instead of disabling it? E.g. use counting tasks?

Do you mean to wrap the clauses with "count( )" like eg https://github.com/mikemccand/luceneutil/blob/master/tasks/countOnly.tasks so that we check the performance but avoid BMW? I like this idea if I understand correctly. But not sure if we could make it an option with benchmarks straightforwardly.

 

Indeed IndexSearcher#count has some optimizations to bypass postings. But it was mostly an example, some cheap faceting should work too?

I'm not sure what you mean by using some cheap faceting here. Maybe you could elaborate on this idea? Also, since we want to enable it via benchmarks, does this also fit well in that picture?

shubhamvishu avatar Apr 29 '24 14:04 shubhamvishu

Indeed IndexSearcher#count has some optimizations to bypass postings. But it was mostly an example, some cheap faceting should work too?

I'm not sure what you mean by using some cheap faceting here. Maybe you could elaborate on this idea? Also, since we want to enable it via benchmarks, does this also fit well in that picture?

I think @jpountz is referring to enabling faceting on each task. luceneutil's TaskParser supports this with e.g. +facets:Date.sortedset. Because facets require counting all hits, it forces Lucene to disable BMW. The problem is, it also adds some cost (I think that's why @jpountz suggested finding a "cheap" one heh), which is not great because it dilutes what you are trying to measure (a change in postings decode / visit time).

Could you use tasks where dynamic pruning doesn't apply instead of disabling it? E.g. use counting tasks?

Do you mean to wrap the clauses with "count( )" like eg https://github.com/mikemccand/luceneutil/blob/master/tasks/countOnly.tasks so that we check the performance but avoid BMW? I like this idea if I understand correctly. But not sure if we could make it an option with benchmarks straightforwardly.

luceneutil supports count tasks with syntax like count(+a +b). This is parsed to use IndexSearcher's count API. I think that may be a quick workaround for benchmarking https://github.com/mikemccand/luceneutil/pull/258

mikemccand avatar Apr 29 '24 14:04 mikemccand
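
As a rough illustration of the two workarounds described above, the following Python sketch rewrites lines of a tasks file so that each query is either wrapped in count( ) or given the +facets:Date.sortedset suffix. It is only a sketch under stated assumptions: it assumes each task line looks like `Category: query # optional comment`, the helper name `rewrite_task` is hypothetical and not part of luceneutil, and the exact position of the facets clause within the line is a guess based on the example in the comment above.

```python
# Facet clause taken from the comment above; where it belongs in the line is an assumption.
FACET_SUFFIX = "+facets:Date.sortedset"


def rewrite_task(line, mode):
    """Rewrite one task line (assumed format: 'Category: query # comment').

    mode="count"  wraps the query in count( ).
    mode="facets" appends the facet clause to the query.
    Blank lines, comment lines, and lines without ':' are returned unchanged.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#") or ":" not in stripped:
        return line.rstrip("\n")
    category, _, rest = stripped.partition(":")
    # Keep any trailing ' # ...' comment separate from the query text.
    query, sep, comment = rest.partition(" #")
    query = query.strip()
    if mode == "count":
        query = f"count({query})"
    elif mode == "facets":
        query = f"{query} {FACET_SUFFIX}"
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"{category}: {query}{sep}{comment}"


if __name__ == "__main__":
    for task in ["AndHighHigh: +a +b # freq=10", "# a comment line"]:
        print(rewrite_task(task, "count"))
        print(rewrite_task(task, "facets"))
```

Comparing runs on the original and the rewritten tasks file would then show how much of a measured change comes from postings decoding rather than from dynamic pruning, keeping in mind the caveats above: count() has its own shortcuts and faceting adds its own cost.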

Thanks for the explanation, Mike! I'll try benchmarking the change using count tasks and share the results. Btw, if the above-mentioned approach of maxing out IndexSearcher.TOTAL_HITS_THRESHOLD also makes sense, I have already shared the results for it over here.

shubhamvishu avatar Apr 29 '24 14:04 shubhamvishu