luceneutil Add tasks for multiple negated keywords and its optimized version

Open shubhamvishu opened this issue 1 year ago • 26 comments

Added 2 new tasks for multi-keyword negated queries, so their performance can be compared:

OrNegatedHighHigh : (-A -B -C -D)
OrNegatedHighHighOpt : -(A B C D)
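
As a rough illustration of why the two forms are expected to match the same documents, here is a toy set model in Python. It is not Lucene code: the doc IDs and postings are made up, and it assumes both forms are applied against the same candidate set (all documents here). It only shows the equivalence of the results and says nothing about how Lucene executes either form or why one is faster.

```python
# Toy model: each term maps to the set of doc IDs that contain it.
# ALL_DOCS and POSTINGS are hypothetical values chosen for illustration.
ALL_DOCS = set(range(10))
POSTINGS = {
    "A": {0, 1, 2},
    "B": {2, 3},
    "C": {5},
    "D": {5, 6, 7},
}


def negate_each(terms):
    """(-A -B -C -D): exclude the docs of each term, one exclusion per term."""
    remaining = set(ALL_DOCS)
    for term in terms:
        remaining -= POSTINGS[term]
    return remaining


def negate_union(terms):
    """-(A B C D): build the disjunction once, then exclude it in one step."""
    excluded = set().union(*(POSTINGS[t] for t in terms))
    return ALL_DOCS - excluded


terms = ["A", "B", "C", "D"]
# Both forms keep exactly the docs that contain none of the terms.
assert negate_each(terms) == negate_union(terms)
print(sorted(negate_each(terms)))
```

In this model the second form does a single exclusion against one combined set, which is the shape the Opt task expresses directly.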

UPDATE : See the comments below for updated results.

Benchmark results :

                            Task    QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
               OrNegatedHighHigh     3700.39      (4.5%)     3704.18      (4.6%)    0.1% (  -8% -    9%) 0.943
            OrNegatedHighHighOpt     8036.98      (7.8%)     8133.66      (6.6%)    1.2% ( -12% -   16%) 0.598

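Observation : OrNegatedHighHighOpt runs at about 2.17x the QPS of OrNegatedHighHigh (8036.98 vs. 3700.39 on baseline), i.e. it is more than 100% faster. Put the other way around, the current form is more than 50% slower than the optimized one.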
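
The percentages in the observation can be re-derived from the baseline QPS column. The helper names below are only for illustration and are not part of luceneutil:

```python
def pct_faster(fast_qps, slow_qps):
    """How much faster the fast task is, relative to the slow one, in percent."""
    return (fast_qps / slow_qps - 1.0) * 100.0


def pct_slower(slow_qps, fast_qps):
    """How much slower the slow task is, relative to the fast one, in percent."""
    return (1.0 - slow_qps / fast_qps) * 100.0


# Baseline QPS values taken from the benchmark table above.
or_negated = 3700.39      # OrNegatedHighHigh
or_negated_opt = 8036.98  # OrNegatedHighHighOpt

print(f"Opt is {pct_faster(or_negated_opt, or_negated):.1f}% faster")
print(f"Current form is {pct_slower(or_negated, or_negated_opt):.1f}% slower")
```

The two numbers differ because they use different denominators: "faster" is measured against the slow task's QPS, "slower" against the fast task's QPS.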

Full results :

                            Task    QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff p-value
            BrowseDateSSDVFacets        1.00     (10.3%)        0.98      (5.9%)   -2.3% ( -16% -   15%) 0.385
                      AndHighLow      606.52      (2.5%)      601.62      (3.2%)   -0.8% (  -6% -    5%) 0.373
           BrowseMonthTaxoFacets        8.06     (32.7%)        8.00     (32.1%)   -0.8% ( -49% -   95%) 0.940
                         Respell       34.40      (1.7%)       34.26      (1.9%)   -0.4% (  -3% -    3%) 0.494
                      HighPhrase      141.76      (4.1%)      141.25      (4.5%)   -0.4% (  -8% -    8%) 0.788
                      AndHighMed       43.20      (2.3%)       43.07      (2.4%)   -0.3% (  -4% -    4%) 0.682
                       OrHighMed       82.32      (3.1%)       82.09      (3.2%)   -0.3% (  -6% -    6%) 0.772
                       OrHighLow      419.44      (2.5%)      418.32      (2.5%)   -0.3% (  -5% -    4%) 0.733
                       LowPhrase       17.50      (3.0%)       17.45      (2.7%)   -0.3% (  -5% -    5%) 0.773
        AndHighHighDayTaxoFacets        6.50      (2.4%)        6.49      (2.1%)   -0.2% (  -4% -    4%) 0.738
         AndHighMedDayTaxoFacets       19.27      (1.8%)       19.24      (1.4%)   -0.2% (  -3% -    3%) 0.699
                     AndHighHigh       28.91      (4.2%)       28.88      (4.4%)   -0.1% (  -8% -    8%) 0.946
                      TermDTSort       97.46      (0.8%)       97.39      (0.9%)   -0.1% (  -1% -    1%) 0.785
                          Fuzzy1       60.18      (0.9%)       60.14      (0.9%)   -0.1% (  -1% -    1%) 0.856
                    HighSpanNear        4.07      (2.1%)        4.07      (1.9%)   -0.1% (  -3% -    4%) 0.936
           HighTermDayOfYearSort      209.79      (2.4%)      209.69      (2.9%)   -0.0% (  -5% -    5%) 0.954
                        PKLookup      119.60      (2.3%)      119.60      (1.5%)   -0.0% (  -3% -    3%) 0.995
                     MedSpanNear        8.10      (2.7%)        8.10      (2.6%)    0.0% (  -5% -    5%) 0.984
                      OrHighHigh       28.27      (7.0%)       28.28      (7.0%)    0.0% ( -13% -   15%) 0.990
     BrowseRandomLabelTaxoFacets        3.25      (0.4%)        3.25      (0.6%)    0.0% (   0% -    0%) 0.836
           BrowseMonthSSDVFacets        3.25      (1.2%)        3.26      (1.0%)    0.0% (  -2% -    2%) 0.923
                          IntNRQ       42.02      (0.8%)       42.04      (0.7%)    0.0% (  -1% -    1%) 0.865
            BrowseDateTaxoFacets        3.85      (0.4%)        3.85      (0.4%)    0.1% (   0% -    0%) 0.470
       BrowseDayOfYearTaxoFacets        3.91      (0.5%)        3.91      (0.4%)    0.1% (   0% -    1%) 0.518
               OrNegatedHighHigh     3700.39      (4.5%)     3704.18      (4.6%)    0.1% (  -8% -    9%) 0.943
                       MedPhrase       27.70      (3.0%)       27.73      (3.1%)    0.1% (  -5% -    6%) 0.901
                HighSloppyPhrase        4.31      (2.8%)        4.32      (2.6%)    0.1% (  -5% -    5%) 0.881
                         Prefix3      164.53      (1.6%)      164.76      (1.4%)    0.1% (  -2% -    3%) 0.772
                     LowSpanNear       23.95      (3.7%)       23.99      (3.7%)    0.2% (  -6% -    7%) 0.890
            HighTermTitleBDVSort        7.10      (2.9%)        7.11      (2.5%)    0.2% (  -5% -    5%) 0.845
                        Wildcard       57.24      (2.5%)       57.35      (3.2%)    0.2% (  -5% -    6%) 0.823
                          Fuzzy2       27.23      (1.7%)       27.29      (1.5%)    0.2% (  -2% -    3%) 0.654
            HighIntervalsOrdered        4.08      (2.4%)        4.09      (3.0%)    0.3% (  -4% -    5%) 0.735
                 LowSloppyPhrase       50.72      (2.9%)       50.88      (2.4%)    0.3% (  -4% -    5%) 0.721
       BrowseDayOfYearSSDVFacets        3.17      (1.5%)        3.18      (0.9%)    0.3% (  -2% -    2%) 0.422
             MedIntervalsOrdered        3.05      (1.7%)        3.06      (2.0%)    0.3% (  -3% -    4%) 0.586
               HighTermTitleSort      134.59      (1.9%)      135.05      (2.4%)    0.3% (  -3% -    4%) 0.625
                 MedSloppyPhrase        5.20      (3.1%)        5.22      (2.7%)    0.4% (  -5% -    6%) 0.676
                    OrNotHighLow      390.59      (2.0%)      392.65      (2.0%)    0.5% (  -3% -    4%) 0.410
            MedTermDayTaxoFacets       17.31      (3.5%)       17.41      (3.5%)    0.6% (  -6% -    7%) 0.617
             LowIntervalsOrdered        1.92      (3.0%)        1.93      (3.6%)    0.6% (  -5% -    7%) 0.599
          OrHighMedDayTaxoFacets        3.02      (4.4%)        3.04      (4.4%)    0.6% (  -7% -    9%) 0.658
                    OrNotHighMed      232.65      (2.1%)      234.52      (3.2%)    0.8% (  -4% -    6%) 0.348
                   OrHighNotHigh      272.03      (3.0%)      274.37      (4.0%)    0.9% (  -5% -    8%) 0.442
     BrowseRandomLabelSSDVFacets        2.29      (1.0%)        2.31      (5.1%)    1.0% (  -5% -    7%) 0.387
                         LowTerm      317.60      (2.9%)      320.85      (3.0%)    1.0% (  -4% -    7%) 0.279
                        HighTerm      268.36      (5.3%)      271.18      (5.1%)    1.0% (  -8% -   12%) 0.524
               HighTermMonthSort     2405.80      (3.7%)     2432.43      (4.0%)    1.1% (  -6% -    9%) 0.366
                   OrNotHighHigh      442.00      (3.2%)      447.16      (3.5%)    1.2% (  -5% -    8%) 0.268
            OrNegatedHighHighOpt     8036.98      (7.8%)     8133.66      (6.6%)    1.2% ( -12% -   16%) 0.598
                         MedTerm      389.70      (4.3%)      394.98      (4.3%)    1.4% (  -6% -   10%) 0.321
                    OrHighNotLow      265.18      (3.1%)      268.92      (5.0%)    1.4% (  -6% -    9%) 0.281
                    OrHighNotMed      260.05      (3.6%)      265.02      (4.4%)    1.9% (  -5% -   10%) 0.132

I'm working on fixing this with BQ.rewrite. I'll open an issue/PR in Lucene, either straight away or once I have something concrete. The tasks added in this PR can then be used to check whether the rewrite change makes queries with many negated keywords perform comparably to the optimized form.

@mikemccand : If this change looks good, maybe we could also add the negated query tasks to the other tasks files, not just the one for wikimediumall?

Mar 22 '24 12:03 shubhamvishu
