lucene Terminate automaton when it can match all suffixes, and match suffixes directly.

For PrefixQuery, we can terminate the automaton on current term if we have matched the whole prefix, and match this term directly. Furthermore, if there is a subBlock, we could match all its' sub terms.

Feb 05 '24 07:02 vsop-479

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

Feb 20 '24 00:02 github-actions[bot]

@jpountz Please take a look when you get a chance.

Feb 21 '24 09:02 vsop-479

IntersectsTermsEnum is a bit scary to me, maybe @mikemccand can take a look, I expect him to be more familiar with it.

Feb 23 '24 16:02 jpountz

I will have a look -- thanks for the ping @jpountz.

Feb 27 '24 15:02 mikemccand

@mikemccand Thanks for your suggestion, I will try to implement it.

Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

I am working on this.

@jpountz Thanks for your reply.

Feb 28 '24 02:02 vsop-479

I think the optimization may be similar to the one done in AutomatonTermsEnum? https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java#L149-L153

When "ping-ponging" the term dictionary against the automaton, it tracks visited bitset and looks for such loops in the automaton. when it finds one, it temporarily acts like a TermRangeQuery.

I think, it works a bit more general than just prefixquery and also helps with regex and wildcard queries too.

Feb 28 '24 02:02 rmuir

I think the optimization may be similar to the one done in AutomatonTermsEnum?

Thanks for reminding that, I will dig into AutomatonTermsEnum's optimization.

Feb 28 '24 05:02 vsop-479

There is still problem in the state of match all suffix of IntersectTermsEnumFrame. I am trying to figure it out.

On the other hand, I will dig into AutomatonTermsEnum's optimization.

Mar 01 '24 09:03 vsop-479

@mikemccand I renamed the field used to indicate whether an accept state can match all suffixes, and detected it in RunAutomaton. Please take a look when you get a chance.

Mar 02 '24 15:03 vsop-479

I think the optimization may be similar to the one done in AutomatonTermsEnum?

It seems the optimization of AutomatonTermsEnum is to improve iterating term mode, from seeking bytes from DFA and seek termsEnum(seekCeil), to simply sequential reads the termsEnum, after finding a loop(setLinear). In both mode, AutomatonTermsEnum needs to check if the term is accepted by running automaton.

If I am not mistaken, this optimization is different from AutomatonTermsEnum's. It directly matches all reminding suffixes and sub blocks, after detecting an accept state with a .* transition back onto itself.

Or maybe you mean we can improve AutomatonTermsEnum's optimization, to implement this optimization's effect? @rmuir

Mar 06 '24 06:03 vsop-479

Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

I measured it with current implementation with wikimedium1m:

TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
Prefix3      879.20      (9.1%)      995.37      (4.2%)   13.2% (   0% -   29%) 0.062
Prefix3      924.98      (9.9%)     1042.17      (7.9%)   12.7% (  -4% -   33%) 0.083

TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
Prefix3     1480.94      (8.8%)     1559.89      (6.4%)    5.3% (  -9% -   22%) 0.195
Prefix3     1242.80      (6.9%)     1307.30      (5.3%)    5.2% (  -6% -   18%) 0.299
Prefix3      177.54      (1.3%)      202.74      (6.3%)   14.2% (   6% -   22%) 0.000

Mar 07 '24 10:03 vsop-479

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

Apr 13 '24 00:04 github-actions[bot]

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

Apr 30 '24 00:04 github-actions[bot]

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

May 24 '24 00:05 github-actions[bot]

@mikemccand I think this change is close for PrefixQuery, TermRangeQuery, and custom binary AutomatonQuery.

As for RegexpQuery and WildcardQuery, whose automata converted by UTF32ToUTF8. We just get accept transitions [0, 127] in RunAutomaton, I am not sure whether it is enough that just check the accept state is transition is [0, 127].

Another approach is set matchAllSuffix in UTF32ToUTF8, but need to add a unFundamental member to Automaton.

Or, can we just push this change for Prefix/TermRangeQuery, and leave a TODO for Regexp/WildcardQuery temporarily?

Jun 21 '24 06:06 vsop-479

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

Jul 07 '24 00:07 github-actions[bot]

Conflicts resolved.

Aug 12 '24 08:08 vsop-479

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

Aug 28 '24 00:08 github-actions[bot]

lucene lucene copied to clipboard

Terminate automaton when it can match all suffixes, and match suffixes directly.

lucene
lucene copied to clipboard