lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Terminate automaton when it can match all suffixes, and match suffixes directly.

Open vsop-479 opened this issue 1 year ago • 19 comments

For PrefixQuery, we can terminate the automaton on current term if we have matched the whole prefix, and match this term directly. Furthermore, if there is a subBlock, we could match all its' sub terms.

vsop-479 avatar Feb 05 '24 07:02 vsop-479

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Feb 20 '24 00:02 github-actions[bot]

@jpountz Please take a look when you get a chance.

vsop-479 avatar Feb 21 '24 09:02 vsop-479

IntersectsTermsEnum is a bit scary to me, maybe @mikemccand can take a look, I expect him to be more familiar with it.

jpountz avatar Feb 23 '24 16:02 jpountz

I will have a look -- thanks for the ping @jpountz.

mikemccand avatar Feb 27 '24 15:02 mikemccand

@mikemccand Thanks for your suggestion, I will try to implement it.

Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

I am working on this.

@jpountz Thanks for your reply.

vsop-479 avatar Feb 28 '24 02:02 vsop-479

I think the optimization may be similar to the one done in AutomatonTermsEnum? https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java#L149-L153

When "ping-ponging" the term dictionary against the automaton, it tracks visited bitset and looks for such loops in the automaton. when it finds one, it temporarily acts like a TermRangeQuery.

I think, it works a bit more general than just prefixquery and also helps with regex and wildcard queries too.

rmuir avatar Feb 28 '24 02:02 rmuir

I think the optimization may be similar to the one done in AutomatonTermsEnum?

Thanks for reminding that, I will dig into AutomatonTermsEnum's optimization.

vsop-479 avatar Feb 28 '24 05:02 vsop-479

There is still problem in the state of match all suffix of IntersectTermsEnumFrame. I am trying to figure it out.

On the other hand, I will dig into AutomatonTermsEnum's optimization.

vsop-479 avatar Mar 01 '24 09:03 vsop-479

@mikemccand I renamed the field used to indicate whether an accept state can match all suffixes, and detected it in RunAutomaton. Please take a look when you get a chance.

vsop-479 avatar Mar 02 '24 15:03 vsop-479

I think the optimization may be similar to the one done in AutomatonTermsEnum?

It seems the optimization of AutomatonTermsEnum is to improve iterating term mode, from seeking bytes from DFA and seek termsEnum(seekCeil), to simply sequential reads the termsEnum, after finding a loop(setLinear). In both mode, AutomatonTermsEnum needs to check if the term is accepted by running automaton.

If I am not mistaken, this optimization is different from AutomatonTermsEnum's. It directly matches all reminding suffixes and sub blocks, after detecting an accept state with a .* transition back onto itself.

Or maybe you mean we can improve AutomatonTermsEnum's optimization, to implement this optimization's effect? @rmuir

vsop-479 avatar Mar 06 '24 06:03 vsop-479

Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).

I measured it with current implementation with wikimedium1m:

TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
Prefix3      879.20      (9.1%)      995.37      (4.2%)   13.2% (   0% -   29%) 0.062
Prefix3      924.98      (9.9%)     1042.17      (7.9%)   12.7% (  -4% -   33%) 0.083
TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
Prefix3     1480.94      (8.8%)     1559.89      (6.4%)    5.3% (  -9% -   22%) 0.195
Prefix3     1242.80      (6.9%)     1307.30      (5.3%)    5.2% (  -6% -   18%) 0.299
Prefix3      177.54      (1.3%)      202.74      (6.3%)   14.2% (   6% -   22%) 0.000

vsop-479 avatar Mar 07 '24 10:03 vsop-479

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Apr 13 '24 00:04 github-actions[bot]

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Apr 30 '24 00:04 github-actions[bot]

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar May 24 '24 00:05 github-actions[bot]

@mikemccand I think this change is close for PrefixQuery, TermRangeQuery, and custom binary AutomatonQuery.

As for RegexpQuery and WildcardQuery, whose automata converted by UTF32ToUTF8. We just get accept transitions [0, 127] in RunAutomaton, I am not sure whether it is enough that just check the accept state is transition is [0, 127].

Another approach is set matchAllSuffix in UTF32ToUTF8, but need to add a unFundamental member to Automaton.

Or, can we just push this change for Prefix/TermRangeQuery, and leave a TODO for Regexp/WildcardQuery temporarily?

vsop-479 avatar Jun 21 '24 06:06 vsop-479

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Jul 07 '24 00:07 github-actions[bot]

Conflicts resolved.

vsop-479 avatar Aug 12 '24 08:08 vsop-479

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Aug 28 '24 00:08 github-actions[bot]