lucene
lucene copied to clipboard
Terminate automaton when it can match all suffixes, and match suffixes directly.
For PrefixQuery, we can terminate the automaton on current term if we have matched the whole prefix, and match this term directly. Furthermore, if there is a subBlock, we could match all its' sub terms.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
@jpountz Please take a look when you get a chance.
IntersectsTermsEnum
is a bit scary to me, maybe @mikemccand can take a look, I expect him to be more familiar with it.
I will have a look -- thanks for the ping @jpountz.
@mikemccand Thanks for your suggestion, I will try to implement it.
Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).
I am working on this.
@jpountz Thanks for your reply.
I think the optimization may be similar to the one done in AutomatonTermsEnum? https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java#L149-L153
When "ping-ponging" the term dictionary against the automaton, it tracks visited
bitset and looks for such loops in the automaton. when it finds one, it temporarily acts like a TermRangeQuery.
I think, it works a bit more general than just prefixquery and also helps with regex and wildcard queries too.
I think the optimization may be similar to the one done in AutomatonTermsEnum?
Thanks for reminding that, I will dig into AutomatonTermsEnum
's optimization.
There is still problem in the state of match all suffix of IntersectTermsEnumFrame
. I am trying to figure it out.
On the other hand, I will dig into AutomatonTermsEnum's optimization.
@mikemccand
I renamed the field used to indicate whether an accept state can match all suffixes, and detected it in RunAutomaton
.
Please take a look when you get a chance.
I think the optimization may be similar to the one done in AutomatonTermsEnum?
It seems the optimization of AutomatonTermsEnum
is to improve iterating term mode, from seeking bytes from DFA and seek termsEnum(seekCeil
), to simply sequential reads the termsEnum, after finding a loop(setLinear
).
In both mode, AutomatonTermsEnum
needs to check if the term is accepted by running automaton.
If I am not mistaken, this optimization is different from AutomatonTermsEnum
's. It directly matches all reminding suffixes and sub blocks, after detecting an accept state with a .*
transition back onto itself.
Or maybe you mean we can improve AutomatonTermsEnum
's optimization, to implement this optimization's effect? @rmuir
Have you tried to measure any performance change with this? E.g. you could run a luceneutil benchy with just PrefixQuery, or, Regexp/WildcardQuery that also have this property (match-all states in their automata).
I measured it with current implementation with wikimedium1m
:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Prefix3 879.20 (9.1%) 995.37 (4.2%) 13.2% ( 0% - 29%) 0.062
Prefix3 924.98 (9.9%) 1042.17 (7.9%) 12.7% ( -4% - 33%) 0.083
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Prefix3 1480.94 (8.8%) 1559.89 (6.4%) 5.3% ( -9% - 22%) 0.195
Prefix3 1242.80 (6.9%) 1307.30 (5.3%) 5.2% ( -6% - 18%) 0.299
Prefix3 177.54 (1.3%) 202.74 (6.3%) 14.2% ( 6% - 22%) 0.000
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
@mikemccand
I think this change is close for PrefixQuery
, TermRangeQuery
, and custom binary AutomatonQuery
.
As for RegexpQuery
and WildcardQuery
, whose automata converted by UTF32ToUTF8
. We just get accept transitions [0, 127] in RunAutomaton
, I am not sure whether it is enough that just check the accept state is transition is [0, 127].
Another approach is set matchAllSuffix in UTF32ToUTF8
, but need to add a unFundamental member to Automaton
.
Or, can we just push this change for Prefix/TermRangeQuery
, and leave a TODO
for Regexp/WildcardQuery
temporarily?
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Conflicts resolved.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!