https://issues.apache.org/jira/browse/SOLR-17195

Description

Prefix-based queries consume memory in proportion to the number of terms in the index that start with the prefix. Short prefixes tend to match many more indexed terms, and consume more memory as a result, often causing instability issues on the node.

Yet Solr (prior to this PR) offers no way to restrict the prefixes used in queries.

Solution

This PR adds a solrconfig.xml property, minPrefixLength, which operates similarly to the existing maxBooleanClauses. Users who submit a query with a prefix shorter than the minimal acceptable length will receive an error of the form:

Query <snip> does not meet the minimum prefix length [2] (actual=[1]).  Please try with a larger prefix, or adjust minPrefixLength in your solrconfig.xml

Some notes on the implementation:

currently only enforced for 'string' and 'text' fields, where this cardinality problem occurs most frequently
defaults to '2' in the default configset, prohibiting only single-character prefixes.
can be overridden in the default configset with the solr.min.prefixLength sysprop/env-var

Tests

Tests for solrconfig.xml in SolrCoreTest. Tests for the limiting itself in PrefixQueryTest.

Checklist

Please review the following and check all that apply:

[x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
[x] I have created a Jira issue and added the issue ID to my pull request title.
[x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
[x] I have developed this patch against the main branch.
[x] I have run ./gradlew check.
[x] I have added tests for my changes.
[x] I have added documentation for the Reference Guide

Jun 05 '24 15:06 gerlowskija

Three questions I'd like to answer before this is merged:

I can imagine scenarios where an advanced user might want to enable this feature, but still be able to exempt certain queries. (Perhaps they know that certain fields are low-cardinality enough to be "safe", etc.) Is it worth adding a query-param feature flag, that can be used to disable the limit on a query-by-query basis?
Some faceting codepaths run prefix queries in the course of processing certain types of facets. Should we enforce this limit on these codepaths or skip it?
Is there any value in expanding the limit to cover field types other than 'string' and 'text'?

Other TODOs:

~ref-guide docs~
~additional testing~
~tidy/check.~

Jun 05 '24 15:06 gerlowskija

I wish we brainstormed this a bit on the dev list or something prior to you showing up with a PR where there is maybe inertia to continue with it the way it is

Jun 06 '24 17:06 dsmiley

I don't really think that's a fair characterization. The JIRA ticket was open and advertised to all of us on issues@ 3 months ago, and no one chimed in there.

I could've created a dev@ thread to discuss approaches. And now knowing your preference, maybe I will next time. But obviously not every JIRA needs a dev@ thread. I used my best judgement. Sometimes the brainstorming and feedback just doesn't appear until there's a 'patch' to stimulate it, so I wrote one 🤷

In any case, I'm happy to discuss/brainstorm the overall approach now. I think it stands on its own without considering "inertia".

In short, my thought process was that the more broadly applicable the limit is, the more effective it will be at protecting stability. It's hard to achieve stability if the safeguards you're relying on are opt-in, "off" by default, or only apply to a subset of relevant queries. So I wanted to make the "minPrefix" limit be opt-out, apply broadly, etc.

Solr already has a query-time limit that fits this "broadly applicable" model pretty well, also in service of stability: maxBooleanClauses. mBC has caused its share of issues, but the underlying model is sound and has been a part of Solr for at least a decade. Users are already familiar with this sort of limit coming in under solrconfig.xml, etc.

So that's mostly what I've tried to do in this PR: (1) make it "broad" by design, and (2) follow mBC as a model in terms of how to configure, document, and implement.

@dsmiley - You noted above your preference for something parser-specific. Can you elaborate on that preference a bit?

Jun 07 '24 17:06 gerlowskija

I'm sorry for challenging your due diligence on soliciting input.

In short, my thought process was that the more broadly applicable the limit is, the more effective it will be at protecting stability. It's hard to achieve stability if the safeguards you're relying on are opt-in, "off" by default, or only apply to a subset of relevant queries. So I wanted to make the "minPrefix" limit be opt-out, apply broadly, etc.

This runs the real risk of an obscure limit showing itself in production that would hot have been caught in a test setup for a use-case that should absolutely not have such a limit. Generally "q" is provided by a user, uses a more user-facing query parser (namely edismax), and would be a nice place for such a limit. But otherwise, queries show up in tons of other places for internal use-cases written by systems programmers (i.e. us devs) using the default "lucene" query parser where I don't think such a limit at all makes sense (at least not by default). Random examples -- delete-by-query, facet.query, join query.

The same is so for maxBooleanClauses! I was hoping you wouldn't bring that up but of course you did; it's the precedent for your decision. That unfortunate precedent caused a huge ugly debate with only one person holding veto power that insisted on maxBooleanClauses as it exists today. That might have made Yonik leave to do Heliosearch over, if there was a huge straw break that back. I hope we don't repeat that mistake. I've only ever increased maxBooleanClauses to something insane, not because there isn't a risk of users doing expensive queries that we want to limit, but instead because the correct (IMO) place to enforce query constraints is at the query parser. And we do this where I work for various things that in-effect constraint the query complexity.

@markrmiller you and most others were involved in that ugly debate... what do you think of Jason's approach here?

Jun 07 '24 18:06 dsmiley

Generally "q" is provided by a user, uses a more user-facing query parser (namely edismax), and would be a nice place for such a limit. But otherwise, queries show up in tons of other places for internal use-cases written by systems programmers (i.e. us devs) using the default "lucene" query parser

I'm not sure the edismax=user-queries, lucene=system-queries generalization is a safe one to make.

And even if it is, I think you're making an additional assumption here that "system" queries wouldn't also benefit from this check. I'm not sure I agree. System-devs are more likely than the average user to know about Solr's performance pitfalls, but is that safe to assume across the board? I think our many years with the project put us in danger of overestimating the knowledge and context other users might have, and underestimating the value of safeguards.

That's the key issue for me IMO - if I was more comfortable with those generalizations I'd be happy with an edismax fix. But as-is they feel like pretty shaky ground to base cluster-stability on.

I've only ever increased maxBooleanClauses to something insane, not because there isn't a risk of users doing expensive queries that we want to limit

maxBooleanClauses is, admittedly, a pain. For one, it's terribly calibrated, or at least terribly out of date. (Is 1024 clauses even a lot on today's hardware? It seems like nothing!). I'm not here to defend it. And I can't speak to the old debate without reading up on it. (Do you have a pointer handy @dsmiley ?)

But conceptually I think maxBooleanClauses takes the right response to a "DANGER" stimulus. If something has the potential to take down a cluster a global block is the safest thing to do. It's not ideal, but in the absence of other data to block in a more targeted manner, stability is worth prioritizing. Solr should be stable by default, even if that means admins have a little extra work to do to "unlock" certain query patterns. That's mostly what I was trying to convey by the comparison to mBC

Jun 10 '24 16:06 gerlowskija

Generally "q" is provided by a user, uses a more user-facing query parser (namely edismax), and would be a nice place for such a limit. But otherwise, queries show up in tons of other places for internal use-cases written by systems programmers (i.e. us devs) using the default "lucene" query parser

Thinking about this a bit more, I wonder if the real issue here is that there's no way to tell Solr unequivocally, on a request-by-request basis: "Don't second-guess me, I know this is safe". If users had a way to communicate that, there wouldn't be any need for us as developers to try to generalize about which queries came from "end" users, what users may or may not know about Solr performance, etc.

Would adding something like that make you feel any better about the parser-agnostic approach @dsmiley ? It'd give savvy "system" users a way to sidestep the "hitting a limit on a safe query in prod" false-positive you expressed concern about above.

Jun 10 '24 16:06 gerlowskija

“what do you think of Jason's approach here?”

It is a kind of thorny practice that evades a good definitive judgment.

I suppose the idea is, at some, perhaps fairly arbitrary limit, you fail things to force the user to reckon with what they are doing more carefully and actively raise limits if they think that's the correct action. I say perhaps fairly arbitrary because it's difficult to do any better, given varying hardware and data.

The cost is that is that things that worked fine in testing may start blowing up randomly in production, perhaps needlessly, until you address the situation. Thorny. Benifit / reward, what lines are drawn where, it's all pretty subjective. So I don't know.

As a user, I certainly would want to be able to opt in, though yes, that makes the effort likely of little value, particularly if it's opt-in per limit. A universal opt-in would be better but would likely still end up as little value then. As a user, I'll still be pissed when production starts failing things the first time, and I judge it should not be doing that for my case. And at that point, I would be very unhappy if there was not a universal opt-out. Thorny.

I do see the value, but in most cases, that value would not likely be high enough to convince me as a user that this was a good default system practice. Other than for very particular things, it doesn't seem like a common practice in other systems because of how thorny it is.

I think some limits have a bit more clear value, though. Things that can build up over time and then are very difficult to address. Something along the lines of no limit to the number of fields on a document, where things can work fine for the first 6 months, but over time, you are using more and more fields and eventually the system just kind of seizes under the weight, and you can't do much but start over. Maybe hitting something much earlier that said you are using a strangely large number of fields, this is not a very scable approach, you have to change a configuration to continue down this path, would make a lot more sense.

Jun 11 '24 02:06 markrmiller

One use case where opt-in still has a lot of value is when one team runs Solr as a service or for other teams. In that case, it's much more of an available feature than automatic user self protection.

Jun 11 '24 03:06 markrmiller

@markrmiller mentioned to me offline that it was clear that my preference was for cluster-stability over convenience/admin-friendliness, but that I never made the rationale behind that preference explicit. He suggested I take one last crack at making the case for across-the-board limiting, as explicitly as possible. So here it is:

Ultimately my rationale is pretty simple: aggressive prefix queries have the potential to block all traffic if they trigger an OOM. That's a worst-case-scenario for an administrator.

There's pain for administrators either way. Realizing that some setting has errantly blocked some "known safe" queries definitely brings a serious "annoyance-factor". But to put myself in an administrator's shoes, I'd rather deal with that on 10, 20, even 100 clusters than deal with an "all traffic" outage at 3am because of a runaway prefix query.

So that's my last pitch for across-the-board limiting. I'll give this a few days to see if it sways anyone; if no one else is convinced enough to chime in or reverse course though I'll go forward with something like the more limited approach that David championed.

Jun 18 '24 14:06 gerlowskija

I've been on vacation lately. I suggested a limit based on the query parser. This suggestion isn't limited to edismax; it'd include some others that users are likely to use. Another angle is to employ this limit only on a TextField, which generally should be used for keyword search and thus user queries. I think I can be comfortable with that. But not StrField in particular (in terms of opt-in vs opt-out). That's where my concern is most likely to be used in practice.

Jun 21 '24 17:06 dsmiley

I suggested a limit based on the query parser. This suggestion isn't limited to edismax; it'd include some others that users are likely to use.

Did you have particular parsers in mind @dsmiley ? Or is there somewhere else that the project relied on this "end-user" vs "admin" parser distinction, that I could get a list from?

I'm on board with expanding the set of parsers that will enforce this limit, but I don't have a great sense what you or others would consider "end-user" parsers. If you have a concrete subset in mind, then I'm happy to run with that.

Another angle is to employ this limit only on a TextField, which generally should be used for keyword search and thus user queries.

I think I agree that "field-type" is a better predictor of "user query or not" than the query-parser would be.

Maybe my experience is idiosyncratic, but prefix queries on TextField aren't the common case from what I've seen. Anecdotally, prefix queries are much much more common on StrField.

We could apply the limit only on TextField, but if that only covers (e.g.) 20% of the prefix queries that'll hit the index - is it providing stability? I guess it still could, if the terms dictionary was large enough. But still, leaving the majority prefix-query use-case uncovered doesn't feel great... I'm just thinking-aloud/rambling at this point I guess.

Anyway, let me think through which of these two approaches (field-based vs parser-based) I like better, and I'll push up a modification shortly.

Thanks for sticking with this through your time off and everything @dsmiley ; hope you had a good trip!

Jun 25 '24 13:06 gerlowskija

The more I think through this today, the more I start to wonder whether I was wrong to oppose an "opt-in" solution originally.

It doesn't benefit all users the way an opt-out solution would, but maybe that's for the best given the high level of concern around annoyance and false-positives. Shrewd users, and folks who especially need it (e.g. the "service" use case Mark mentioned) would get a really comprehensive limit at their disposal once they enable it. And of course, it'd keep us out of the business of trying to guess which queries are likely to be "user queries".

Is that something you could get behind @dsmiley ?

Jun 25 '24 19:06 gerlowskija

One idea is that maybe we would want to have a page in the ref guide that was "The Defensive Solr Setup" and you could reference this type of feature and other features that enable this? Maybe someday we ship a configset that represents a configuration that is very protective? To make it easier for non experts to adopt this type of feature?

Jun 26 '24 12:06 epugh

"user" query parsers: complexphrase, dismax, edismax. All others seem internal to me. We could have an abstract method on QParserPlugin (or QParser) that compels plugin authors to indicate what the default prefix query limit should be (if any).

Field types: (I accept this approach too, albeit I like it less)

Is that something you could get behind @dsmiley ?

I'm unsure exactly what you are asking me. I'm definitely behind "opt-in" :-)

Jun 26 '24 13:06 dsmiley

I'm unsure exactly what you are asking me. I'm definitely behind "opt-in" :-)

Sorry - I should've been more concrete. I'm asking if you could live with a parser-agnostic (i.e. global) limit like what the PR currently implements, if it was made "opt-in" rather than "opt-out".

AFAICT that combination would address your concerns around false-positives, making work for admins, etc. And it'd be my preference over a parser-aware approach, as I'm not totally comfortable with the idea of inferring query origin from the parser implementation.

If you're -1 on any sort of a "global" limit, even as an opt-in feature, I'll implement parser-aware. But I didn't want to do that without checking where you stood on the global+optIn approach.

Jun 27 '24 17:06 gerlowskija

Alright @dsmiley - I took in most of your suggestions:

The logic for determining the limit now lives in QParser
QParser looks at the solrconfig.xml setting, but uses a local-param override if specified
The setting is now "opt-in", with the flag value "-1" used in the default configset.

Lmk what you think; hopefully this is close.

Jul 03 '24 14:07 gerlowskija

AFAIK - the last outstanding question here is whether to bundle in a separate change to make ComplexPhraseQParser be schema-aware in how it creates its queries. There's a few technical hurdles there unfortunately - so my plan (unless I hear any vetos) is to punt on that for now. Full discussion of the idea and subsequent issues here..

I'll give a little more time for feedback, but I'll look to merge sometime this week otherwise

Jul 08 '24 16:07 gerlowskija

SOLR-17195: Add 'minPrefixLength' soft limit

Description

Solution

Tests

Checklist