
SOLR-4587: integrate lucene-monitor into solr

Open kotman12 opened this issue 1 year ago • 7 comments

https://issues.apache.org/jira/browse/SOLR-4587

Description

The module hopes to simplify distributing and scaling query indexes for monitoring and alerting workflows (also known as reverse search) by providing a bridge between a Solr-managed search index and lucene-monitor's efficient reverse search algorithms.

Here is some evidence that the community might find this useful.

  1. Blog-post that partly inspired the current approach
  2. Users asking about a percolator-like feature on stackoverflow.
  3. Someone contributed this extension, but it doesn't really provide percolator-like functionality, and because it wasn't upstreamed it fell out of maintenance.
  4. Plug for my own question on the issue!

Solution

This is still a WiP but I am opening it up as a PR to get community feedback. The current approach is to ingest queries as solr documents, decompose them for performance, and then use the child-document feature to index the decomposed subqueries under one atomic parent document block. On the search side, the latest approach is to use a dedicated component that creates hooks into lucene-monitor's Presearcher, QueryTermFilter, and CandidateMatcher.
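To make the decomposition idea concrete, here is a dependency-free toy sketch. The real code decomposes Lucene Query trees and stores the subqueries as Solr child documents under one parent block; everything below (class and method names, string-based "queries") is invented purely for illustration:

```java
import java.util.*;

// Toy model of decomposition: a top-level disjunction is split into
// child entries stored under one parent id (the "atomic parent block"),
// and a match on any child is reported, deduplicated, as the parent.
class DecompositionSketch {
    // hypothetical store: parent query id -> decomposed disjunct terms
    static final Map<String, List<String>> index = new HashMap<>();

    static void indexQuery(String parentId, String disjunction) {
        // naive split of a top-level OR; the real code decomposes
        // Lucene Query objects, not strings
        index.put(parentId, Arrays.asList(disjunction.split(" OR ")));
    }

    static Set<String> match(Set<String> docTerms) {
        Set<String> matches = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : index.entrySet()) {
            // the parent matches if ANY child disjunct matches, so each
            // subquery can be checked (and pre-filtered) independently
            for (String term : e.getValue()) {
                if (docTerms.contains(term)) {
                    matches.add(e.getKey()); // dedupe by parent id
                    break;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        indexQuery("q1", "solr OR lucene");
        indexQuery("q2", "elastic");
        System.out.println(match(Set.of("lucene", "monitor"))); // prints [q1]
    }
}
```

The payoff of indexing disjuncts separately is that each subquery is smaller and more selective, while the parent-block grouping keeps dedup and atomic delete of all children trivial.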

The current optional cache implementation uses caffeine instead of lucene-monitor's simpler ConcurrentHashMap. It's worth noting that this cache should likely be quite a bit larger than your average query or document cache, since query parsing involves a non-trivial amount of compute and disk I/O (especially for large results and/or queries). It's also worth noting that lucene-monitor keeps all the indexed queries cached in memory in its default configuration. A unique solr-monitor feature is the addition of a bespoke cache warmer that tries to populate the cache with approximately all the queries updated since the last commit. This was added to have a baseline when comparing with lucene-monitor performance. The goal was to make it possible to effectively cache all queries in memory (since that is what lucene-monitor enables by default) but not necessarily require it.
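For intuition about the {cacheId -> parsed query} cache, here is a stdlib-only LRU sketch. The PR itself uses Caffeine, which adds weight-based eviction, statistics, and real concurrency; this single-threaded map only illustrates the sizing concern described above:

```java
import java.util.*;

// Minimal LRU sketch of the {cacheId -> parsed query} cache idea.
// Evicted entries must be re-parsed (compute plus disk I/O), which is
// why a query-monitoring cache should be sized generously.
class QueryCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    QueryCacheSketch(int maxEntries) {
        super(16, 0.75f, true); // access-order, so iteration order is LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // drop the least-recently-used entry once capacity is exceeded
        return size() > maxEntries;
    }
}
```

A warmer like the one described would simply iterate recently updated query ids and `put` their parsed forms before the first search hits the cache.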

Currently the PR has some visitor classes in the org.apache.lucene.monitor package that expose certain lucene-monitor internals. If this approach gets accepted, then the lucene project will likely need to be updated to expose what is necessary.

Tests

  1. testMonitorQuery: basic functionality before and after an update
  2. testNoDocListInResponse: The current API allows for two types of responses: a special monitorDocuments response that can relay lucene-monitor's response structure and unique features such as "reverse" highlights, and a regular solr document list in which each "response" document actually refers to a query matching the "real" document being matched. This test ensures you can disable the solr document list in the response.
  3. testDefaultParser: validate that solr-monitor routes to default parser when none is selected.
  4. testDisjunctionQuery: validate that subqueries of a disjunction get indexed separately.
  5. testNoDanglingDecomposition: validate that deleting a top-level query also removes all the child disjuncts.
  6. testNotQuery
  7. testWildCardQuery
  8. testDefaultQueryMatchTypeIsNone: If no match type is selected with the monitorMatchType field then only a solr document list is returned (same behavior as "forward" search).
  9. testMultiDocHighlightMatchType: Test highlight matcher on a multi-document batch and ensure it returns the character offsets and positions of all individual matches. It is worth noting that percolator returns the actual matching text snippet. This is something we could consider supporting within solr or adding to lucene-monitor.
  10. testHighlightMatchType: Single-doc highlight test. Slightly different from the one above in that the highlighted field does not need storeOffsetsWithPositions="true", which is pretty convenient. I am not sure if I am relying on a MemoryIndex implementation detail, but it is a bit tedious for users to update their schemas to have storeOffsetsWithPositions="true" just to get character offsets back from the highlight matcher. I also don't know if there is a better way to handle the multi-doc case .. maybe break each doc into its own MemoryIndex reader so that we get offsets by default without specifying storeOffsetsWithPositions="true"?
  11. manySegmentsQuery: The cache warmer has reader-leaf-dependent logic so this was included to verify everything works on a multi-segment index.

All of the above are also tested with the custom configurations below:

  1. Parallel matcher - lucene monitor allows for running the final, most-expensive matching step in a multi-threaded environment. The current solr-monitor implementation allows for this with some restrictions. For instance, it is difficult to populate a document response list from a fully asynchronous matching component because it would require awkwardly opening and closing leaf collectors on-demand. The more idiomatic solr approach would be to just run this on many shards and gain parallelism as recommended here. Still, during testing I found that a fully async postfilter in a single shard had better performance than an equally-parallel multi-sharded, synchronous postfilter so I've decided to keep it in the initial proposal. On top of that, it helps achieve greater feature parity with lucene-monitor (which obviously has no concept of sharding so can only parallelize with a special matcher).
  2. Stored monitor query - allow storing queries with stored="true" instead of using the recommended docValues. docValues have stricter single-value size limits, so this is mainly to accommodate humongous queries.
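The parallel matcher configuration above is essentially a fan-out of the expensive final matching step over a thread pool. This stdlib-only sketch shows the shape of it; all names are invented, and lucene-monitor's own ParallelMatcher does the equivalent against real queries and collectors:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Predicate;

// Sketch of the parallel-matcher shape: run each candidate's final
// (expensive) match check on a fixed pool and merge results in
// submission order.
class ParallelMatchSketch {
    static List<String> matchAll(
            List<String> candidateIds, Predicate<String> matches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String id : candidateIds) {
                // each candidate's final match check runs on the pool
                futures.add(pool.submit(() -> matches.test(id) ? id : null));
            }
            List<String> result = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    String id = f.get(); // preserves submission order
                    if (id != null) result.add(id);
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

The restriction mentioned above falls out of this shape: a fully asynchronous matcher has no natural point at which to open and close per-leaf collectors for a document response list.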

I'll note here that I also have some local performance tests which are difficult to port but which helped guide some of the decisions so far. I've also "manually" tested the custom tlog deserialization of the derived query field, but this verification should probably go somewhere in a TlogReplay test. I haven't gone down that rabbit hole yet as I wanted to poll for some feedback first. The reason I skip the tlog for the derived query fields is that these fields wrap a token stream, which is difficult to serialize without a custom analyzer. The goal was to let users leverage their existing document schema as much as possible instead of having to create something custom for the query-monitoring use-case.

Checklist

Please review the following and check all that apply:

  • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [x] I have created a Jira issue and added the issue ID to my pull request title.
  • [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [x] I have developed this patch against the main branch.
  • [x] I have run ./gradlew check. TODO some apparently unrelated test failures
  • [x] I have added tests for my changes.
  • [ ] I have added documentation for the Reference Guide

kotman12 avatar Apr 01 '24 17:04 kotman12

So I was trying to learn how the main configuration bits fit together here, and the high-level reverse search idea. My solr-monitor-naive-dinner-demo branch (or the #2421 diff) off this pull request's branch is a side effect of that, and my understanding so far, based on it, is that:

  • the in-memory state is in the Presearcher object in the ReverseQueryParserPlugin class object (and in the solr-monitor-naive-dinner-demo i just used a simple Monitor object instead of the Presearcher object)
  • the state is updated via the MonitorUpdateRequestProcessor i.e. saved searches are added as MonitorQuery objects to the Monitor object (and updating of the Presearcher object is a bit different)
  • the state is accessed via the ReverseSearchComponent component (currently non-distributed but conceptually distributed would work too?)

Is that basic understanding correct? As a next step I might go learn more about the Presearcher itself.

cpoerschke avatar Apr 25 '24 17:04 cpoerschke

So I was trying to learn how the main configuration bits fit together here, and the high-level reverse search idea. My solr-monitor-naive-dinner-demo branch (or the #2421 diff) off this pull request's branch is a side effect of that, and my understanding so far, based on it, is that:

  • the in-memory state is in the Presearcher object in the ReverseQueryParserPlugin class object (and in the solr-monitor-naive-dinner-demo i just used a simple Monitor object instead of the Presearcher object)
  • the state is updated via the MonitorUpdateRequestProcessor i.e. saved searches are added as MonitorQuery objects to the Monitor object (and updating of the Presearcher object is a bit different)
  • the state is accessed via the ReverseSearchComponent component (currently non-distributed but conceptually distributed would work too?)

Is that basic understanding correct? As a next step I might go learn more about the Presearcher itself.

I'll give the PR a look, but on an architectural level it is similar to what you describe. The custom update processor adds the saved search to some stateful component. The reverse search component takes a solr doc and converts it to a lucene query. It then runs that document-in-query-form against the stateful component to find the matching saved searches. When I first looked at this, I wanted to use the monitor as the stateful component, but some problems quickly emerged with that idea. My main concerns about wiring a Monitor straight into solr were:

  1. Handling commit/rollback, and what to update the tlog with if you are also writing to a "sidecar" monitor object.
  2. Handling persistence. Currently the Monitor has its own tightly sealed index. It can be configured for persistence, but if you want to peek at the segments a monitor is writing to disk it might not be easy, especially for configurations like tlog+pull. The alternative is to use only the in-memory Monitor configurations, but that has limitations and takes away precious resources from the {cacheId -> deserialized query} cache.
  3. Which brings me to my final point: the cache a Monitor object wraps is a simple ConcurrentHashMap, and the Monitor itself is updated with a very coarse-grained lock that can block reads for a long time (because it synchronizes the map with the index). It just doesn't feel like it "jives" with the solr approach to concurrency, which is much more sophisticated (it is a fully fledged db after all). We could make the Monitor cache more configurable in the upstream lucene monitor repo, but in my opinion lucene monitor tries to do too much state management that it's not that good at; the most valuable things to take advantage of are its sophisticated reverse search methods (query decomposition for faster matching, query tokenization for pre-search, term weighting, optimized document-to-query conversion with a term-acceptor, and probably something else I am forgetting).

kotman12 avatar Apr 25 '24 20:04 kotman12

@cpoerschke I was reading your last comment more carefully and I want to stress that the presearcher, once constructed, should be completely stateless in the current proposal. The whole point of extracting the internal bits of lucene-monitor was to avoid its cumbersome internal state management that doesn't really fit nicely into solr (at least with anything I've been able to come up with). The presearcher merely exposes the methods to efficiently convert queries into documents (and vice versa) which make reverse search faster. That is why it is used by both MonitorUpdateRequestProcessor and ReverseQueryParser.
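To make that stateless-transform contract concrete, here is a toy, stdlib-only analogy. The real Presearcher works on Lucene Query/Document objects and is far more selective; every name and the string-based term extraction below are invented for illustration only:

```java
import java.util.*;

// Toy analogy of the stateless presearcher contract: one transform turns
// a stored query into indexable terms, the mirror transform turns an
// incoming document into a candidate filter, and a stored query is only
// a match candidate if the two term sets overlap.
class PresearcherSketch {
    // indexing side: extract the terms a query could ever match on
    static Set<String> indexQuery(String query) {
        return new TreeSet<>(Arrays.asList(query.toLowerCase().split("\\W+")));
    }

    // search side: the document's terms become the candidate filter
    static Set<String> buildCandidateFilter(String document) {
        return new TreeSet<>(Arrays.asList(document.toLowerCase().split("\\W+")));
    }

    // both sides share the same stateless transforms, which is why a
    // single presearcher instance can serve indexing and searching alike
    static boolean isCandidate(Set<String> queryTerms, Set<String> docFilter) {
        return !Collections.disjoint(queryTerms, docFilter);
    }
}
```

Because both transforms are pure functions of their input, there is no state to commit, roll back, or replay, which is exactly what makes this split fit Solr better than a sidecar Monitor.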

kotman12 avatar Apr 29 '24 13:04 kotman12

Thanks for the notes above, they helped me continue browsing the code to see how the Presearcher fits in and how the various objects are currently put together.

So with indexing (via MonitorUpdateRequestProcessor) and searching (via ReverseSearchComponent, which uses ReverseQueryParser) both using the Presearcher, but the presearcher actually being stateless, I wonder why there's a shared presearcher and not separate ones for indexing and searching. Ease of configuration might be a reason? Or could one imagine a scenario where indexing and searching wish to use different presearcher types or settings?

And then next I looked to understand more about how the ReverseQueryParser[Plugin] and SimpleQueryParser objects fit into the picture. Here I was temporarily a little confused: the ReverseQueryParser seems to do little parsing but more post-processing on the json-to-document parsing done by ReverseSearchComponent.documentBatch previously, and SimpleQueryParser seems to wrap the default parser. Okay, but leaving that side-thought aside, back to the Presearcher focus of my session -- if there weren't the ReverseQueryParser[Plugin] objects, then what might own the Presearcher object? Both MonitorUpdateRequestProcessor and ReverseSearchComponent need to be configured, but neither seems an obvious owner for a presearcher used by both.

So next then considering the ReverseSearchComponent more closely:

  • ReverseSearchComponent.process jumps out as being a no-op.
  • Upon further consideration this appears to be because use-together-with the QueryComponent is assumed, is that correct?
  • If the ReverseSearchComponent must be used together with the QueryComponent then wondering what's the pros/cons of that versus ReverseSearchComponent extends QueryComponent inheritance instead.

And if the ReverseSearchComponent itself and use of it must be configured e.g. <arr name="last-components"> <str>reverseSearch</str> </arr> in the request handler, then might an alternative be to have a reverse search handler class (with in-built default components) and with ownership of the presearcher?

public class ReverseSearchHandler extends SearchHandler {

  private Presearcher presearcher;

  @Override
  protected List<String> getDefaultComponents() {
    ArrayList<String> names = new ArrayList<>(2);
    names.add(QueryComponent.COMPONENT_NAME);
    names.add(ReverseSearchComponent.COMPONENT_NAME);
    return names;
  }

  @Override
  @SuppressWarnings("unchecked")
  public void inform(SolrCore core) {
    presearcher = ...
  }
}

All just some thinking-aloud type notes here really, as a next step I'll probably look further at the ReverseQueryParser[Plugin] and SimpleQueryParser classes.

cpoerschke avatar May 01 '24 15:05 cpoerschke

Ease of configuration might be a reason? Or could one imagine a scenario where indexing and searching wish to use different presearcher types or settings?

I think it would be difficult to support defined behavior if a different presearcher configuration was used for indexing vs search. I am sure there is some scenario where this is useful but I think it is tricky to implement. Moreover, lucene's Monitor shares the same presearcher instance for both sides of its API.

Here I was temporarily a little confused, the ReverseQueryParser seems to do little parsing but more post-processing on the json-to-document parsing done by ReverseSearchComponent.documentBatch previously ..

I am confused too! I didn't pay a lot of attention to this, but I'll try to see if I can remove these two bits and just stick the logic in the ReverseSearchComponent. I'll reply otherwise if I can think of any good reason for this superfluous separation of duties between things that really seem very related.

ReverseSearchComponent extends QueryComponent inheritance instead.

I think I like this idea a lot and am willing to try it out unless you beat me to it! I must say I also very much like the neatly-packaged ReverseSearchHandler idea which is actually well-defined in code and reduces the need for boilerplate config of potential users.

SimpleQueryParser seems to wrap the default parser.

Yes, it's mainly there to supply a SolrQueryRequest, which the parser needs to satisfy its contract. I know it's a hack, so I'm open to a better way of wrapping this, like maybe making the SimpleQueryRequest an inner class? The benefit of this parser wrapper is that you avoid creating this "dummy" solr request in the two places that currently call it. Tangentially, I'd like to also support additional query parameters on top of the query string, i.e. the df default field param, but I haven't gotten to that yet (you'll notice that the SimpleQueryParser::getParams method is empty for now). Basically, there could be a scenario where SimpleQueryParser::parse(String, SolrCore, SolrParams) might drift further from QParser.getParser(String, SolrQueryRequest). SimpleQueryParser is probably a terrible name btw.

kotman12 avatar May 02 '24 00:05 kotman12

... Tangentially, I'd like to also support additional query parameters on top of the query string, i.e. the df default field param, but I haven't gotten to that yet (you'll notice that SimpleQueryParser::getParams method is empty for now). ...

Maybe I'm making a tangential comment here too but I'm wondering about scoping of the integration and if/how perhaps it could be split into phases e.g. an initial integration scope in this pull request and more advanced aspects somewhat considered in the approach here but deferred to later phases otherwise. E.g. additional query parameters and parallelism and aliasing and write-to-doc-list (or not-write-to-doc-list) sound like potentially more-advanced aspects?

cpoerschke avatar May 07 '24 12:05 cpoerschke

... Tangentially, I'd like to also support additional query parameters on top of the query string, i.e. the df default field param, but I haven't gotten to that yet (you'll notice that SimpleQueryParser::getParams method is empty for now). ...

Maybe I'm making a tangential comment here too but I'm wondering about scoping of the integration and if/how perhaps it could be split into phases e.g. an initial integration scope in this pull request and more advanced aspects somewhat considered in the approach here but deferred to later phases otherwise. E.g. additional query parameters and parallelism and aliasing and write-to-doc-list (or not-write-to-doc-list) sound like potentially more-advanced aspects?

I support whatever will make this more mergeable. Some of these features were added to make the solution more adoptable by current users of lucene-monitor so I was striving for feature parity.

Having said that, the aliasing feature specifically is fairly important and I'd probably want to make that the default/recommended approach. By aliasing the fields the presearcher creates you guarantee schema compatibility plus you get other benefits like compatibility with multi-pass presearcher.

Having said "having said that", I'll defer to what you think is best, because getting anything merged is a selling point for the viability of this approach. From the feedback I've gotten so far from other solr committers, it sounds like they want some real users before the maintainability threshold is reached. On the other hand, I've talked to some potential users who were skeptical about using an experimental solr plugin. With a chicken-and-egg scenario like that, even just talking about merging is great marketing 🙂

kotman12 avatar May 09 '24 01:05 kotman12

I'm really excited about this PR because that looks like a great feature to use. Are there any updates on this?

zzBrunoBrito avatar Nov 05 '24 13:11 zzBrunoBrito

I'm really excited about this PR because that looks like a great feature to use. Are there any updates on this?

Hey @zzBrunoBrito, thanks for the comment! We are currently testing out a fork of this change internally in my company. Would you be willing to test this yourself as a beta user? Your feedback could drive decisions about the direction of this change, i.e. what features to include/exclude. Let me know what you need to test this module. If you can check out the branch and build it, you should be able to pick up the solr-monitor-10.X.X-SNAPSHOT.jar from the solr/modules/monitor/build/libs/ directory and then add it to your solr process's classpath. I'll try to think of an easier way of distributing this change if that sounds too complicated.

The reason I ask you to test it is that one of the blockers for merging this was the limited scope. "Solr-monitor" is a somewhat niche feature, so the community is a little hesitant to commit to maintaining yet another module unless there is at least some interest in it (which is why your comment is much appreciated).

Maybe we can communicate about this on Solr's slack channel since there might be a bit of back-and-forth getting this to work for you at the start. I will eventually document this change but was putting that part off since I wasn't sure anyone would read it 😄

kotman12 avatar Nov 07 '24 00:11 kotman12

Hi @kotman12, sorry for my late reply; I wasn't able to check GitHub lately. Thank you for responding. Yes, I am interested in using it; it's a very interesting feature for me and has lots of potential to power impactful features. I'll follow your instructions and I'm happy to provide any feedback from that. I'm curious to test it and see how it works.

zzBrunoBrito avatar Dec 06 '24 16:12 zzBrunoBrito

Hey @zzBrunoBrito .. thanks for replying. I've added some general docs in this branch that could help you out. I plan to add it to the refguide but I'm still working through some stuff.

Until this is upstreamed, though, and solr picks up lucene 9.12 (which is in the works), you will have to do this hack:

  1. Build the module and copy build/packaging/saved-search/ into solr-9.7.0/modules/ of your solr installation. Alternatively, you can use the one I just built to get started: saved-search.zip. You might notice the strange name aa-solr-saved-search-10.0.0-SNAPSHOT.jar .. this is a temporary hack to load saved-search classes before lucene-monitor classes. This won't be necessary once lucene 9.12 is picked up.
  2. You can pretty much follow the Basic Configuration instructions in the README, but any time you see a classpath pointing to a new class like solr.SavedSearchUpdateProcessorFactory .. you will have to replace the resource path with solr.savedsearch.update.* or solr.savedsearch.cache.* or solr.savedsearch.search.* depending on the class. That is, unless you want to build the whole solr distribution from scratch .. because the resource loader name mapping is in solr-core, which I've updated in this PR but which is obviously not something you'll pick up if you only patch in the saved-search jar.
  3. Run bin/solr start -e cloud -Dsolr.modules=saved-search

kotman12 avatar Dec 07 '24 14:12 kotman12