lucene GH#11601: Add ability to compute reader states after refresh

This PR is only meant as a starting point for a conversation and not to be merged as is. It is a proposal to let users retrieve new SortedSedDocValuesReaderState objects after doing a refresh in a single method call to the ReferenceManager. The new reader states will have updated ordinal maps.

Problems this solves:

Getting reader states that correspond to the current IndexReader. We ensure this by using the refreshLock.
Avoid a circular dependency from the core module on the facets module by using the StateCalculator functional interface.

Issue https://github.com/apache/lucene/issues/11601

Sep 18 '22 12:09 stefanvodita

How is it different from making a subclass/alternative of ReaderManager and just put the state calculation inside refreshIfNeed?

Sep 26 '22 06:09 zhaih

How is it different from making a subclass/alternative of ReaderManager and just put the state calculation inside refreshIfNeed?

I think there’s two things that would be different. Please correct me if I’m wrong.

If we simply override refreshIfNeeded, we don’t have a way to return the new reader states. We would still need to create some new method that does the refresh and returns the reader states.
Doing this in the ReferenceManager makes it work for both ReaderManager and SearcherManager with the right StateCalculator for each. I don’t know how valuable that is though.

Sep 26 '22 20:09 stefanvodita

For 1. I think we can always create a ReferenceManager of IndexReaderAndState so that those two are always refreshed together and you can acquire them together. For 2. I think people are discussing about deprecating SearcherManager right now (because we're having a timeout per indexSearcher which makes it not reusable) so I think there's not much value in it, but I could be wrong..

Sep 26 '22 20:09 zhaih

Thinking some more about this change in light of the comments, I see 3 ways it could go:

Do nothing. The user can already instantiate new reader states after a refresh if they want. No new API is required.
What this PR tried - offer an API to refresh and compute new reader states as one operation. @mikemccand points out this is likely not useful.
Allow the user to update the ordinal maps in the reader states they already have without requiring them to completely recreate the reader states. I’m not sure how much this accomplishes. The Javadoc suggests that the bulk of the work in instantiating a SortedSetDocValuesReaderState is building the ordinal map. In that case, updating the ordinal maps instead of recreating the reader states doesn’t have much benefit.

Do any of these seem like reasonable approaches to address the original issue? If not, how else could we approach this?

Oct 06 '22 00:10 stefanvodita

3. Allow the user to update the ordinal maps in the reader states they already have without requiring them to completely recreate the reader states. I’m not sure how much this accomplishes. The Javadoc suggests that the bulk of the work in instantiating a SortedSetDocValuesReaderState is building the ordinal map. In that case, updating the ordinal maps instead of recreating the reader states doesn’t have much benefit.

I think this is not a great option (in-place update) because it'd break Lucene's point-in-time semantics. It would be nice to be able to create a new OrdinalMap for a refreshed reader by passing in the old OrdinalMap and being more "incremental" in how the new one is created (building on, but not altering, the old one)? But that is really a separate issue and is quite complex I think :)

Do nothing. The user can already instantiate new reader states after a refresh if they want. No new API is required.

+1 for this option -- I don't see why this is a problem for users/applications. No need to do this atomically under the refresh lock.

Oct 12 '22 14:10 mikemccand

That makes sense. Maybe I'm not addressing the right problem. @gsmiller - as the issue's author, what do you think?

Oct 18 '22 13:10 stefanvodita

@mikemccand’s feedback made me shift focus from rebuilding ordinal maps on refresh to just making it convenient for the user to do so. SsdvReaderStatesManager shows how that might be done.

Dec 11 '22 12:12 stefanvodita

Sorry @stefanvodita, just now coming back to this after a break.

OK, so current state-of-the-world: Right now, users have to instantiate reader state instances for each field they're using for SSDV faceting so they can provide those to SSDFacets. These reader state instances must be re-created when the index is refreshed, and users have to manage this on their own. Creating these state instances is expensive, so users need a way to keep track of them, but also refresh them when necessary. To be fair though, users have all the capabilities they need to do efficient SSDV faceting, it's just a lot to keep track of for them (in my opinion anyway).

So I think the real question here is if we can do something that makes users' lives easier w.r.t. keeping track of these states and refreshing them when necessary. I think Mike's point-of-view is that maybe this isn't such a big deal and may not be necessary?

If we do want to explore ways to make this interaction easier for users, I think FacetsConfig is the only realistic "hook" in some sense. One issue is that there is actually no current place where we can know what SSDV fields in the index are being used for faceting. We can't rely on the "dim config" exposed by FacetsConfig since users are not required to register any specific dim config for a faceting field if they're happy with the defaults.

I wonder if one option to explore is to have a FacetsConfig instance keep track of the SSDV fields being used for faceting when it builds docs? That's the one place we actually know that an SSDV field is being added for the purpose of faceting. If the FacetsConfig instance kept track of SSDV faceting fields in this way, we could add a helper method to FacetsConfig that takes an IndexReader as input and creates a new set of reader states. Then, we could consider a different ctor for SSDVFacets that takes a FacetsConfig instance instead of taking a state instance directly, and the FacetsConfig instance could provide the correct state instance. I think there's some complexity in here though since a single FacetsConfig instance could be used with different index readers while a refresh is happening, so it would probably need to keep track of a state instance for each SSDV field + IndexReader or something. I don't know. This might all just be too complicated to be worth it, but it would be simpler for users if all they had to do was provide their FacetsConfig instance to SSDVFacets and then call some method on that FacetsConfig instance whenever they refresh their index.

Mar 10 '23 16:03 gsmiller

Thanks for the feedback @gmiller! The way you describe the state-of-the-world makes perfect sense to me, I think we’re on the same page there.

What I was trying to do with SsdvReaderStatesManager is similar to what you suggest. I definitely made that harder to get across by not removing code from old revisions, but I’ve fixed that now.

Highlights from the new revision:

In FacetsConfig I’ve made it so that we store and are able to recall which fields are used for faceting.
SsdvReaderStatesManager stores and retrieves reader states. You mentioned doing this in FacetsConfig, but it felt like that added a lot of new responsibility to the class, so I decided to keep it separate. Let me know if you still think it should go into FacetsConfig.

we could consider a different ctor for SSDVFacets that takes a FacetsConfig instance

I didn’t do this part because AFAICT to make a call to a Facets implementation, the user needs to select a field. Right now, they do this by encapsulating the field in the reader state. If the user passed a SsdvReaderStatesManager instead of a reader state, they would also need to pass the field name. It seems easier to just pass the reader state.

a single FacetsConfig instance could be used with different index readers while a refresh is happening

Interesting! I’ve accounted for this in SsdvReaderStatesManager, but haven’t yet implemented a “forgetting” mechanism for old IndexReaders. I think we can have at most 2 active IndexReaders, right? If so, we set something up along the lnes of LiveFieldValues, with an “old” map and a “current” map.

Mar 11 '23 19:03 stefanvodita

lucene lucene copied to clipboard

GH#11601: Add ability to compute reader states after refresh

lucene
lucene copied to clipboard