lucene
lucene copied to clipboard
GH#11601: Add ability to compute reader states after refresh
This PR is only meant as a starting point for a conversation and not to be merged as is.
It is a proposal to let users retrieve new SortedSedDocValuesReaderState
objects after doing a refresh in a single method call to the ReferenceManager
. The new reader states will have updated ordinal maps.
Problems this solves:
- Getting reader states that correspond to the current
IndexReader
. We ensure this by using therefreshLock
. - Avoid a circular dependency from the core module on the facets module by using the
StateCalculator
functional interface.
Issue https://github.com/apache/lucene/issues/11601
How is it different from making a subclass/alternative of ReaderManager and just put the state calculation inside refreshIfNeed
?
How is it different from making a subclass/alternative of ReaderManager and just put the state calculation inside
refreshIfNeed
?
I think there’s two things that would be different. Please correct me if I’m wrong.
- If we simply override
refreshIfNeeded
, we don’t have a way to return the new reader states. We would still need to create some new method that does the refresh and returns the reader states. - Doing this in the
ReferenceManager
makes it work for bothReaderManager
andSearcherManager
with the rightStateCalculator
for each. I don’t know how valuable that is though.
For 1. I think we can always create a ReferenceManager of IndexReaderAndState
so that those two are always refreshed together and you can acquire them together.
For 2. I think people are discussing about deprecating SearcherManager right now (because we're having a timeout per indexSearcher which makes it not reusable) so I think there's not much value in it, but I could be wrong..
Thinking some more about this change in light of the comments, I see 3 ways it could go:
- Do nothing. The user can already instantiate new reader states after a refresh if they want. No new API is required.
- What this PR tried - offer an API to refresh and compute new reader states as one operation. @mikemccand points out this is likely not useful.
- Allow the user to update the ordinal maps in the reader states they already have without requiring them to completely recreate the reader states. I’m not sure how much this accomplishes. The Javadoc suggests that the bulk of the work in instantiating a SortedSetDocValuesReaderState is building the ordinal map. In that case, updating the ordinal maps instead of recreating the reader states doesn’t have much benefit.
Do any of these seem like reasonable approaches to address the original issue? If not, how else could we approach this?
3. Allow the user to update the ordinal maps in the reader states they already have without requiring them to completely recreate the reader states. I’m not sure how much this accomplishes. The Javadoc suggests that the bulk of the work in instantiating a SortedSetDocValuesReaderState is building the ordinal map. In that case, updating the ordinal maps instead of recreating the reader states doesn’t have much benefit.
I think this is not a great option (in-place update) because it'd break Lucene's point-in-time semantics. It would be nice to be able to create a new OrdinalMap
for a refreshed reader by passing in the old OrdinalMap
and being more "incremental" in how the new one is created (building on, but not altering, the old one)? But that is really a separate issue and is quite complex I think :)
- Do nothing. The user can already instantiate new reader states after a refresh if they want. No new API is required.
+1 for this option -- I don't see why this is a problem for users/applications. No need to do this atomically under the refresh lock.
That makes sense. Maybe I'm not addressing the right problem. @gsmiller - as the issue's author, what do you think?
@mikemccand’s feedback made me shift focus from rebuilding ordinal maps on refresh to just making it convenient for the user to do so. SsdvReaderStatesManager
shows how that might be done.
Sorry @stefanvodita, just now coming back to this after a break.
OK, so current state-of-the-world: Right now, users have to instantiate reader state instances for each field they're using for SSDV faceting so they can provide those to SSDFacets. These reader state instances must be re-created when the index is refreshed, and users have to manage this on their own. Creating these state instances is expensive, so users need a way to keep track of them, but also refresh them when necessary. To be fair though, users have all the capabilities they need to do efficient SSDV faceting, it's just a lot to keep track of for them (in my opinion anyway).
So I think the real question here is if we can do something that makes users' lives easier w.r.t. keeping track of these states and refreshing them when necessary. I think Mike's point-of-view is that maybe this isn't such a big deal and may not be necessary?
If we do want to explore ways to make this interaction easier for users, I think FacetsConfig
is the only realistic "hook" in some sense. One issue is that there is actually no current place where we can know what SSDV fields in the index are being used for faceting. We can't rely on the "dim config" exposed by FacetsConfig since users are not required to register any specific dim config for a faceting field if they're happy with the defaults.
I wonder if one option to explore is to have a FacetsConfig
instance keep track of the SSDV fields being used for faceting when it builds docs? That's the one place we actually know that an SSDV field is being added for the purpose of faceting. If the FacetsConfig
instance kept track of SSDV faceting fields in this way, we could add a helper method to FacetsConfig
that takes an IndexReader
as input and creates a new set of reader states. Then, we could consider a different ctor for SSDVFacets that takes a FacetsConfig
instance instead of taking a state instance directly, and the FacetsConfig
instance could provide the correct state instance. I think there's some complexity in here though since a single FacetsConfig
instance could be used with different index readers while a refresh is happening, so it would probably need to keep track of a state instance for each SSDV field + IndexReader or something. I don't know. This might all just be too complicated to be worth it, but it would be simpler for users if all they had to do was provide their FacetsConfig
instance to SSDVFacets and then call some method on that FacetsConfig
instance whenever they refresh their index.
Thanks for the feedback @gmiller! The way you describe the state-of-the-world makes perfect sense to me, I think we’re on the same page there.
What I was trying to do with SsdvReaderStatesManager
is similar to what you suggest. I definitely made that harder to get across by not removing code from old revisions, but I’ve fixed that now.
Highlights from the new revision:
- In
FacetsConfig
I’ve made it so that we store and are able to recall which fields are used for faceting. -
SsdvReaderStatesManager
stores and retrieves reader states. You mentioned doing this inFacetsConfig
, but it felt like that added a lot of new responsibility to the class, so I decided to keep it separate. Let me know if you still think it should go intoFacetsConfig
.
we could consider a different ctor for SSDVFacets that takes a FacetsConfig instance
I didn’t do this part because AFAICT to make a call to a Facets implementation, the user needs to select a field. Right now, they do this by encapsulating the field in the reader state. If the user passed a SsdvReaderStatesManager
instead of a reader state, they would also need to pass the field name. It seems easier to just pass the reader state.
a single FacetsConfig instance could be used with different index readers while a refresh is happening
Interesting! I’ve accounted for this in SsdvReaderStatesManager
, but haven’t yet implemented a “forgetting” mechanism for old IndexReader
s. I think we can have at most 2 active IndexReader
s, right? If so, we set something up along the lnes of LiveFieldValues
, with an “old” map and a “current” map.