lucene Update created version major

Description

May 04 '25 04:05 rahulgoswami

As states on the dev list, I don't think we should do this.

May 04 '25 10:05 rmuir

As states on the dev list, I don't think we should do this.

Can you please elaborate why? I have tried to address the concern you brought up on the mailing list via the implementation in this PR.

The primary premise behind the API is that IF all segments of an index are created by the LATEST version, the index in all respects is LATEST. "indexCreatedVersionMajor" should ideally not block a Lucene upgrade in that case.

Happy to learn if I am missing anything.

May 04 '25 16:05 rahulgoswami

Because that's not what this variable means. If you want to change index backwards compatibility policy, lying about the version is not the way. -1

May 05 '25 08:05 rmuir

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

May 21 '25 00:05 github-actions[bot]

An index where all segments have minVersion=LATEST and version=LATEST is essentially an index with version LATEST.

Please note that this API does NOT require Lucene to perform the upgrade/reindexing part. That will be done by the upper layer (Solr, Elasticsearch, etc) , say, through a custom merge policy. The requirement from the Lucene API is "If all the segments have minVersion=version=Version.LATEST, please allow the user to work with the index irrespective of what the SegmentInfos.indexCreatedVersionMajor says".

[Approach 2] If the concern is to maintain tracking data regarding the origins of the index, SegmentInfos.indexCreatedVersionMajor could be left untouched. Instead a new property "currentIndexMajorVersion" could be introduced which sits parallel to indexCreatedVersionMajor

For a fresh index, currentIndexMajorVersion=indexCreatedVersionMajor=Version.LATEST

Upon an upgrade, if the upper layer (Solr, Elasticsearch, etc) successfully converts all segments to Version.LATEST, calling the IndexWriter.commitAndUpdateVersionCreatedMajor() (which will now be called commitAndUpdateCurrentMajorVersion() ) should set currentIndexMajorVersion=Version.LATEST, while indexCreatedVersionMajor remains untouched.

SegmentInfos.readCommit() now relies on SegmentInfos.currentIndexMajorVersion for codec compatibility checks.

@jpountz It would be valuable to get your thoughts on this too. Thank you.

Update (09/19/2025): Since https://github.com/apache/lucene/pull/15012 got merged earlier today, rewording certain parts of the requirement..."if all the segments have minVersion and version >= MIN_SUPPORTED_MAJOR , please allow the user to work with the index irrespective of what the SegmentInfos.indexCreatedVersionMajor says"

Sep 08 '25 04:09 rahulgoswami

[Approach 3] This is inspired by @msokolov 's idea at the recently concluded Community Over Code where I gave a talk on this topic ("Achieving zero downtime and zero cost index upgrade in Apache Solr").

Do we HAVE to rely on indexCreatedVersionMajor to decide index compatibility?

If in SegmentInfos.readCommit() there could be a check over the list of SegmentInfo and if all segments have minVersion and version >= MIN_SUPPORTED_MAJOR, it should be good to open. Considering that even for an average large index this would mean a few hundred iterations, would this one time check at startup time be a worthy tradeoff?

Sep 19 '25 04:09 rahulgoswami

Yes, I think your "Approach 3" is nice since it doesn't have us lying about the min created version, and it doesn't require rewriting anything immutable. It just changes our readers so they can read an index whose segments were written by compatible Lucene versions. It still prevents from reading indexes that were originally written with an index version < MIN_SUPPORTED_MAJOR.

Sep 23 '25 18:09 msokolov

Appreciate your inputs @msokolov . I will rework this PR with Approach 3 soon.

It still prevents from reading indexes that were originally written with an index version < MIN_SUPPORTED_MAJOR.

I guess you meant this already but just in the interest of complete clarity, it DOES allow reading indexes originally written with an index version < MIN_SUPPORTED_MAJOR provided that ALL of the current segments of the index have version and minVersion >= MIN_SUPPORTED_MAJOR (as opposed to relying on "indexCreatedVersionMajor" property in SegmentInfos today).

Sep 23 '25 19:09 rahulgoswami

I guess you meant this already but just in the interest of complete clarity, it DOES allow reading indexes originally written with an index version < MIN_SUPPORTED_MAJOR provided that ALL of the current segments of the index have version and minVersion >= MIN_SUPPORTED_MAJOR (as opposed to relying on "indexCreatedVersionMajor" property in SegmentInfos today).

This doesnt work. Sometimes we need to force a reindex. That's because the original thing encoded was lossy (e.g. norms) and we need to change the encoding, or because it was indexed incorrectly. all the merging in the world can't fix that.

Sep 23 '25 21:09 rmuir

This doesnt work. Sometimes we need to force a reindex. That's because the original thing encoded was lossy (e.g. norms) and we need to change the encoding, or because it was indexed incorrectly. all the merging in the world can't fix that.

For each older segment which is >=MIN_SUPPORTED_MAJOR but <LATEST, if it can be excluded from merges (eg: via a custom merge policy) and, in parallel, can be read and reindexed outside of Lucene (Eg: in Solr, etc ), the result would be a LOSSLESS index where ALL segments are LATEST version. Of course this only applies to any index where all fields are either stored or docValues enabled.

Such indexes can benefit from this kind of in-place reindexing giving the ability to keep "upgrading" even in case of any back incompatible changes like the norms change that happened in 7.x (as long as Lucene maintains readability for LATEST-1 version segments).

I did mention this basic premise for this PR on the mailing list, but in hindsight, could have re-stressed upon it. The Solr side effort is being tracked under https://issues.apache.org/jira/browse/SOLR-17725 .

Once Solr can achieve the state as mentioned above (aka all segments have been converted to LATEST version), today the only thing preventing the index from opening is the check in SegmentInfos based on indexCreatedVersionMajor (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L349).

The three approaches outlined in this PR offer alternatives to this check.

Sep 23 '25 22:09 rahulgoswami

Right, I forgot the key fact here that also came out during your presentation at ApacheCoC. Namely in the system where you want to use this, the documents have been completely reindexed from source that is stored in the index in a stored field (IIRC) so there can be no question of lingering data that was indexed by a previous version. So all the individual segments are "new version" segments and it is only the index-wide min-segment version that is preventing the reader from opening the index, which is really now just a piece of "ghost metadata" that preserves the fact that at one time there were some segments in this index that had been created by an older version of Lucene.

I think the important restriction to keep in place is that readers will still not be able to open indexes where any segment was originally written with an index version < MIN_SUPPORTED_MAJOR. The change is to have the reader make the decision based on the aggregate of the segment-level metadata (in SegmentInfo) rather than looking at the index-wide metadata (in SegmentInfos, I think).

Sep 24 '25 18:09 msokolov

@msokolov That's absolutely correct. You summarized every bit perfectly. Thank you.

So the breakdown of tasks for the complete solution looks like this:

Reindex all older segments with a custom merge policy (to prevent them from merging with LATEST version segments during the reindexing process) ==> to be done by Solr ==> Results in ALL segments being LATEST version
At index read time, perform a check to ensure none of the segments are < MIN_SUPPORTED_MAJOR. Don't rely on the immutable metadata ("indexCreatedVersionMajor") in SegmentInfos ==> to be done by Lucene.

Also, another requirement for the overall solution to work is that the user is not jumping across incompatible major versions. Eg: If an index is created in Lucene 11 (with MIN_SUPPORTED_MAJOR=10) and a breaking change happens in Lucene 13 (while MIN_SUPPORTED_MAJOR is still 10), then before moving to Lucene 14 (with now MIN_SUPPORTED_MAJOR=13), user would need to reindex on Lucene 13 via the above process. They can't jump versions, say from Lucene 12 to Lucene 14 and expect to reindex on 14 (since the index won't open for reading on 14).

Sep 25 '25 19:09 rahulgoswami

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

Oct 14 '25 03:10 github-actions[bot]

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

Oct 14 '25 04:10 github-actions[bot]

@msokolov @mikemccand Would appreciate if you could take a look please..

Oct 16 '25 04:10 rahulgoswami

@msokolov Thanks for taking the time to review this. I have refactored the PR with your suggestions.

Oct 25 '25 06:10 rahulgoswami

I just want to change the commit message and CHANGES doc-line to be more descriptive, something like:

Index open performs version check on each segment, ignores indexCreatedVersionMajor (#14607)

I will do this myself and push

Oct 27 '25 14:10 msokolov

I merged from command-line in order to change make the CHANGES.txt and commit message change. Maybe it would have been cleaner to do it here, sorry

Oct 27 '25 14:10 msokolov

If it would be helpful, I think we could also backport to 10x branch

Oct 27 '25 14:10 msokolov

If it would be helpful, I think we could also backport to 10x branch

yes pretty please! Solr is in the process of releasing 10x and I was hoping it could have this. Ideally Lucene 9x too if that's a possibility (wishful thinking?)

Thanks for merging this PR!

Oct 27 '25 14:10 rahulgoswami

OK, do you want to try? It's just a question of cherry-picking to branch_10x. I doubt there will be any 9x release. but you might be able to cut one for yourself?

Oct 27 '25 14:10 msokolov

Sure I'll cherry pick for 10x.

I doubt there will be any 9x release. but you might be able to cut one for yourself?

Is Lucene 9x already marked EOL ? I assumed since 11x is not out yet, 9x must still be receiving fixes (?) .

Oct 27 '25 15:10 rahulgoswami

hmm, I had to revert because of https://github.com/apache/lucene/issues/15374 ... I don't totally grok what is going on there, but it seems that we attempt to read segments with version 7.4 and fail because we don't have the backwards-compat codec

Oct 27 '25 16:10 msokolov

Investigating https://github.com/apache/lucene/issues/15374. As communicated there, seems like this is only failing for the nightly build. gradle check ran fine with the commit.

Oct 27 '25 16:10 rahulgoswami

Yeah seems like the error is since the lucene-backward-codecs.jar doesn't contain codecs for 7.x anymore. The nightly test was succeeding earlier since we were bailing out by looking at indexCreatedVersionMajor in SegmentInfos straightaway. With the new change we are looking into each segment which needs the codec to open the segment.

    Caused by:
    java.lang.IllegalArgumentException: An SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene70' does not exist.  You need to add the corresponding JAR file supporting this SPI to your classpath.  The current classpath supports the following names: [Lucene104, Asserting, CheapBastard, DeflateWithPresetCompressingStoredFieldsData, FastCompressingStoredFieldsData, FastDecompressionCompressingStoredFieldsData, HighCompressionCompressingStoredFieldsData, LZ4WithPresetCompressingStoredFieldsData, DummyCompressingStoredFieldsData, ConfigurableMCodec, SimpleText, Lucene80, Lucene84, Lucene86, Lucene87, Lucene90, Lucene91, Lucene92, Lucene94, Lucene95, Lucene99, Lucene912, Lucene100, Lucene101, Lucene103]
        at [email protected]/org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:113)
        at [email protected]/org.apache.lucene.codecs.Codec.forName(Codec.java:119)
        at [email protected]/org.apache.lucene.index.SegmentInfos.readCodec(SegmentInfos.java:528)
        ... 47 more

Oct 27 '25 16:10 rahulgoswami

I am thinking maybe this needs exception handling in the code in case of missing codec (in which case index is too old anyway) and/or corresponding change in the exception message during assert check in the test. I'll revisit this tonight. Thanks.

Oct 27 '25 17:10 rahulgoswami

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

Oct 28 '25 05:10 github-actions[bot]

A little under the weather, but I don't want to lose the momentum on this. Thanks so much for helping navigate this PR so far Mike. I'll have a revision today based on our discussion.

Oct 30 '25 05:10 rahulgoswami

sorry, hope you feel better! no pressure, take your time

Oct 30 '25 14:10 msokolov

I bumped up the min version of 'segments' codec to VERSION_86 in CodecUtil.checkHeaderNoMagic(). And updated the test to be more specific. This still maintains backward compatibility with Lucene 10. And even if cherry-picked to 10, will continue to maintain backward compatibility with 9x.

The failed nightly passes fine now. Thanks for the hint @msokolov !

This helps us avoid the dilemma of a lucene-backward-codecs.jar actually not being present in the class path vs an actual unsupported index. I feel good about this one.

Tried running "gradlew test" with -Dtests.nightly=true but had to terminate as it was taking awfully long.

Nov 12 '25 08:11 rahulgoswami