Relax Lucene Index Upgrade Policy to Allow Safe Upgrades Across Multiple Major Versions

Open markrmiller opened this issue 1 year ago • 7 comments

Description

TLDR: Relax index upgrade policy across major versions to only be as strict as necessary.

Here is an attempted summary of a recent discussion about this.

Currently, Lucene's policy requires a full reindex when upgrading across more than one major version, which can create significant friction for users with large indexes. We propose relaxing this policy to allow upgrades across multiple major versions when it is safe to do so. The goal is to give users more flexibility without compromising data integrity.

Proposed Changes:

  • Modify Upgrade Policy: Allow upgrades across multiple major versions, replacing the existing restriction with a configurable MIN_SUPPORTED_MAJOR version in Version.java.

  • Controlled Version Bumping: Bump MIN_SUPPORTED_MAJOR only when necessary, due to index format changes that prevent safe upgrades (e.g., changes to norms encoding).

  • Improved Documentation: Clearly document which versions can be safely upgraded to the current version without reindexing.

  • Retain Reindexing When Necessary: Ensure that reindexing is still required when necessary to maintain correctness or prevent the propagation of corruption.

Benefits:

  • Reduces friction and operational overhead for users with large indexes.

  • Facilitates more frequent major releases by reducing mandatory reindexing.

  • Maintains safety and integrity by reindexing only when required.

Implementation Plan:

  • Modify Version.java to use a configurable MIN_SUPPORTED_MAJOR.

  • Update the index upgrade logic to check against MIN_SUPPORTED_MAJOR rather than just the previous major version.

  • Enhance documentation to provide clear guidelines on safe upgrade paths and scenarios requiring reindexing.
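As a rough illustration of the difference between the two checks, here is a minimal sketch. The class and method names are hypothetical stand-ins, not Lucene's actual `Version` class, and the pinned value of 10 assumes no format break since Lucene 10.

```java
/** Hypothetical stand-in for the open-time check; not Lucene's actual Version class. */
final class UpgradePolicySketch {
    /** Existing rule as described above: the minimum trails the latest major by one. */
    static int computedMinimum(int latestMajor) {
        return latestMajor - 1;
    }

    /** An index may be opened when its creation major is at or above the minimum. */
    static boolean canOpen(int indexCreatedMajor, int minSupportedMajor) {
        return indexCreatedMajor >= minSupportedMajor;
    }

    public static void main(String[] args) {
        int latestMajor = 14;
        int pinnedMinimum = 10; // assumption: no on-disk break since major 10
        int createdMajor = 12;
        System.out.println("previous-major rule opens it: "
            + canOpen(createdMajor, computedMinimum(latestMajor)));
        System.out.println("pinned-minimum rule opens it: "
            + canOpen(createdMajor, pinnedMinimum));
    }
}
```

Under these assumptions, only the value compared against changes; the comparison itself stays the same.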

Request for Feedback: We welcome feedback from the community on this proposal, especially regarding its potential impact, implementation details, and any concerns about safety and backward compatibility.

  • Note: Upgrading from Lucene 20 to Lucene 23 would require first going from 20 to 21, then from 21 to 22, and then from 22 to 23, unless a change in one of those versions prevented that, in which case a reindex would be required.

markrmiller avatar Sep 16 '24 18:09 markrmiller

I have had many discussions on this topic of file format bw compat over the years, because users would ideally like to think of their indexes as never expiring. If this is the problem that should be solved, then there are two main options that I can think of:

  • increasing backward compatibility of already written data,
  • performing a periodic transparent background reindexing.

I have developed a preference for the second option. It is cheap in hardware costs when you compare the cost of storing an index for ~3 years (which is about the duration of our backward compatibility window) with the cost of reindexing the same index. And it comes with the great benefit that it can also be taken as an opportunity to index data in a more modern way (e.g., switching from trie fields to points in Lucene 6, switching scoring factors from doc values to FeatureField in Lucene 8, enabling vector search in addition to lexical search in Lucene 9, enabling sparse indexing in Lucene 10, etc.).

The way I'm thinking of it is that you would create a point-in-time view of your index, reindex it into a new index, stop the world while you're replaying operations since the point-in-time view was taken and points are swapped from the old index to the new index, and finally remove the old index. Given the required orchestration that is needed, it's probably best solved on top of Lucene (in Solr, Elasticsearch, or luceneserver), but we could look into adding tooling for this in Lucene?
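The flow described above can be modelled in a few lines. This is a toy sketch only: plain maps stand in for indexes, every name is hypothetical, and pausing writes ("stop the world") is assumed to be handled by the caller.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model only: plain maps stand in for indexes, and all names are hypothetical. */
final class BackgroundReindexSketch {
    /** A write logged after the snapshot was taken; a null value means "delete". */
    static final class Op {
        final String id;
        final String value;

        Op(String id, String value) {
            this.id = id;
            this.value = value;
        }
    }

    /** The pointer that searchers follow; swapping it is the cut-over. */
    static final AtomicReference<Map<String, String>> live = new AtomicReference<>();

    /** Rebuild from the snapshot, replay the logged writes, then swap the pointer. */
    static Map<String, String> cutOver(Map<String, String> snapshot, List<Op> opsSinceSnapshot) {
        // Reindex: a real system would re-analyze documents with the new version here.
        Map<String, String> fresh = new HashMap<>(snapshot);
        // Writes are assumed paused by the caller; replay what the snapshot missed.
        for (Op op : opsSinceSnapshot) {
            if (op.value == null) {
                fresh.remove(op.id);
            } else {
                fresh.put(op.id, op.value);
            }
        }
        live.set(fresh); // searchers now see the new index; the old one can be removed
        return fresh;
    }

    public static void main(String[] args) {
        Map<String, String> snapshot = Map.of("1", "a", "2", "b"); // point-in-time view
        List<Op> log = List.of(new Op("2", null), new Op("3", "c"));
        System.out.println(cutOver(snapshot, log));
    }
}
```

The orchestration around the swap (write logging, pausing, cleanup) is where most of the real work lies, which fits the suggestion that this belongs on top of Lucene.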

That said, I think there are benefits to your suggestion of decoupling major versions from backward compatibility; I would just use it to make it easier for us to do more frequent major versions without shortening our backward compatibility window, rather than to increase our backward compatibility window?

jpountz avatar Sep 18 '24 12:09 jpountz

Thank you, Adrien, for your thoughtful response and for sharing your expertise on this topic. Your insights are valuable, and I'd like to address a few points and seek some clarification.

First, I want to emphasize that the two approaches we're discussing - relaxing the upgrade policy and implementing background reindexing - are not mutually exclusive. Both have merit and could potentially be implemented to serve different use cases and user needs.

Relaxed Upgrade Policy: This approach aims to reduce friction for upgrades by allowing them across multiple major versions when safe to do so.

Background Reindexing: This method, as you've outlined, provides a path for long-term index modernization and feature adoption.

I'd like to clarify that our original proposal isn't about extending the backward compatibility window. Rather, it's about allowing index upgrades as long as backward compatibility hasn't been broken - essentially making the upgrade check only as strict as necessary. This doesn't change any promises about the backward compatibility window itself. Could you elaborate on your concerns about extending the backward compatibility window? While that's not our intention, understanding these concerns could be useful.

Given that these approaches serve different purposes and timeframes, I believe there's value in considering both:

The relaxed upgrade policy could provide immediate benefits with relatively low development and operational costs. The background reindexing solution offers long-term benefits for feature adoption and index modernization, albeit with higher development and operational costs.

Implementing both could provide flexibility for users with different needs and resources. Users could benefit from easier upgrades in the short term while having a path to adopt new features when they're ready.

Questions

  • Could you share more about your concerns regarding the relaxed upgrade policy? Are there specific technical or operational issues you foresee?
  • Do you see any conflicts or problems with implementing both approaches?
  • Would you be open to a phased approach, where we implement the relaxed upgrade policy first and then work on tooling for background reindexing?

markrmiller avatar Oct 12 '24 21:10 markrmiller

This is an interesting proposal, and I like the idea of making version upgrades more streamlined. However, I'm a bit confused with how the proposed mechanism should play out. Could you help me understand with an example?

Suppose we were to implement this today, we would set MIN_SUPPORTED_MAJOR = 10 to correspond with Lucene 11. If there are no breaking changes in Lucene 12, 13, and 14 we would not change this value. When we make an index format change, say in Lucene 15.0.0, I assume by your proposal, we would set MIN_SUPPORTED_MAJOR to 14, and only ensure backward compatibility logic b/w v14 and v15?

Now when Lucene 16.0.0 rolls out, what happens to MIN_SUPPORTED_MAJOR version value? If it is kept at 14, then v16 will need to carry forward the backward compatibility logic. Today, by design, we only need to do it for the last release. Is the idea that we'll upgrade MIN_SUPPORTED_MAJOR to 15, even if there is nothing breaking b/w 15 and 16? Remembering this logic feels trappy? Or maybe I'm just confused with how this will work. Hopefully you have a better example :)

vigyasharma avatar May 05 '25 07:05 vigyasharma

Thank you for your question, Vigya.

Will v16 still need to ship the 14-format code?

Yes—until a new on-disk break forces MIN_SUPPORTED_MAJOR to move again.

Do we ever bump even when nothing broke?

No. We bump only on an incompatible format change; otherwise, the constant stays put. With the caveat that nothing is promised to the user beyond one major version, and something could come up that a developer decides warrants bumping anyway.

Let me try to clarify with a concrete example.

The idea is that MIN_SUPPORTED_MAJOR represents "the oldest major version that the current version can read directly" - not a version-by-version compatibility chain.

Here's how it would work:

Let's say we implement this in Lucene 11 and set MIN_SUPPORTED_MAJOR = 10, meaning Lucene 11 can read Lucene 10 indexes directly.

  • If Lucene 12 has no breaking index format changes, we'd keep MIN_SUPPORTED_MAJOR = 10
  • If Lucene 13 also has no breaking changes, it stays at 10
  • If Lucene 14 has no breaking changes, still 10

Now, if Lucene 15 introduces a breaking index format change (e.g., new norms encoding), we'd set MIN_SUPPORTED_MAJOR = 14. This means Lucene 15 can read indexes from version 14 and no older.

When Lucene 16 comes along:

  • If there are no breaking changes, MIN_SUPPORTED_MAJOR remains 14
  • If there are breaking changes, we'd bump to MIN_SUPPORTED_MAJOR = 15

So you would need to retain codec read code for versions ≥ MIN_SUPPORTED_MAJOR, but the assumption is this would typically be a low cost for the user benefit. Where it’s not, there is no promise, and MIN_SUPPORTED_MAJOR could be raised even where no break has forced it as a kind of exceptional case. Again, no promises have been made to users.
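To make the timeline concrete, here is a small sketch that replays the example. It assumes the simple rule "on a breaking major N, set the constant to N - 1" and ignores the exceptional early bump; the class and method names are hypothetical.

```java
import java.util.Set;

/** Hypothetical helper that replays the example timeline; not part of Lucene. */
final class LazyMinimumTimeline {
    /**
     * Value of MIN_SUPPORTED_MAJOR as of release {@code major}, starting from
     * {@code initial} at {@code firstMajor} and moving to (N - 1) whenever a
     * breaking major N ships.
     */
    static int minSupportedAt(int major, int firstMajor, int initial, Set<Integer> breakingMajors) {
        int min = initial;
        for (int m = firstMajor; m <= major; m++) {
            if (breakingMajors.contains(m)) {
                min = m - 1;
            }
        }
        return min;
    }

    public static void main(String[] args) {
        Set<Integer> breaks = Set.of(15); // assumption: only Lucene 15 breaks the format
        for (int major = 11; major <= 16; major++) {
            System.out.println("Lucene " + major + " -> MIN_SUPPORTED_MAJOR = "
                + minSupportedAt(major, 11, 10, breaks));
        }
    }
}
```

With a break only at 15, the constant stays at 10 through Lucene 14 and is 14 for both 15 and 16; adding 16 to the set of breaks moves it to 15 at that release.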

Does that help clarify the proposal? I think your concerns are valid. We would definitely need to ensure this mechanism is well-documented and clearly understood. Still, it doesn’t seem like a significant lift given what is already required to make changes and ensure one major release back compat.

markrmiller avatar May 09 '25 04:05 markrmiller

Thanks for the example Mark, that was helpful. I think the proposal makes sense. It makes upgrades across versions easier without adding significant backward compatibility overhead.

I'm not sure I fully grok Adrien's concerns. It seems to me that this change doesn't really increase backward compatibility of already written data. We can bump up MIN_SUPPORTED_MAJOR whenever breaking changes are made, effectively only keeping compatibility with the last format change. Users will still need to reindex when they jump multiple incompatible versions. We just don't force a reindex when it's not needed. Am I missing something?

vigyasharma avatar May 11 '25 00:05 vigyasharma

I think there may be confusion around:

  1. minimum created version <-- reflects lucene version that first created the index
  2. minimum version of any segment <-- reflects what actual backwards compat code we need to support.

The first one here, we can be lazy about and only bump when certain rarer changes are made (such as lossy parts around norms, maybe corruption bugs). Today we "bump it" implicitly even if there isn't a good reason. I think the changes that require this are rare, but we do need to retain the facility to make such changes.

If we are lazy about the minimum created version, and only bump it when we need to, users can merge segments, rather than reindex, to get to the latest version in most cases. And it doesn't require additional costs such as dragging extra backwards codecs around.
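The distinction between the two minimums can be sketched with a tiny model. It is purely illustrative: the class and its fields are hypothetical and only mirror the idea that merging rewrites segments while the creation version stays fixed.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical model of an index's two version markers; not Lucene's SegmentInfos. */
final class IndexVersionsSketch {
    final int createdMajor;            // written once, never changes
    final List<Integer> segmentMajors; // major version that wrote each segment

    IndexVersionsSketch(int createdMajor, List<Integer> segmentMajors) {
        this.createdMajor = createdMajor;
        this.segmentMajors = List.copyOf(segmentMajors);
    }

    /** Oldest segment format present: this drives which codec readers are needed. */
    int minSegmentMajor() {
        return Collections.min(segmentMajors);
    }

    /** Merging rewrites every segment with the running version; createdMajor is untouched. */
    IndexVersionsSketch mergeAllWith(int runningMajor) {
        return new IndexVersionsSketch(createdMajor, List.of(runningMajor));
    }

    public static void main(String[] args) {
        IndexVersionsSketch index = new IndexVersionsSketch(10, List.of(10, 11));
        IndexVersionsSketch merged = index.mergeAllWith(11);
        System.out.println("created=" + merged.createdMajor
            + " minSegment=" + merged.minSegmentMajor());
    }
}
```

In this model a merge under version 11 leaves the creation major at 10 while the oldest segment becomes 11, which is why only the creation-version minimum decides whether a reindex is forced.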

rmuir avatar May 11 '25 01:05 rmuir

Yeah, that would be preferable to a bunch of versioned code that essentially all reads the same format. I suppose I was thinking of using the config for "what is the minimum version I can read" over "what version was this written with" because my immediate thought was you are lying about the version with the latter - but you would of course just consider it the index format version it was written with rather than the actual Lucene version.

markrmiller avatar May 11 '25 02:05 markrmiller

Hi all,

So the proposal is a simple change: stop forcing a full re‑index on every major release unless we actually break the on‑disk format. All we need is to maintain one constant manually instead of recomputing LATEST.major ‑ 1.


1 Motivation


  • Today MIN_SUPPORTED_MAJOR auto‑advances on every major. Example: Lucene 14 refuses any index created by Lucene 12 or older—even if you’ve merged it under 13.
  • Most majors don’t break format, so we make users re‑index for no technical reason.
  • Pinning the constant to the last breaking version lets users skip majors when it’s safe.

2 Two Compatibility Gates


A. Index‑Opening Gate (creation version)

indexCreatedVersionMajor is written once at the first commit and never changes. Lucene opens an index only if

indexCreatedVersionMajor ≥ MIN_SUPPORTED_MAJOR

B. Codec‑Reader Gate (segment format)

Lucene ships codec readers for current + previous major versions (e.g. 14 & 13). If no format break happened between 12 → 13, the “13” reader also parses 12 segments. Merging rewrites segments to the current codec but does not change the creation version.

Key point: merging can satisfy the codec‑reader gate, but the creation‑version gate still applies. Relaxing that gate is exactly what this proposal does.


3 Policy (“Lazy MIN_SUPPORTED_MAJOR”)


  1. Evaluate MIN_SUPPORTED_MAJOR at every major release.

    • Lucene 11.0 sets public static final int MIN_SUPPORTED_MAJOR = 10;

    • Lucene 12, 13 & 14 review the value but keep it at 10 as long as no on‑disk break occurs.

    • If a breaking change lands in Lucene 15.0, we bump the constant to 14 (the last compatible major).

    • But if we confirm that no format breaks occurred all the way from 9 → 10 → 11, we could choose to set the constant even lower, e.g. 9. (That would give users on Lucene 9 indexes a direct path to 11+, at the cost of extra testing; opinions welcome.)

  2. Release‑manager checklist

    • “Did any commit since the last major introduce an incompatible on‑disk change?”
    • If yes → bump the constant. If no → leave it.
  3. Scope unchanged

    • We still ship codec readers for current + previous major only.
    • We may bump the constant without a break if support costs spike—no new promises are made.

4 Upgrade Scenarios (after change)


From → To | Format break? | MIN_SUPPORTED_MAJOR | Opens? | What you do
10 → 14   | No            | 10                  | Yes    | Open on 14; optional forceMerge
10 → 14   | Break at 13   | 12                  | No     | Re‑index (creation 10 < 12)
13 → 15   | Break at 15   | 14                  | No     | Re‑index (creation 13 < 14)
13 → 14   | No            | 10                  | Yes    | Opens fine

5 Code Changes


  • Version.java Replace the computed constant with a literal and explanatory Javadoc: public static final int MIN_SUPPORTED_MAJOR = 10;

  • Tests

    • Update back‑compat tests to use the constant.
    • Add a test that checks upgrades from allowed versions.
  • No other core changes needed — all open‑time checks already funnel through Version.MIN_SUPPORTED_MAJOR.


6 Documentation Changes


  • CHANGES.txt — record the new policy and initial value (10).
  • MIGRATE.md — add the table above plus a worked example (10 → 14).
  • dev‑docs/releaseWizard.yaml — add the release‑manager checklist.
  • Javadoc on Version.MIN_SUPPORTED_MAJOR — include rationale, bump criteria, and examples.
  • Error messages: IndexFormatTooOldException shows creation version vs. minimum and suggests actions.
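For the error-message item, something along these lines might work. This is a sketch only: the wording is invented for illustration, and `IllegalStateException` stands in for Lucene's `IndexFormatTooOldException`, whose constructors are not used here.

```java
/** Hypothetical message builder; the wording and class are illustrative only. */
final class TooOldMessageSketch {
    /** Builds a message that names both versions and the action to take. */
    static String describe(int createdMajor, int minSupportedMajor) {
        return "Index was created by major version " + createdMajor
            + "; this release opens indexes created by major version " + minSupportedMajor
            + " or later. Merging does not change the creation version, so a reindex is required.";
    }

    /** Stand-in for the open-time gate; throws when the index is too old. */
    static void checkOpen(int createdMajor, int minSupportedMajor) {
        if (createdMajor < minSupportedMajor) {
            throw new IllegalStateException(describe(createdMajor, minSupportedMajor));
        }
    }

    public static void main(String[] args) {
        try {
            checkOpen(10, 12);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Stating both numbers and the required action in one message points the user straight at the reindex path.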

I've got an initial PR for this up as 15012.

— Mark

markrmiller avatar Jul 30 '25 14:07 markrmiller

@rmuir does this match what you were thinking?

markrmiller avatar Jul 31 '25 04:07 markrmiller

Other than putting an extra effort on the shoulders of the release manager, I don't have any strong opinion on this. Uwe (I think) mentioned that we should increase the cadence of majors to align it with new Java features - if we're to follow this plan then this patch seems even more appealing because we can keep indexes compatible for longer (assuming nothing significant changes).

dweiss avatar Jul 31 '25 09:07 dweiss

I'd like to understand the practical impact of this change. To answer the question "Would it really help?" can we say what impact it would have had on past releases if we had had this policy in place all along? Or said another way: what index format changes have we made that would have caused a MIN_SUPPORTED_MAJOR bump in the past? I assume that adding new formats wouldn't cause a bump, and as long as we can cleanly read and merge from an old format, we wouldn't have to bump either. Given that, I don't think we introduced any incompatibilities in 9 or 10, did we? Is it really true that the only issue we can point to like this is the norms change that happened before my time being active here, or were there other things?

msokolov avatar Jul 31 '25 12:07 msokolov

One perhaps minor comment about the codec reader gate is that the reading side already supports reading from N-2 fully. I am not sure this distinction was made above. Lucene 10 reads indices created in Lucene 8.0 and onwards. That is possible via an expert DirectoryReader#open method that takes a min supported major argument (see https://github.com/apache/lucene/commit/c1ae6dc07c9a988533cbe7176bdeb49e2fca1d9c). 8.0 codecs are still shipped with Lucene 10, hence no change is really needed there. We may decide to leave 8.0 codecs around in Lucene 11 as well if we want to ship it soon? The only thing we'll need to do is indeed flip the min supported major constant so that reading from N-2 (or -3?) does not require a separate API call.

The writing side is what requires changes as far as I understand, which I think are covered by Mark's comment above.

javanna avatar Aug 12 '25 16:08 javanna

I'm only aware of the norms change. I wouldn't be surprised if there was something else, but I believe the majority of major releases have had no breaking changes. @rmuir @mikemccand @uschindler @jpountz might know of other specific breaks?

markrmiller avatar Aug 12 '25 16:08 markrmiller

One perhaps minor comment about the codec reader gate is that the reading side already supports reading from N-2 fully. I am not sure this distinction was made above. Lucene 10 reads indices created in Lucene 8.0 and onwards. That is possible via an expert DirectoryReader#open method that takes a min supported major argument (see c1ae6dc). 8.0 codecs are still shipped with Lucene 10, hence no change is really needed there. We may decide to leave 8.0 codecs around in Lucene 11 as well if we want to ship it soon? The only thing we'll need to do is indeed flip the min supported major constant so that reading from N-2 (or -3?) does not require a separate API call.

I hate this too. We all carry the burden of extra back compat, but it doesn't work for ordinary users, only big companies who know how to hold the expert API.

rmuir avatar Aug 12 '25 16:08 rmuir

an expert DirectoryReader#open method that takes a min supported major argument

Ah, thanks, I didn't know that method existed.

The idea of the PR I currently have up, is that you would only need to keep around these older codecs if the current codec can't actually read those versions - the idea being that if you didn't break anything for 5 major releases, you wouldn't have to include 5 previous codecs, unless you actually required a previous codec to read the index.

markrmiller avatar Aug 13 '25 05:08 markrmiller

Other than putting an extra effort on the shoulders of the release manager

Ideally, if anything, it would be a slightly smaller burden, as they would no longer increment the min supported version. It would be incremented when a change goes in that requires it. Given the rarity and nature of such a change, it's doubtful that it would commonly be missed, I think, unlike more common breaks that sometimes fail to make their way into MIGRATE.md, for example. Unless the change was snuck in and nobody knew about it.

But even if a break went in and it was missed, it wouldn't really be a very big deal. There is no promise you get this kind of upgrade, it would be noted when it became apparent, and I'm still supportive of the idea of documenting this path as taking on some possible danger vs the current options. Reindexing is always best if you can manage it.

markrmiller avatar Aug 13 '25 07:08 markrmiller

Basically, with that IndexWriterConfig method you have everything that's needed to rewrite your index to a newer version (using IndexUpgrader and IndexUpgraderMergePolicy). That tool has existed for a long time: https://lucene.apache.org/core/10_2_2/core/org/apache/lucene/index/IndexUpgrader.html

The only problem is that the "index created" version is coded into the index metadata (segments file). By using the DirectoryReader#open method at https://lucene.apache.org/core/10_2_2/core/org/apache/lucene/index/DirectoryReader.html#open(org.apache.lucene.index.IndexCommit,int,java.util.Comparator) you can then trick the reader into opening an index originally created with a too-low version.

I still think the best would be to add an option in the IndexUpgrader tool to "modify" the index created version (on special request), losing semantic interoperability. That's easy: open the DirectoryReader (as seen above) and then use something like IndexWriter#addIndexes(directoryReader.leaves()) to rewrite the index to a completely new index while also preserving the original segment structure. This is what I generally recommend to my customers who know the risk (e.g., they don't care about offsets or the scoring does not matter).

uschindler avatar Aug 13 '25 08:08 uschindler

Uwe, thanks for weighing in—and for the pointers to IndexUpgrader and the expert DirectoryReader#open(.., minSupportedMajor, ..). I want to make sure I understand your position: are you opposed to the proposal itself, or mainly highlighting that power users already have a path today? I’m asking because the core goal here is to make the safe path the default path for everyone, without encouraging “metadata surgery.”

Why not ask users to overwrite/“trick” the created version?

Overwriting the index’s created-version (or using the expert API to sidestep the open check) puts risk on users in ways that are hard to audit:

  • Silent break risk vs explicit guardrails. Some on-disk changes don’t hard-fail; they manifest as subtle scoring shifts, offsets/positions oddities, or retrieval regressions. If we bless multi-major opens only when no on-disk break occurred, the author of the break must consciously bump MIN_SUPPORTED_MAJOR in the same PR. That creates a crisp signal to users, rather than leaving users to guess whether their “hop” is safe.

  • Accessibility. As Robert noted, the expert API path works for sophisticated deployments, but ordinary users don’t reach for it (and shouldn’t have to). Making the policy explicit removes the “tribal knowledge” hurdle.

  • Operational clarity. “My index opens but results look off” is far harder to debug than “Lucene refused to open because created-version < minimum.” The latter points you straight to reindex; the former can burn days.

  • We aren’t expanding promises. This doesn’t extend the backcompat window. We still reserve the right to bump the constant if support costs spike or if a real on-disk break lands. We’re just aligning the open gate with actual breaks, not calendar majors.

Why support the policy change?

Breaks like norms layout changes are rare; most majors don’t alter on-disk format. Keeping MIN_SUPPORTED_MAJOR pinned to the last real break:

  • Removes unnecessary reindexing when nothing incompatible changed.
  • Reduces upgrade friction for users doing multi-major hops that are actually safe.
  • Keeps developer cost low: the author of any on-disk break bumps the constant in the same PR.
  • Plays nicely with faster major cadence (e.g., aligning with new Java features) without penalizing users on storage and downtime.
  • Doesn’t balloon codecs. We still ship current + previous readers.

Complementary to reindexing (not a substitute)

I’m +1 on background reindexing for long-term modernization. This type of upgrade is targeted at the many installs where reindexing is a very high cost and that cost outweighs the desire for performance/features that one would get access to via a reindex.

:bicyclist:

Uwe, are you fundamentally against adopting the “lazy MIN_SUPPORTED_MAJOR” policy (bump only on true on-disk breaks)?

— Mark

markrmiller avatar Aug 22 '25 23:08 markrmiller

+1 on background reindexing for long-term modernization and moving forward with this right now.

@uschindler - pinging you in case you missed Mark's message :)

Uwe, are you fundamentally against adopting the “lazy MIN_SUPPORTED_MAJOR” policy (bump only on true on-disk breaks)?

anshumg avatar Sep 03 '25 18:09 anshumg

+1 to this proposal. Also based on the conversation so far, would it be reasonable to set MIN_SUPPORTED_MAJOR=9 (or 8?) for Lucene 11 ?

rahulgoswami avatar Sep 07 '25 03:09 rahulgoswami

FYI: I responded to the above in the PR. TLDR: I think we could, though going backward is a bit different than going forward as the codec has already changed.

I'm going to update the PR to be ready for pushing, and if there are no hard objections, I'll look at pushing it toward the end of Community Over Code.

markrmiller avatar Sep 09 '25 04:09 markrmiller