Backward compatibility of codec formats in minor releases
This issue has been filed to facilitate and capture discussion relating to backward compatibility, specifically around updates to codec formats in minor releases.
At Elastic we eagerly adopt and deploy Lucene releases, both major and minor, but here I'm mostly concerned with minor releases. Given our early adoption and deployment, we often discover serious bugs that require immediate fixing, e.g. most recently a number of IOOBEs in intoBitSet implementations. It's a fact of life that bugs will almost always creep through, and we accept that. What I would like to discuss is whether we can improve the upgradeability of minor releases somewhat.
A minor release can contain implementation changes, API changes to internal interfaces, new formats, etc. Commonly, bugs are found in implementation changes rather than in new formats. I would like to propose an upgrade model for minor releases that allows adopting the new minor release without using any new formats for a period of one minor release, after which the old format writer will be removed.
Why? Such a model would allow the upgrade to a new minor to be separate from adopting its format changes, for a limited period of time, with the format changes then applied independently of the upgrade. This means that if serious bugs are found in the implementation, it is possible to "roll back" to the prior minor until a bugfix release can be cut.
While I'm clearly biased here, I do think that this would be a generally good thing and would hopefully encourage and help others to adopt releases more eagerly too.
Specifically:
Currently, when a new format is added, the prior one is moved to org.apache.lucene.backward_codecs.xxx and its writer is removed (attempting to write throws UnsupportedOperationException). The writer only exists in the tests for backwards-codecs.
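To illustrate the current behaviour in isolation, here is a minimal sketch that does not use Lucene at all. The names `SketchFormat`, `SketchFormatV2` and `SketchFormatV1ReadOnly` are hypothetical and only model the idea of a format that can still be read but whose writer has been removed.

```java
/** Hypothetical stand-in for an index format; this is not a Lucene interface. */
interface SketchFormat {
    String name();
    String encode(String doc);    // the "writer" side
    String decode(String stored); // the "reader" side
}

/** The format the current release writes by default. */
final class SketchFormatV2 implements SketchFormat {
    public String name() { return "SketchV2"; }
    public String encode(String doc) { return "v2:" + doc; }
    public String decode(String stored) { return stored.substring("v2:".length()); }
}

/** The previous format after the move: still readable, no longer writable. */
final class SketchFormatV1ReadOnly implements SketchFormat {
    public String name() { return "SketchV1"; }
    public String encode(String doc) {
        // Models the UOE behaviour: the writer is gone from the release.
        throw new UnsupportedOperationException("SketchV1 is read-only in this release");
    }
    public String decode(String stored) { return stored.substring("v1:".length()); }
}

final class ReadOnlyFormatDemo {
    public static void main(String[] args) {
        SketchFormat old = new SketchFormatV1ReadOnly();
        // Data written by the previous release stays readable.
        System.out.println(old.decode("v1:hello"));
        try {
            old.encode("hello");
        } catch (UnsupportedOperationException e) {
            System.out.println("cannot write " + old.name() + ": " + e.getMessage());
        }
    }
}
```

In this model, anything new can only be written with `SketchFormatV2`, which is exactly what prevents going back to a release that only knows `SketchV1`.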
- One concrete proposal is to continue to move the codec to org.apache.lucene.backward_codecs.xxx, but retain the writer for one minor release, before it is moved to test only.
- An alternative would be for the consumer to copy the codec format. However, for this to work we'd need a reliable way to "override" the codec so that the copy (with the writer) would be reliably found. This is currently not possible if deployed with Java modules (see https://github.com/apache/lucene/pull/14275).
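As a rough sketch of what a reliable "override" could mean, consider a name-keyed registry in which an entry registered by the consumer always wins over a built-in entry of the same name. This is not how Lucene resolves codecs today; `NamedFormat` and `OverridableFormatRegistry` are hypothetical names used only to show the desired precedence.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical named format that only records whether it ships a writer. */
final class NamedFormat {
    final String name;
    final boolean writable;

    NamedFormat(String name, boolean writable) {
        this.name = name;
        this.writable = writable;
    }
}

/** Hypothetical registry: consumer overrides win over built-in entries of the same name. */
final class OverridableFormatRegistry {
    private final Map<String, NamedFormat> builtIn = new HashMap<>();
    private final Map<String, NamedFormat> overrides = new HashMap<>();

    void registerBuiltIn(NamedFormat format) { builtIn.put(format.name, format); }

    void registerOverride(NamedFormat format) { overrides.put(format.name, format); }

    /** Looks up a format by name, preferring an explicit override if one exists. */
    NamedFormat forName(String name) {
        NamedFormat format = overrides.getOrDefault(name, builtIn.get(name));
        if (format == null) {
            throw new IllegalArgumentException("no format named " + name);
        }
        return format;
    }
}

final class OverrideDemo {
    public static void main(String[] args) {
        OverridableFormatRegistry registry = new OverridableFormatRegistry();
        // The library ships the old format as read-only.
        registry.registerBuiltIn(new NamedFormat("SketchV1", false));
        // The consumer registers its own copy that still has the writer.
        registry.registerOverride(new NamedFormat("SketchV1", true));
        System.out.println("writable: " + registry.forName("SketchV1").writable);
    }
}
```

The point of the sketch is the explicit precedence rule: the lookup result does not depend on discovery order, which is the property the consumer would need.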
I think increasing the back compat burden should be the last resort. The burden can easily hamstring the entire project: allowing a lucene version to write multiple index formats makes this much more complex and difficult to reason about.
Can we try to look at alternatives that address the root cause of underlying concerns via test improvements or otherwise? I think it would be a better approach.
> I think increasing the back compat burden should be the last resort. The burden can easily hamstring the entire project: allowing a lucene version to write multiple index formats makes this much more complex and difficult to reason about.
Generally, I agree. That is why I suggested retaining the format, likely deprecated, for just a single release. Though I do see your point and the risks that it could bring.
> Can we try to look at alternatives that address the root cause of underlying concerns via test improvements or otherwise? I think it would be a better approach.
Absolutely, it's awful that such bugs are creeping through the Lucene development and release process. I do think that we need to do better to catch these things before the Lucene release. That said, bugs will always find a way of escaping. With improved testing, hopefully fewer of them, and less severe ones, will escape.
Ok, I'm somewhat happy to withdraw the proposal for Lucene to carry the backward compatibility burden. That is maybe a step too far to solve a problem that only we seem to be having (since we actually use the most recent release). I'm happy for the consumer of Lucene to make the decision to carry the burden itself, if it wants to (proposal no.2 in the description). What we're missing now is the ability to do so, which I'm happy to work on solving.
disclaimer: i'm not fully up to speed on the DocIdSetIterator.intoBitSet addition that motivated this discussion, but maybe one thought is that it was backported too soon?
I'm not trying to suggest that minor releases should become "super-boring" and only have bugfixes, but... when it is an API change to such a central class like DocIdSetIterator with large impact and risk potential, baking in main might help.
I've had my own frustrations with this approach: pushing scary stuff just to main, having it bake for years, and then still not really grab attention until right at release-time, but I think this is something we could improve.
It is just a high-level idea: if the goal is to make minor releases more stable, then being selective about what is backported can really help. I still think some things are missing; it isn't a magic bullet to find all bugs if we just "delay" the feature in main. Sometimes we get lucky and tests catch stuff, but it is best if people try it out and find the bugs in it.
But I think it is more "expected" for end-users to find these kinds of "day 1" bugs in a major release versus in a minor release. It is not cool when they happen in a minor release, so thinking twice about what we backport can help.
I read your comments and I am also nervous about extending the compatibility that Lucene currently offers. I totally get the point around backporting after baking time in main, and I wonder if and how we would have caught this specific bug, for what it's worth, had we waited before backporting the change that caused it. I don't know the answer to this question :)
Taking a step back, I have some high-level thoughts, and I am playing a bit of devil's advocate, so bear with me here: we at Elastic ship Lucene releases to production in a matter of days after they are out. We did that with Lucene 10.0, and we do that with every minor release. I love that we do that, and I hope that more users do so over time. Whenever we find a bug after the roll-out, it is extremely painful to investigate it, find a work-around, and fix it upstream under time pressure, while users experience potential disruptions. I would hate it if we ended up holding off the roll-out to prod due to fear of finding bugs and to avoid disruptions. Also, that would not help much currently, as the roll-out to prod is exactly what helps us find bugs :) We can certainly broaden testing etc., but we all know certain things come up only in production.
The time pressure derives from the inability to downgrade to the previous Lucene minor. If we were able to do that, we would do so, and then have time to investigate and do all that's needed, without the time pressure. Like Chris said, I would not consider a downgrade across major releases, but I wonder if there's some option to consider to move in that direction over time. I like the idea of isolating/containing format changes, for instance. And again, I totally see the compatibility burden, I am torn as well, but I think it's an important problem, and tackling it somehow may even help users upgrade faster if they know they have a way out potentially? Would that not be in the interest of the project (although to be weighed against the cons of the compatibility burden)?
Thanks @javanna - that captures the dilemma very well, and I really appreciate the devil’s advocate angle. Your summary gets to the heart of the issue: early adoption is good for Lucene, but it’s high-stakes for downstream users without a safe rollback path. That tension is what we’re trying to solve - not by expecting Lucene to support multiple formats indefinitely, but by introducing just enough flexibility to let early adopters help without risking production stability.
You’re right that some bugs will only surface in production, and we’ve seen time and again that early rollouts help discover those faster, which benefits the entire ecosystem. But for that to be sustainable, we need tools or policies that give us room to react. Today, a bad bug means we are likely to be stuck: we can’t downgrade the code because the index format changed, and we can’t write the old format because the writer was removed.
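To make that constraint concrete, here is a minimal sketch of the rollback condition, with no Lucene dependency. It assumes an index is simply a list of per-segment format names and a release is the set of format names it can read; `RollbackCheck` and the names `FormatA` and `FormatB` are hypothetical.

```java
import java.util.List;
import java.util.Set;

final class RollbackCheck {
    /**
     * An older release can open an index only if every segment was written
     * in a format that release knows how to read.
     */
    static boolean canOpenWith(Set<String> readableFormats, List<String> segmentFormats) {
        return readableFormats.containsAll(segmentFormats);
    }

    public static void main(String[] args) {
        // Hypothetical: the previous minor reads only FormatA.
        Set<String> previousMinor = Set.of("FormatA");

        // Before the upgrade, all segments use FormatA.
        List<String> beforeUpgrade = List.of("FormatA", "FormatA");
        // After the upgrade, one new segment was flushed in FormatB.
        List<String> afterUpgrade = List.of("FormatA", "FormatA", "FormatB");

        System.out.println("rollback before upgrade: " + canOpenWith(previousMinor, beforeUpgrade));
        System.out.println("rollback after upgrade:  " + canOpenWith(previousMinor, afterUpgrade));
    }
}
```

Under these assumptions, a single segment written in the new format is enough to make the check fail, which is why being able to keep writing the old format for a while is what would preserve the way back.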
This isn't about freezing minor releases - it's about isolating risk so adopters can iterate safely and fix forward or roll back with confidence.
I really like the idea of looking at this as a way to encourage faster adoption. If users know they can upgrade quickly and safely - with a rollback path - I think we’ll see more adoption, not less.
I recognize that my initial proposals might not be the ideal solutions, and I'm not strongly advocating for either one in particular. There may be other viable alternatives worth exploring — assuming, of course, that supporting this kind of scenario is something we agree is worth addressing in Lucene.
If we had a release policy that required format changes to be released in isolation, would that have helped in this case? If we did that, you could at least revert other (non format-impacting) upgrades. I'm not sure how we could do that in practice, just wondering if it is even worth thinking about: would it help? Format changes would still be forward-only.
I don't know of any software that handles file formats as proposed. I don't think it is the best solution; it just adds complexity.
Along the same lines of "be careful what we backport", consider not changing file formats in minor releases as another possible solution. This is how postgres does it: https://www.postgresql.org/docs/17/upgrading.html
Overall I would say that the Lucene project benefits a lot from these new formats and the plethora of bug fixes and improvements in minor releases - I do not want to change that in any way, since it will only slow things down. I like our current pace of development, innovation, and delivery.
Carrying old formats can be done by the user if they really want to. For that to work, we need some solution that allows the user to do that. I'll summarise this discussion and add a note in #14275.
I agree with you, I don't want to slow anything down either.
But if we were to look at say, the last 10 file format changes as a sample for analysis...
How many of these format changes could have just stayed in main branch? What features would this have blocked from being backported to minor releases?
I guess I'm throwing out the idea that keeping the format changes in main might not slow us down.