[BUG] Cannot restore snapshots between patch versions of OpenSearch
Describe the bug
Hi, I have a use case where we want to restore the snapshot of OpenSearch between 2 patch version of OpenSearch. But constantly getting this error:
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[<SnapshotName>] the snapshot was created with OpenSearch version [1.3.11] which is higher than the version of this node [1.3.2]"}],"type":"snapshot_restore_exception","reason":"[<SnapshotName>] the snapshot was created with OpenSearch version [1.3.11] which is higher than the version of this node [1.3.2]"},"status":500}
Related component
Storage:Snapshots
To Reproduce
- Create an index in OpenSearch version 1.3.11
- Take a snapshot and then restore the snapshot on a cluster with 1.3.2 version.
Expected behavior
The expectation is snapshot should just work out of the box since the version is the patch version
Additional Details
Code reference from where exception is coming.
https://github.com/opensearch-project/OpenSearch/blob/6966267722419e7a478305313c0ddaafd37213ee/server/src/main/java/org/opensearch/snapshots/RestoreService.java#L1322-L1332
As per my understanding, this is the expected behavior. The snapshots can be forward restored upto 1 major version, but not in the backward version.
@jainankitk but this is between patch versions. If you think there is no diff between 1.3.2 vs 1.3.11 in terms of Lucene and files comp-abilities. So, in my true sense this logic/check is being ignored and may be its time to fix this.
imo this is feature/enhancement, not a bug. Also, the validation becomes complex as you potentially need to store mapping across which versions there was some codec change. Maybe there are few places in Lucene as well, which won't allow doing something like this. Also, an older patch version, is probably not even aware of newer version. This is already starting to seem like a bad idea, unless I am missing something obvious. @msfroh - what do you think?
imo this is feature/enhancement, not a bug.
I think this is still a bug. If we are thinking this as a complex solution then we are not versioning things correctly. A simple way to think is in patch version there should not be any new feature/Lucene upgrade. It should purely be a security patch release.
Also, the validation becomes complex as you potentially need to store mapping across which versions there was some codec change. Maybe there are few places in Lucene as well, which won't allow doing something like this.
In a patch version release this should not happen.
This is already starting to seem like a bad idea, unless I am missing something obvious.
A missing part here is patch version release is supposed to be just a release of security fixes nothing else
I feel this is a defensive check and can be relaxed going forward. But yes this is more like an enhancement in my opinion.
Also, an older patch version, is probably not even aware of newer version.
I am still not clear on how to handle this part while restoring 1.3.11 snapshot on 1.3.2. Won't the older version run into some issues while building the SnapshotInfo for newer version?
SnapshotState snapshotState = state == null ? null : SnapshotState.valueOf(state);
Version version = this.version == -1 ? Version.CURRENT : Version.fromId(this.version);
Maybe there are ways around it like omit the patch version from the snapshotInfo, but not completely sure if that is a good idea. Might be good to look at some code changes and evaluate based on the PR?
Also, an older patch version, is probably not even aware of newer version.
I am still not clear on how to handle this part while restoring
1.3.11snapshot on1.3.2. Won't the older version run into some issues while building theSnapshotInfofor newer version?SnapshotState snapshotState = state == null ? null : SnapshotState.valueOf(state); Version version = this.version == -1 ? Version.CURRENT : Version.fromId(this.version);Maybe there are ways around it like omit the patch version from the snapshotInfo, but not completely sure if that is a good idea. Might be good to look at some code changes and evaluate based on the PR?
It might run into these problems, but a question to ask here is why would we want to build the patch version information in the snapshot or why we want to de-serialize this information while restoring the patch? One reason could be OpenSearch might not be following the correct practices for patch version and hence we need to check this which is highly unlikely :) . Another point could be, this is just the code we inherited from fork and didn't paid much attention which seems to be fine.
I feel we should revisit the snapshot restore logic and fix this. Atleast per my knowledge and look at the code I don't see a technical reason for not supporting this. @jainankitk if there are any please do let me know.
It might run into these problems, but a question to ask here is why would we want to build the patch version information in the snapshot or why we want to de-serialize this information while restoring the patch?
The patch version information is built into the snapshot as that is tied to the lucene segment version. As per my understanding, Lucene does not support this scenario (forward compatibility even across patch versions) and hence, it is not feasible (should not be done) in OpenSearch. Take patch versions of 2_19 for example, where lucene patch version has changed with every release.
public static final Version V_2_19_2 = new Version(2190299, org.apache.lucene.util.Version.LUCENE_9_12_1);
public static final Version V_2_19_3 = new Version(2190399, org.apache.lucene.util.Version.LUCENE_9_12_2);
public static final Version V_2_19_4 = new Version(2190499, org.apache.lucene.util.Version.LUCENE_9_12_3);
Now, we could add some checks in OpenSearch to support snapshot restore if Lucene version is same across 2 patch versions, but then again, the older version needs to know about the newer version for that to happen. Maybe it can go through the segment infos to validate the Lucene versions are same?
The patch version information is built into the snapshot as that is tied to the lucene segment version. Take patch versions of 2_19 for example, where lucene patch version has changed with every release.
If underlying Lucene has strong check on version and the patch versions are different, it does make things harder and we need to understand Lucene's rationale better (as storage format I can understand, but would be good to understand concrete scenarios).
I am still not clear on how to handle this part while restoring 1.3.11 snapshot on 1.3.2. Won't the older version run into some issues while building the SnapshotInfo for newer version?
Now, we could add some checks in OpenSearch to support snapshot restore if Lucene version is same across 2 patch versions, but then again, the older version needs to know about the newer version for that to happen. Maybe it can go through the segment infos to validate the Lucene versions are same?
I think it does surpass that as the validation happens later
In this case the question becomes whether is it safe to relax this check if Lucene version is same? I am not sure about the surface area in OpenSearch that need to relax this though as there could be other places where there is hard check.
private static void validateSnapshotRestorable(final String repository, final SnapshotInfo snapshotInfo) {
if (!snapshotInfo.state().restorable()) {
throw new SnapshotRestoreException(
new Snapshot(repository, snapshotInfo.snapshotId()),
"unsupported snapshot state [" + snapshotInfo.state() + "]"
);
}
if (Version.CURRENT.before(snapshotInfo.version())) {
throw new SnapshotRestoreException(
new Snapshot(repository, snapshotInfo.snapshotId()),
"the snapshot was created with OpenSearch version ["
+ snapshotInfo.version()
+ "] which is higher than the version of this node ["
+ Version.CURRENT
+ "]"
);
}
}
I think this is still a bug. If we are thinking this as a complex solution then we are not versioning things correctly. A simple way to think is in patch version there should not be any new feature/Lucene upgrade. It should purely be a security patch release.
@navneet1v One problem is enforcement - we need a way to enforce compatibility and contract not being broken and should cover all exposed interfaces. For example can a bug fix imply it introduces a new field which is backward compatible, but something not readable by older version. We need enforceable checks to ensure this doesn't happen in patch version.
It might run into these problems, but a question to ask here is why would we want to build the patch version information in the snapshot or why we want to de-serialize this information while restoring the patch?
The patch version information is built into the snapshot as that is tied to the lucene segment version. As per my understanding, Lucene does not support this scenario (forward compatibility even across patch versions) and hence, it is not feasible (should not be done) in OpenSearch. Take patch versions of 2_19 for example, where lucene patch version has changed with every release.
public static final Version V_2_19_2 = new Version(2190299, org.apache.lucene.util.Version.LUCENE_9_12_1); public static final Version V_2_19_3 = new Version(2190399, org.apache.lucene.util.Version.LUCENE_9_12_2); public static final Version V_2_19_4 = new Version(2190499, org.apache.lucene.util.Version.LUCENE_9_12_3);Now, we could add some checks in OpenSearch to support snapshot restore if Lucene version is same across 2 patch versions, but then again, the older version needs to know about the newer version for that to happen. Maybe it can go through the segment infos to validate the Lucene versions are same?
@jainankitk I think one thing which we are missing here is a patch version is supposed to be compatible. We changing the lucene version in patch version feels to me a wrong thing to do. Also, if Lucene in its patch version is not compatible then something clearly is missing and we should work with Lucene maintainers to understand why is that the case. Because a patch is supposed to be a bug fix and should be compatible with other patch version on same minor version.
The patch version information is built into the snapshot as that is tied to the lucene segment version. Take patch versions of 2_19 for example, where lucene patch version has changed with every release.
If underlying Lucene has strong check on version and the patch versions are different, it does make things harder and we need to understand Lucene's rationale better (as storage format I can understand, but would be good to understand concrete scenarios).
+1. Mentioned same in my last comment.
@navneet1v One problem is enforcement - we need a way to enforce compatibility and contract not being broken and should cover all exposed interfaces. For example can a bug fix imply it introduces a new field which is backward compatible, but something not readable by older version. We need enforceable checks to ensure this doesn't happen in patch version.
I agree on this. For this special cases is what we have the BWC tests is for. We can enhance the BWC tests to ensure that for patch version it checks both forward and backward compatibility. Or we rely on maintainers to make sure that compatibility is maintained which is harder to do if its not automated.
BTW semver suggest we should not be making backward incompatible changes in patch version: https://semver.org/#spec-item-6 so adding a field to fix the bug seems very unreasonable to me
BTW semver suggest we should not be making backward incompatible changes in patch version: https://semver.org/#spec-item-6 so adding a field to fix the bug seems very unreasonable to me
I agree with viewpoint on not making such changes in patch version. However just from pure semver technical aspect, adding a new field is technically allowed as long as it defaults its value on absence and doesn't break upgrade. The problem here is adding new field isn't "forward" compatible i.e older version of Lucene can't read newer version segment with new fields.
This seems to be a very interesting discussion,
The only realistic safe approach is conditional restore like:
Allow backward restore only if:
-
The snapshot’s Lucene version == target’s Lucene version and
-
Snapshot metadata format is readable by the older version and
-
Additional guardrails/BWC tests cover “patch backward restore”
But even that has a catch the thread calls out: the older OpenSearch must be able to understand that the newer snapshot is “safe”. That likely requires parsing segment infos (or similar) instead of relying purely on OpenSearch version IDs.
it opens lots of Issues and It creates a maintenance trap like Every patch release must prove it’s backward-restorable to previous patches.