DAOS-17300: RPC Protocol Versioning Enhancement for Rolling Upgrades
-
Added a generic daos_version_t type to encapsulate the DAOS RPM version and the system RPC protocol version.
-
The server's runtime daos_version_t is now returned in @daos_req_comm_out for future extensions.
-
Previously, servers assumed that every server ran an identical DAOS version. This change introduces:
- Runtime DAOS version tracking in each xstream
- A version-to-RPC-protocol mapping maintained by each DAOS module
- Modules use the current system runtime protocol to determine the correct RPC protocol
- Only a single mapping entry is supported for now (this will expand to two entries for rolling upgrades)
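
The per-module mapping listed above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code from this PR: `struct sketch_version` is a hypothetical stand-in for daos_version_t (its real layout may differ), `struct sketch_proto_map` and `sketch_proto_lookup()` are illustrative names, and the lookup is assumed to be keyed on the system's runtime version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for daos_version_t; the real layout may differ. */
struct sketch_version {
	uint32_t major;
	uint32_t minor;
};

/* One entry of a module's version-to-RPC-protocol mapping (illustrative). */
struct sketch_proto_map {
	struct sketch_version min_ver;   /* first DAOS version using this protocol */
	uint32_t              rpc_proto; /* RPC protocol version for the module */
};

/* Return <0, 0 or >0 if a is older than, equal to, or newer than b. */
static int
sketch_version_cmp(struct sketch_version a, struct sketch_version b)
{
	if (a.major != b.major)
		return a.major < b.major ? -1 : 1;
	if (a.minor != b.minor)
		return a.minor < b.minor ? -1 : 1;
	return 0;
}

/*
 * Pick the RPC protocol for the given system runtime version: the entry
 * with the newest min_ver that is not newer than the runtime version.
 * Returns 0 on success, -1 if no entry applies.
 */
static int
sketch_proto_lookup(const struct sketch_proto_map *map, size_t nr,
		    struct sketch_version runtime, uint32_t *proto)
{
	const struct sketch_proto_map *best = NULL;
	size_t                         i;

	assert(map != NULL && proto != NULL);
	for (i = 0; i < nr; i++) {
		if (sketch_version_cmp(map[i].min_ver, runtime) > 0)
			continue; /* entry requires a newer system version */
		if (best == NULL ||
		    sketch_version_cmp(map[i].min_ver, best->min_ver) > 0)
			best = &map[i];
	}
	if (best == NULL)
		return -1;
	*proto = best->rpc_proto;
	return 0;
}
```

With a single entry the lookup always returns that entry's protocol for any supported runtime version; a second entry would let a module keep speaking the older protocol until the system runtime version is raised.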
-
Refactored the code to separate the RPC protocol selection logic for client and server.
-
Implemented IV version awareness by adding a version field to the ds_iv_key structure. The cache lookup now uses both the key and the version number to uniquely identify entries, so multiple cache versions of the same class can coexist during rolling upgrades.
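
As a rough illustration of the version-aware lookup described above, the sketch below matches cache entries on both the class and the version. It is an assumption-laden simplification: `struct sketch_iv_key` is a hypothetical, reduced stand-in for ds_iv_key (the real structure has other fields), and the linear array cache and helper names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for ds_iv_key; real fields differ. */
struct sketch_iv_key {
	uint32_t class_id; /* IV class the entry belongs to */
	uint32_t version;  /* version field, as added by this change */
};

/* Illustrative cache entry: a key plus an opaque payload. */
struct sketch_iv_entry {
	struct sketch_iv_key key;
	int                  value;
};

/* Two keys match only if both the class and the version agree. */
static bool
sketch_iv_key_match(const struct sketch_iv_key *a,
		    const struct sketch_iv_key *b)
{
	return a->class_id == b->class_id && a->version == b->version;
}

/* Return the entry matching key, or NULL if it is not cached. */
static struct sketch_iv_entry *
sketch_iv_lookup(struct sketch_iv_entry *cache, size_t nr,
		 const struct sketch_iv_key *key)
{
	size_t i;

	assert(key != NULL && (cache != NULL || nr == 0));
	for (i = 0; i < nr; i++) {
		if (sketch_iv_key_match(&cache[i].key, key))
			return &cache[i];
	}
	return NULL;
}
```

Because the version participates in the match, two entries of the same class with different versions resolve to different cache slots, which is the behavior mixed-version servers would need.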
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Errors are: component not formatted correctly, ticket number suffix is not a number (see https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments), unable to load ticket data (https://daosio.atlassian.net/browse/DAOS-17300).
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16396/2/testReport/
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16396/2/execution/node/1341/log
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16396/2/display/redirect
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16396/3/display/redirect
Test stage NLT on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16396/9/execution/node/860/log
Sorry, but there is a conflict again...
Test stage NLT on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16396/10/execution/node/860/log
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16396/11/testReport/
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16396/12/execution/node/1269/log
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16396/13/testReport/
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16396/14/execution/node/1268/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16396/15/execution/node/457/log
With regard to the previous comment, it's not blocking; please address it if you have to re-push. Is there a ticket you can reference/link in https://daosio.atlassian.net/browse/DAOS-17300 that covers the Go changes required to use the new DaosVersion dRPC field?
I thought that when we implement the new dmg rolling upgrade command, the command will store DaosVersion in the MS DB (as we discussed before) and this code will be adjusted accordingly. So for now it is more of an RPC protocol change, and there is no ticket for that yet.
I think you should create the ticket then, as part of the epic, link it to the SRS, and then update the comment to reference it and track the work.
Sure, will do this.
@tanabarr I created the ticket here: https://daosio.atlassian.net/browse/DAOS-17896
Just a thought: Looking at this PR, I actually think including the DAOS version in the protobuf struct may be superfluous. All communications at the control plane level include the DAOS version, including Join operations--we already use this for interop detection, so maybe there is a way to use what we already have, if we need to persist that information to track rolling upgrade.
If this is too painful maybe putting it in the JoinReq is still the right approach.
I just happened to notice this in my feed... For the record all of the control plane RPCs are already automatically versioned: https://github.com/daos-stack/daos/blob/master/src/control/lib/control/interceptors.go#L80
There is also an interoperability framework for control plane components here: https://github.com/daos-stack/daos/blob/master/src/control/build/interop.go#L151
So, as Kris said, it is not necessary to add the version to protobuf messages if the goal is to define interoperability constraints between control plane components (dmg, daos_server, daos_agent). However, if there is a possibility that an engine could be a different version than its local daos_server process (this is currently not supported), then it may be useful to allow the engine to provide its own version via the upcall dRPC messages.
@mjmac Currently we assume that the DAOS engines and daos_server always run the same version across different servers. The goal of this PR is to allow engines to run different versions: during the rolling upgrade process, the system will contain mixed engine versions (and mixed control plane versions too). The current plan is to store the rolling upgrade state in the MS (including the running version); when an engine tries to join the system, it gets that version.
@kjacque We need to check whether there is a better way; we could merge the logic here.
By the way, here is the design of rolling upgrade: https://daosio.atlassian.net/wiki/spaces/DC/pages/12289900560/DAOS+Rolling+Upgrade
Currently, the control plane context carries the DaosVersionHeader (DAOS build version) which is used for interoperability checking in unaryVersionInterceptor(). While this serves as a basic safety check, it proves insufficient for rolling upgrade scenarios.
Consider a system with Component A (v2.8.0) and Component B (v3.0.0). For backward compatibility, Component B must communicate with Component A using protocols supported by the older version. This implies that when protocol changes exist between different versions, the system requires a negotiation mechanism between components.
The current control plane implementation only performs basic sanity checks for interoperability. Proper protocol change support remains unimplemented. This PR introduces negotiation capability between different engine versions. Similar functionality should be implemented on the control plane side to fully support rolling upgrades.