assisted-service
OCPBUGS-30233: Filter IPs in majority group validation
https://issues.redhat.com/browse/OCPBUGS-30233 When creating connectivity groups, filter each host's IP addresses down to those that fall within the machine network CIDRs (if any are defined) so that the "belongs-to-majority" validation does not fail spuriously. Previously, when the IPs were not filtered, a host with additional addresses on NICs that could not reach the other hosts would fail the "belongs-to-majority" validation even though those addresses were never used.
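As a rough illustration of the described approach, here is a minimal Go sketch (hypothetical helper `filterIPsByMachineNetworks`, not the actual PR diff) that keeps only the host addresses falling inside the configured machine-network CIDRs:

```go
// Hypothetical helper, not the actual assisted-service code: keep only
// the host IP addresses that fall inside the configured machine-network
// CIDRs before building connectivity groups.
package main

import (
	"fmt"
	"net"
)

func filterIPsByMachineNetworks(ips []string, machineCIDRs []string) []string {
	var nets []*net.IPNet
	for _, cidr := range machineCIDRs {
		if _, n, err := net.ParseCIDR(cidr); err == nil {
			nets = append(nets, n)
		}
	}
	var kept []string
	for _, ipStr := range ips {
		ip := net.ParseIP(ipStr)
		if ip == nil {
			continue
		}
		for _, n := range nets {
			if n.Contains(ip) {
				kept = append(kept, ipStr)
				break
			}
		}
	}
	return kept
}

func main() {
	// Second NIC is outside the machine network and unreachable from other hosts.
	ips := []string{"192.168.127.10", "192.168.145.10"}
	fmt.Println(filterIPsByMachineNetworks(ips, []string{"192.168.127.0/24"}))
	// [192.168.127.10]
}
```

With the address on the unused NIC dropped, the connectivity groups are built only from addresses that are expected to reach the other hosts over the machine network.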
List all the issues related to this PR
- [ ] New Feature
- [x] Enhancement
- [ ] Bug fix
- [ ] Tests
- [ ] Documentation
- [ ] CI/CD
What environments does this code impact?
- [ ] Automation (CI, tools, etc)
- [x] Cloud
- [x] Operator Managed Deployments
- [ ] None
How was this code tested?
- [ ] assisted-test-infra environment
- [ ] dev-scripts environment
- [x] Reviewer's test appreciated: My test environment isn't capable of installing more than a SNO
- [x] Waiting for CI to do a full test run
- [ ] Manual (Elaborate on how it was tested)
- [ ] No tests needed
Checklist
- [x] Title and description added to both commit and PR.
- [x] Relevant issues have been associated (see CONTRIBUTING guide)
- [x] This change does not require a documentation update (docstring, docs, README, etc)
- [x] Does this change include unit-tests (note that code changes require unit-tests)
Reviewers Checklist
- Are the title and description (in both PR and commit) meaningful and clear?
- Is there a bug required (and linked) for this change?
- Should this PR be backported?
/cc @zaneb
@CrystalChun: This pull request references Jira Issue OCPBUGS-30233, which is invalid:
- expected the bug to target the "4.16.0" version, but no target version was set
Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
In response to this:
https://issues.redhat.com/browse/OCPBUGS-30233 For the majority group validation, use only the IP addresses of hosts that fall within the primary machine network CIDR, to prevent validation failures caused by IPs that aren't required.
List all the issues related to this PR
- [ ] New Feature
- [ ] Enhancement
- [ ] Bug fix
- [ ] Tests
- [ ] Documentation
- [ ] CI/CD
What environments does this code impact?
- [ ] Automation (CI, tools, etc)
- [ ] Cloud
- [ ] Operator Managed Deployments
- [x] None
How was this code tested?
- [ ] assisted-test-infra environment
- [ ] dev-scripts environment
- [x] Reviewer's test appreciated: My test environment isn't capable of installing more than a SNO
- [ ] Waiting for CI to do a full test run
- [ ] Manual (Elaborate on how it was tested)
- [ ] No tests needed
Checklist
- [x] Title and description added to both commit and PR.
- [x] Relevant issues have been associated (see CONTRIBUTING guide)
- [x] This change does not require a documentation update (docstring, docs, README, etc)
- [ ] Does this change include unit-tests (note that code changes require unit-tests)
Reviewers Checklist
- Are the title and description (in both PR and commit) meaningful and clear?
- Is there a bug required (and linked) for this change?
- Should this PR be backported?
/cc @zaneb
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: CrystalChun
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [CrystalChun]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@zaneb would this change cover the issue? If not, may I request your advice?
would this change cover the issue?
I have difficulty following the connectivity groups code, but it seems plausible. If we ignore the extra IP addresses at the point where we're reading the inventory then it should work as if those interfaces did not exist, which is what I think we want.
This change is incorrect since we don't have machine-networks in multi-node UMN, so there is no reason to limit connectivity to the machine-networks. In general, the concept is that connectivity checks all possible connections. The belongs-to-majority-group check verifies whether there are possible connections between hosts. @zaneb are you sure that the requested change does not allow potential failures to occur?
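To make the terms in this exchange concrete, here is a toy Go model of the belongs-to-majority-group idea (a simplification for illustration, not the assisted-service implementation): the majority group is taken as the largest set of hosts whose pairwise connectivity checks all succeeded, and a host passes only if it belongs to that set.

```go
// Toy model, not the assisted-service implementation: treat the majority
// group as the largest set of hosts in which every pair has a successful
// connectivity check, and validate a host by membership in that set.
package main

import "fmt"

func majorityGroup(hosts []string, connected map[[2]string]bool) []string {
	reach := func(a, b string) bool {
		return connected[[2]string{a, b}] || connected[[2]string{b, a}]
	}
	var best []string
	// Brute force over all subsets; fine for the handful of hosts in a cluster.
	for mask := 1; mask < 1<<len(hosts); mask++ {
		var group []string
		for i, h := range hosts {
			if mask&(1<<i) != 0 {
				group = append(group, h)
			}
		}
		fullMesh := true
		for i := 0; i < len(group) && fullMesh; i++ {
			for j := i + 1; j < len(group); j++ {
				if !reach(group[i], group[j]) {
					fullMesh = false
					break
				}
			}
		}
		if fullMesh && len(group) > len(best) {
			best = group
		}
	}
	return best
}

func main() {
	hosts := []string{"h1", "h2", "h3"}
	conn := map[[2]string]bool{
		{"h1", "h2"}: true,
		{"h2", "h3"}: true,
		// h1 <-> h3 never succeeded, e.g. the check went over a NIC that
		// cannot reach the other host.
	}
	fmt.Println(majorityGroup(hosts, conn)) // [h1 h2]; h3 would fail belongs-to-majority-group
}
```

Under this reading, the PR's intent is to stop feeding addresses from outside the machine network into the connectivity checks, so their failed edges no longer push a host out of the group.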
/test?
@ori-amizur: The following commands are available to trigger required jobs:
- /test e2e-agent-compact-ipv4
- /test edge-assisted-operator-catalog-publish-verify
- /test edge-ci-index
- /test edge-e2e-ai-operator-ztp
- /test edge-e2e-ai-operator-ztp-sno-day2-workers
- /test edge-e2e-ai-operator-ztp-sno-day2-workers-late-binding
- /test edge-e2e-metal-assisted
- /test edge-e2e-metal-assisted-4-11
- /test edge-e2e-metal-assisted-4-12
- /test edge-e2e-metal-assisted-cnv
- /test edge-e2e-metal-assisted-lvm
- /test edge-e2e-metal-assisted-odf
- /test edge-images
- /test edge-lint
- /test edge-subsystem-aws
- /test edge-subsystem-kubeapi-aws
- /test edge-unit-test
- /test edge-verify-generated-code
- /test images
- /test mce-images
The following commands are available to trigger optional jobs:
- /test e2e-agent-ha-dualstack
- /test e2e-agent-sno-ipv6
- /test edge-e2e-ai-operator-disconnected-capi
- /test edge-e2e-ai-operator-ztp-3masters
- /test edge-e2e-ai-operator-ztp-capi
- /test edge-e2e-ai-operator-ztp-compact-day2-masters
- /test edge-e2e-ai-operator-ztp-compact-day2-workers
- /test edge-e2e-ai-operator-ztp-disconnected
- /test edge-e2e-ai-operator-ztp-hypershift-zero-nodes
- /test edge-e2e-ai-operator-ztp-multiarch-3masters-ocp
- /test edge-e2e-ai-operator-ztp-multiarch-sno-ocp
- /test edge-e2e-ai-operator-ztp-node-labels
- /test edge-e2e-ai-operator-ztp-sno-day2-masters
- /test edge-e2e-ai-operator-ztp-sno-day2-workers-ignitionoverride
- /test edge-e2e-metal-assisted-4-13
- /test edge-e2e-metal-assisted-4-14
- /test edge-e2e-metal-assisted-4-15
- /test edge-e2e-metal-assisted-bond
- /test edge-e2e-metal-assisted-bond-4-14
- /test edge-e2e-metal-assisted-day2
- /test edge-e2e-metal-assisted-day2-arm-workers
- /test edge-e2e-metal-assisted-day2-single-node
- /test edge-e2e-metal-assisted-external
- /test edge-e2e-metal-assisted-external-4-14
- /test edge-e2e-metal-assisted-ipv4v6
- /test edge-e2e-metal-assisted-ipv6
- /test edge-e2e-metal-assisted-kube-api-late-binding-single-node
- /test edge-e2e-metal-assisted-kube-api-late-unbinding-ipv4-single-node
- /test edge-e2e-metal-assisted-kube-api-net-suite
- /test edge-e2e-metal-assisted-mce-4-11
- /test edge-e2e-metal-assisted-mce-4-12
- /test edge-e2e-metal-assisted-mce-4-13
- /test edge-e2e-metal-assisted-mce-4-14
- /test edge-e2e-metal-assisted-mce-4-15
- /test edge-e2e-metal-assisted-mce-sno
- /test edge-e2e-metal-assisted-metallb
- /test edge-e2e-metal-assisted-none
- /test edge-e2e-metal-assisted-onprem
- /test edge-e2e-metal-assisted-single-node
- /test edge-e2e-metal-assisted-static-ip-suite
- /test edge-e2e-metal-assisted-static-ip-suite-4-14
- /test edge-e2e-metal-assisted-tang
- /test edge-e2e-metal-assisted-tpmv2
- /test edge-e2e-metal-assisted-upgrade-agent
- /test edge-e2e-nutanix-assisted
- /test edge-e2e-nutanix-assisted-2workers
- /test edge-e2e-nutanix-assisted-4-14
- /test edge-e2e-oci-assisted
- /test edge-e2e-oci-assisted-4-14
- /test edge-e2e-oci-assisted-iscsi
- /test edge-e2e-vsphere-assisted
- /test edge-e2e-vsphere-assisted-4-12
- /test edge-e2e-vsphere-assisted-4-13
- /test edge-e2e-vsphere-assisted-4-14
- /test edge-e2e-vsphere-assisted-umn
- /test push-pr-image
Use /test all to run the following jobs that were automatically triggered:
- pull-ci-openshift-assisted-service-master-e2e-agent-compact-ipv4
- pull-ci-openshift-assisted-service-master-edge-ci-index
- pull-ci-openshift-assisted-service-master-edge-e2e-ai-operator-ztp
- pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted
- pull-ci-openshift-assisted-service-master-edge-images
- pull-ci-openshift-assisted-service-master-edge-lint
- pull-ci-openshift-assisted-service-master-edge-subsystem-aws
- pull-ci-openshift-assisted-service-master-edge-subsystem-kubeapi-aws
- pull-ci-openshift-assisted-service-master-edge-unit-test
- pull-ci-openshift-assisted-service-master-edge-verify-generated-code
- pull-ci-openshift-assisted-service-master-images
- pull-ci-openshift-assisted-service-master-mce-images
In response to this:
/test?
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test edge-e2e-metal-assisted-none
/test edge-e2e-metal-assisted-none
/test edge-e2e-metal-assisted-none
It seems that the test passed, but the machine-networks list is empty, see: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_assisted-service/6094/pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted-none/1770287733328056320/artifacts/e2e-metal-assisted-none/assisted-common-gather/artifacts/2024-03-20_03-51-58_82a1c33a-f02c-4afb-81e0-838e61b5069e/metadata.json
So I wonder where the machine-networks came from? This needs to be checked.
It seems there is something wrong with the connectivity-groups.
The result of connectivity-groups is:
{"192.168.127.0/24":["6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3"],"192.168.145.0/24":[],"IPv4":["31b68c6d-cd3a-494c-8fb6-68784dda95ae","6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3","feeccca3-46e8-45c6-9ca0-fbde91638fb7"],"IPv6":["31b68c6d-cd3a-494c-8fb6-68784dda95ae","6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3","feeccca3-46e8-45c6-9ca0-fbde91638fb7"]}
So there is IPv6 connectivity but there are no IPv6 addresses.
/test edge-unit-test
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 68.36%. Comparing base (100a86f) to head (0b7cbc0). Report is 224 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #6094 +/- ##
==========================================
+ Coverage 68.35% 68.36% +0.01%
==========================================
Files 239 239
Lines 35524 35584 +60
==========================================
+ Hits 24283 24328 +45
- Misses 9132 9144 +12
- Partials 2109 2112 +3
Files with missing lines | Coverage Δ |
---|---|
internal/cluster/cluster.go | 66.42% <ø> (+0.29%) :arrow_up: |
internal/network/connectivity_groups.go | 87.32% <100.00%> (+0.04%) :arrow_up: |
internal/network/machine_network_cidr.go | 57.25% <100.00%> (+1.58%) :arrow_up: |
It seems there is something wrong with the connectivity-groups. The result of connectivity-groups is:
{"192.168.127.0/24":["6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3"],"192.168.145.0/24":[],"IPv4":["31b68c6d-cd3a-494c-8fb6-68784dda95ae","6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3","feeccca3-46e8-45c6-9ca0-fbde91638fb7"],"IPv6":["31b68c6d-cd3a-494c-8fb6-68784dda95ae","6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3","feeccca3-46e8-45c6-9ca0-fbde91638fb7"]}
So there is IPv6 connectivity but there are no IPv6 addresses.
Oh, I think it's because of the original condition here: https://github.com/openshift/assisted-service/pull/6094/files#diff-e0b0e8e6b4de6465aa429384eadda9ed02b2ceb1300e6b95a2f89b8a0858f658R442
Changed it so it no longer reports IPv6 connectivity when there are no IPv6 addresses.
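A rough sketch of that guard, with hypothetical names (not the actual diff): the IPv6 majority group is only reported when at least one host actually has an IPv6 address.

```go
// Hypothetical guard, not the actual diff: only report an IPv6 majority
// group when at least one host in the inventory has an IPv6 address.
package main

import (
	"fmt"
	"net"
)

// hasIPv6Address reports whether any host has at least one IPv6 address.
func hasIPv6Address(hostAddresses map[string][]string) bool {
	for _, addrs := range hostAddresses {
		for _, a := range addrs {
			if ip := net.ParseIP(a); ip != nil && ip.To4() == nil {
				return true
			}
		}
	}
	return false
}

func main() {
	hosts := map[string][]string{
		"h1": {"192.168.127.10"},
		"h2": {"192.168.127.11"},
	}
	groups := map[string][]string{
		"IPv4": {"h1", "h2"}, // pretend the IPv4 mesh succeeded
	}
	if hasIPv6Address(hosts) {
		groups["IPv6"] = []string{"h1", "h2"}
	}
	fmt.Println(groups) // map[IPv4:[h1 h2]] - no spurious IPv6 entry
}
```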
/test edge-unit-test
/test edge-e2e-metal-assisted-none
/test edge-e2e-metal-assisted-none
Looks like the connectivity groups are accurate now
"connectivity_majority_groups": "{\"192.168.127.0/24\":[\"7f4b73b0-14d4-4695-a185-2b62666ee175\",\"b1860fd5-3a8a-45c9-bcc5-8b41a7ed21be\",\"b8e6124e-ff3c-4958-86ae-f326607f56df\"],\"192.168.145.0/24\":[],\"IPv4\":[\"366d5130-0dd2-4543-b20b-9c518dbecc9e\",\"6593e35f-4ee2-41ad-bcb7-f88bf14bdd1b\",\"7f4b73b0-14d4-4695-a185-2b62666ee175\",\"b1860fd5-3a8a-45c9-bcc5-8b41a7ed21be\",\"b8e6124e-ff3c-4958-86ae-f326607f56df\"],\"IPv6\":[]}",
This change is incorrect because for multi-node UMN there are no machine-networks. Even if we add them and it works, we cannot force it because this change is not backward compatible. If you still want it, you need to either skip filtering when there are no machine-networks, or create a new factory for this and use it only when machine-networks are available; in that case it should not be used for UMN.
Modified the filter so it's only applied when there are machine networks! https://github.com/openshift/assisted-service/pull/6094/files#diff-e0b0e8e6b4de6465aa429384eadda9ed02b2ceb1300e6b95a2f89b8a0858f658R486
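A sketch of that compromise, again with hypothetical names rather than the real call site: the machine-network filter is applied only when machine networks are defined, so multi-node UMN clusters, which have none, keep the existing unfiltered behaviour.

```go
// Hypothetical wrapper, not the real call site: apply the machine-network
// filter only when machine networks are defined; otherwise keep all
// addresses so multi-node UMN behaviour is unchanged.
package main

import (
	"fmt"
	"net"
)

func addressesForConnectivity(ips, machineCIDRs []string) []string {
	if len(machineCIDRs) == 0 {
		// No machine networks (e.g. multi-node UMN): old behaviour, no filtering.
		return ips
	}
	var kept []string
	for _, ipStr := range ips {
		ip := net.ParseIP(ipStr)
		if ip == nil {
			continue
		}
		for _, cidr := range machineCIDRs {
			if _, n, err := net.ParseCIDR(cidr); err == nil && n.Contains(ip) {
				kept = append(kept, ipStr)
				break
			}
		}
	}
	return kept
}

func main() {
	ips := []string{"192.168.127.10", "192.168.145.10"}
	fmt.Println(addressesForConnectivity(ips, nil))                          // [192.168.127.10 192.168.145.10]
	fmt.Println(addressesForConnectivity(ips, []string{"192.168.127.0/24"})) // [192.168.127.10]
}
```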
/retest
@CrystalChun: all tests passed!
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
In general, the concept is that connectivity checks all possible connections. The belongs-to-majority-group check verifies whether there are possible connections between hosts.
Right now it actually checks that there are no impossible connections between hosts. That's a very high bar.
@zaneb are you sure that the requested change does not allow potential failures to occur?
Failures could occur if the actual routes set up don't result in traffic going over the network explicitly specified as the machine network, and also don't result in a successful connection over the networks the traffic is actually sent over. But as you said, in standard assisted installations the machine network is not provided anyway. So this change will only affect ABI.
It appears that when UMN is enabled (so no VIPs) the interface used for the Node IP is the one that has the default route. If we had the full routing information we could limit to that interface and not have to rely on the user input being correct. Do we?
It appears that when UMN is enabled (so no VIPs) the interface used for the Node IP is the one that has the default route. If we had the full routing information we could limit to that interface and not have to rely on the user input being correct. Do we?
I believe it needs to be tested
Right now it actually checks that there are no impossible connections between hosts. That's a very high bar.
It is all a matter of what OCP dictates.
Failures could occur if the actual routes set up don't result in traffic going over the network explicitly specified as the machine network, and also don't result in a successful connection over the networks the traffic is actually sent over.
There might be other reasons for connectivity failures. Besides, the edge cases need to be tested in order to verify that this change doesn't lead to installation failures.
But as you said, in standard assisted installations the machine network is not provided anyway. So this change will only affect ABI.
We currently do not prevent adding machine-networks in UMN, and we are not aware whether the invoker is ABI.
/hold This change doesn't actually behave correctly with UMN and multiple machine networks defined
Also, there's currently a validation that would fail this scenario (UMN, multiple machine nets): https://github.com/openshift/assisted-service/blob/17959954d606e1faac51baa251d5df3e090dd3f9/internal/cluster/validations/validations.go#L825-L828
No more than one machine network can be defined if the cluster is not dual-stack.
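For reference, an illustrative simplification of that rule (not the code at the linked lines):

```go
// Illustrative simplification, not the code at the linked lines: a
// cluster that is not dual-stack may define at most one machine network.
package main

import (
	"errors"
	"fmt"
)

func validateMachineNetworkCount(machineNetworks []string, isDualStack bool) error {
	if !isDualStack && len(machineNetworks) > 1 {
		return errors.New("single-stack clusters may define at most one machine network")
	}
	// Dual-stack handling (IPv4 plus IPv6 networks) is out of scope for this sketch.
	return nil
}

func main() {
	err := validateMachineNetworkCount([]string{"192.168.127.0/24", "192.168.145.0/24"}, false)
	fmt.Println(err) // single-stack clusters may define at most one machine network
}
```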
Right now it actually checks that there are no impossible connections between hosts. That's a very high bar.
It is all a matter of what OCP dictates.
If the nodes can talk to each other on their kubelet IPs then OCP will be happy.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale