assisted-service icon indicating copy to clipboard operation
assisted-service copied to clipboard

OCPBUGS-30233: Filter IPs in majority group validation

Open CrystalChun opened this issue 11 months ago • 31 comments

https://issues.redhat.com/browse/OCPBUGS-30233 Filter the IP addresses when creating connectivity groups to hosts that belong within the machine network CIDRs (if they exist) to prevent failing "belongs-to-majority" validation. Previously, if the IPs were not filtered, the validation would fail if hosts had IPs on different NICs that couldn't connect to other hosts. These IPs were not used and caused the "belongs-to-majority" validation to fail.

List all the issues related to this PR

  • [ ] New Feature
  • [x] Enhancement
  • [ ] Bug fix
  • [ ] Tests
  • [ ] Documentation
  • [ ] CI/CD

What environments does this code impact?

  • [ ] Automation (CI, tools, etc)
  • [x] Cloud
  • [x] Operator Managed Deployments
  • [ ] None

How was this code tested?

  • [ ] assisted-test-infra environment
  • [ ] dev-scripts environment
  • [x] Reviewer's test appreciated: My test environment isn't capable of installing more than a SNO
  • [x] Waiting for CI to do a full test run
  • [ ] Manual (Elaborate on how it was tested)
  • [ ] No tests needed

Checklist

  • [x] Title and description added to both, commit and PR.
  • [x] Relevant issues have been associated (see CONTRIBUTING guide)
  • [x] This change does not require a documentation update (docstring, docs, README, etc)
  • [x] Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

/cc @zaneb

CrystalChun avatar Mar 18 '24 21:03 CrystalChun

@CrystalChun: This pull request references Jira Issue OCPBUGS-30233, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

https://issues.redhat.com/browse/OCPBUGS-30233 For the majority group validation, use only filtered the IP addresses of hosts that belong within the primary machine network CIDR to prevent failing validations from IPs that aren't required.

List all the issues related to this PR

  • [ ] New Feature
  • [ ] Enhancement
  • [ ] Bug fix
  • [ ] Tests
  • [ ] Documentation
  • [ ] CI/CD

What environments does this code impact?

  • [ ] Automation (CI, tools, etc)
  • [ ] Cloud
  • [ ] Operator Managed Deployments
  • [x] None

How was this code tested?

  • [ ] assisted-test-infra environment
  • [ ] dev-scripts environment
  • [x] Reviewer's test appreciated: My test environment isn't capable of installing more than a SNO
  • [ ] Waiting for CI to do a full test run
  • [ ] Manual (Elaborate on how it was tested)
  • [ ] No tests needed

Checklist

  • [x] Title and description added to both, commit and PR.
  • [x] Relevant issues have been associated (see CONTRIBUTING guide)
  • [x] This change does not require a documentation update (docstring, docs, README, etc)
  • [ ] Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

/cc @zaneb

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Mar 18 '24 21:03 openshift-ci-robot

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

openshift-ci[bot] avatar Mar 18 '24 21:03 openshift-ci[bot]

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CrystalChun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Mar 18 '24 21:03 openshift-ci[bot]

@zaneb would this change cover the issue? If not, may I request your advice?

CrystalChun avatar Mar 18 '24 21:03 CrystalChun

would this change cover the issue?

I have difficulty following the connectivity groups code, but it seems plausible. If we ignore the extra IP addresses at the point where we're reading the inventory then it should work as if those interfaces did not exist, which is what I think we want.

zaneb avatar Mar 19 '24 01:03 zaneb

This change is incorrect since we don't have machine-networks in multi-node UMN. So there is no reason to limit the connectivity to machine-networks. In general the concept is that connectivity checks all possible connections. The belongs-to-majority-group checks if there is possible connections between hosts. @zaneb are you sure that the requested change does not allow potential failures to occur?

ori-amizur avatar Mar 19 '24 09:03 ori-amizur

/test?

ori-amizur avatar Mar 19 '24 10:03 ori-amizur

@ori-amizur: The following commands are available to trigger required jobs:

  • /test e2e-agent-compact-ipv4
  • /test edge-assisted-operator-catalog-publish-verify
  • /test edge-ci-index
  • /test edge-e2e-ai-operator-ztp
  • /test edge-e2e-ai-operator-ztp-sno-day2-workers
  • /test edge-e2e-ai-operator-ztp-sno-day2-workers-late-binding
  • /test edge-e2e-metal-assisted
  • /test edge-e2e-metal-assisted-4-11
  • /test edge-e2e-metal-assisted-4-12
  • /test edge-e2e-metal-assisted-cnv
  • /test edge-e2e-metal-assisted-lvm
  • /test edge-e2e-metal-assisted-odf
  • /test edge-images
  • /test edge-lint
  • /test edge-subsystem-aws
  • /test edge-subsystem-kubeapi-aws
  • /test edge-unit-test
  • /test edge-verify-generated-code
  • /test images
  • /test mce-images

The following commands are available to trigger optional jobs:

  • /test e2e-agent-ha-dualstack
  • /test e2e-agent-sno-ipv6
  • /test edge-e2e-ai-operator-disconnected-capi
  • /test edge-e2e-ai-operator-ztp-3masters
  • /test edge-e2e-ai-operator-ztp-capi
  • /test edge-e2e-ai-operator-ztp-compact-day2-masters
  • /test edge-e2e-ai-operator-ztp-compact-day2-workers
  • /test edge-e2e-ai-operator-ztp-disconnected
  • /test edge-e2e-ai-operator-ztp-hypershift-zero-nodes
  • /test edge-e2e-ai-operator-ztp-multiarch-3masters-ocp
  • /test edge-e2e-ai-operator-ztp-multiarch-sno-ocp
  • /test edge-e2e-ai-operator-ztp-node-labels
  • /test edge-e2e-ai-operator-ztp-sno-day2-masters
  • /test edge-e2e-ai-operator-ztp-sno-day2-workers-ignitionoverride
  • /test edge-e2e-metal-assisted-4-13
  • /test edge-e2e-metal-assisted-4-14
  • /test edge-e2e-metal-assisted-4-15
  • /test edge-e2e-metal-assisted-bond
  • /test edge-e2e-metal-assisted-bond-4-14
  • /test edge-e2e-metal-assisted-day2
  • /test edge-e2e-metal-assisted-day2-arm-workers
  • /test edge-e2e-metal-assisted-day2-single-node
  • /test edge-e2e-metal-assisted-external
  • /test edge-e2e-metal-assisted-external-4-14
  • /test edge-e2e-metal-assisted-ipv4v6
  • /test edge-e2e-metal-assisted-ipv6
  • /test edge-e2e-metal-assisted-kube-api-late-binding-single-node
  • /test edge-e2e-metal-assisted-kube-api-late-unbinding-ipv4-single-node
  • /test edge-e2e-metal-assisted-kube-api-net-suite
  • /test edge-e2e-metal-assisted-mce-4-11
  • /test edge-e2e-metal-assisted-mce-4-12
  • /test edge-e2e-metal-assisted-mce-4-13
  • /test edge-e2e-metal-assisted-mce-4-14
  • /test edge-e2e-metal-assisted-mce-4-15
  • /test edge-e2e-metal-assisted-mce-sno
  • /test edge-e2e-metal-assisted-metallb
  • /test edge-e2e-metal-assisted-none
  • /test edge-e2e-metal-assisted-onprem
  • /test edge-e2e-metal-assisted-single-node
  • /test edge-e2e-metal-assisted-static-ip-suite
  • /test edge-e2e-metal-assisted-static-ip-suite-4-14
  • /test edge-e2e-metal-assisted-tang
  • /test edge-e2e-metal-assisted-tpmv2
  • /test edge-e2e-metal-assisted-upgrade-agent
  • /test edge-e2e-nutanix-assisted
  • /test edge-e2e-nutanix-assisted-2workers
  • /test edge-e2e-nutanix-assisted-4-14
  • /test edge-e2e-oci-assisted
  • /test edge-e2e-oci-assisted-4-14
  • /test edge-e2e-oci-assisted-iscsi
  • /test edge-e2e-vsphere-assisted
  • /test edge-e2e-vsphere-assisted-4-12
  • /test edge-e2e-vsphere-assisted-4-13
  • /test edge-e2e-vsphere-assisted-4-14
  • /test edge-e2e-vsphere-assisted-umn
  • /test push-pr-image

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-assisted-service-master-e2e-agent-compact-ipv4
  • pull-ci-openshift-assisted-service-master-edge-ci-index
  • pull-ci-openshift-assisted-service-master-edge-e2e-ai-operator-ztp
  • pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted
  • pull-ci-openshift-assisted-service-master-edge-images
  • pull-ci-openshift-assisted-service-master-edge-lint
  • pull-ci-openshift-assisted-service-master-edge-subsystem-aws
  • pull-ci-openshift-assisted-service-master-edge-subsystem-kubeapi-aws
  • pull-ci-openshift-assisted-service-master-edge-unit-test
  • pull-ci-openshift-assisted-service-master-edge-verify-generated-code
  • pull-ci-openshift-assisted-service-master-images
  • pull-ci-openshift-assisted-service-master-mce-images

In response to this:

/test?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] avatar Mar 19 '24 10:03 openshift-ci[bot]

/test edge-e2e-metal-assisted-none

ori-amizur avatar Mar 19 '24 10:03 ori-amizur

/test edge-e2e-metal-assisted-none

CrystalChun avatar Mar 20 '24 03:03 CrystalChun

/test edge-e2e-metal-assisted-none

It seems that the test passed, But the machine-networks is empty: Here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_assisted-service/6094/pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted-none/1770287733328056320/artifacts/e2e-metal-assisted-none/assisted-common-gather/artifacts/2024-03-20_03-51-58_82a1c33a-f02c-4afb-81e0-838e61b5069e/metadata.json

So I wonder where did the machine-networks come from? Needs to be checked.

ori-amizur avatar Mar 20 '24 05:03 ori-amizur

It seems there is something wrong with the connectivity-groups. The result of connectivity-groups is: {"192.168.127.0/24":["6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3"],"192.168.145.0/24":[],"IPv4":["31b68c6d-cd3a-494c-8fb6-68784dda95ae","6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3","feeccca3-46e8-45c6-9ca0-fbde91638fb7"],"IPv6":["31b68c6d-cd3a-494c-8fb6-68784dda95ae","6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3","feeccca3-46e8-45c6-9ca0-fbde91638fb7"]}

So there is IPv6 connectivity but there are no IPv6 addresses.

ori-amizur avatar Mar 20 '24 07:03 ori-amizur

/test edge-unit-test

ori-amizur avatar Mar 20 '24 09:03 ori-amizur

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 68.36%. Comparing base (100a86f) to head (0b7cbc0). Report is 224 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6094      +/-   ##
==========================================
+ Coverage   68.35%   68.36%   +0.01%     
==========================================
  Files         239      239              
  Lines       35524    35584      +60     
==========================================
+ Hits        24283    24328      +45     
- Misses       9132     9144      +12     
- Partials     2109     2112       +3     
Files with missing lines Coverage Δ
internal/cluster/cluster.go 66.42% <ø> (+0.29%) :arrow_up:
internal/network/connectivity_groups.go 87.32% <100.00%> (+0.04%) :arrow_up:
internal/network/machine_network_cidr.go 57.25% <100.00%> (+1.58%) :arrow_up:

... and 13 files with indirect coverage changes

codecov[bot] avatar Mar 20 '24 09:03 codecov[bot]

It seems there is something wrong with the connectivity-groups. The result of connectivity-groups is: {"192.168.127.0/24":["6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3"],"192.168.145.0/24":[],"IPv4":["31b68c6d-cd3a-494c-8fb6-68784dda95ae","6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3","feeccca3-46e8-45c6-9ca0-fbde91638fb7"],"IPv6":["31b68c6d-cd3a-494c-8fb6-68784dda95ae","6043b820-6b7c-4b3b-a5c0-ca0f2b1a82dc","899a891c-ed6a-4e67-ba58-72588a95a9e0","c5895010-acda-4155-9b8d-7a4de74a82e3","feeccca3-46e8-45c6-9ca0-fbde91638fb7"]}

So there is IPv6 connectivity but there are no IPv6 addresses.

Oh I think it's because of the original condition here https://github.com/openshift/assisted-service/pull/6094/files#diff-e0b0e8e6b4de6465aa429384eadda9ed02b2ceb1300e6b95a2f89b8a0858f658R442 Changed it so now it shouldn't return if there are no ipv6 addresses

CrystalChun avatar Mar 21 '24 00:03 CrystalChun

/test edge-unit-test

CrystalChun avatar Mar 21 '24 00:03 CrystalChun

/test edge-e2e-metal-assisted-none

CrystalChun avatar Mar 21 '24 00:03 CrystalChun

/test edge-e2e-metal-assisted-none

CrystalChun avatar Mar 21 '24 15:03 CrystalChun

Looks like the connectivity groups are accurate now "connectivity_majority_groups": "{\"192.168.127.0/24\":[\"7f4b73b0-14d4-4695-a185-2b62666ee175\",\"b1860fd5-3a8a-45c9-bcc5-8b41a7ed21be\",\"b8e6124e-ff3c-4958-86ae-f326607f56df\"],\"192.168.145.0/24\":[],\"IPv4\":[\"366d5130-0dd2-4543-b20b-9c518dbecc9e\",\"6593e35f-4ee2-41ad-bcb7-f88bf14bdd1b\",\"7f4b73b0-14d4-4695-a185-2b62666ee175\",\"b1860fd5-3a8a-45c9-bcc5-8b41a7ed21be\",\"b8e6124e-ff3c-4958-86ae-f326607f56df\"],\"IPv6\":[]}",

CrystalChun avatar Mar 21 '24 17:03 CrystalChun

This change is incorrect because for multi-node UMN there are no machine-networks. Even if we add it and it works we cannot force it because this change is not backward compatible. If you still want it, you might need to either do not filter anything if there are no machine-networks, or create a new factory for this, and use it only in cases when there are available machine-networks. In this case it should not be used for UMN.

Modified the filter to only be used if there are machine networks! https://github.com/openshift/assisted-service/pull/6094/files#diff-e0b0e8e6b4de6465aa429384eadda9ed02b2ceb1300e6b95a2f89b8a0858f658R486

CrystalChun avatar Mar 21 '24 17:03 CrystalChun

/retest

CrystalChun avatar Mar 21 '24 21:03 CrystalChun

@CrystalChun: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Mar 22 '24 00:03 openshift-ci[bot]

In general the concept is that connectivity checks all possible connections. The belongs-to-majority-group checks if there is possible connections between hosts.

Right now it actually checks that there are no impossible connections between hosts. That's a very high bar.

@zaneb are you sure that the requested change does not allow potential failures to occur?

Failures could occur if the actual routes set up don't result in traffic going over the network explicitly specified as the machine network, and also don't result in a successful connection over the networks the traffic is actually sent over. But as you said, in standard assisted installations the machine network is not provided anyway. So this change will only affect ABI.

zaneb avatar Mar 27 '24 03:03 zaneb

It appears that when UMN is enabled (so no VIPs) the interface used for the Node IP is the one that has the default route. If we had the full routing information we could limit to that interface and not have to rely on the user input being correct. Do we?

zaneb avatar Mar 28 '24 04:03 zaneb

It appears that when UMN is enabled (so no VIPs) the interface used for the Node IP is the one that has the default route. If we had the full routing information we could limit to that interface and not have to rely on the user input being correct. Do we?

I believe it needs to be tested

ori-amizur avatar Mar 28 '24 10:03 ori-amizur

Right now it actually checks that there are no impossible connections between hosts. That's a very high bar.

It is all a matter of what OCP dictates.

Failures could occur if the actual routes set up don't result in traffic going over the network explicitly specified as the machine network, and also don't result in a successful connection over the networks the traffic is actually sent over.

There might be other reasons for connectivity failures. Besides, the edge cases need to be tested in order to verify that this change doesn't lead to installation failures.

But as you said, in standard assisted installations the machine network is not provided anyway. So this change will only affect ABI.

We currently do not limit adding machine-networks in UMN. We are not aware if the invoker is ABI.

ori-amizur avatar Mar 28 '24 10:03 ori-amizur

/hold This change doesn't actually act correctly with UMN and multiple machine networks defined

CrystalChun avatar Apr 01 '24 23:04 CrystalChun

Also there's currently a validation that would fail this scenario (UMN, multiple machine nets) https://github.com/openshift/assisted-service/blob/17959954d606e1faac51baa251d5df3e090dd3f9/internal/cluster/validations/validations.go#L825-L828 No more than one machine network can be defined if it's not dual-stack

CrystalChun avatar Apr 01 '24 23:04 CrystalChun

Right now it actually checks that there are no impossible connections between hosts. That's a very high bar.

It is all a matter of what OCP dictates.

If the nodes can talk to each other on their kubelet IPs then OCP will be happy.

zaneb avatar Apr 04 '24 23:04 zaneb

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Jul 04 '24 01:07 openshift-bot