NO-ISSUE: Fix TNA and TNF dummy IP for IPv6
@giladravid16: This pull request explicitly references no jira issue.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
@giladravid16: This pull request references MGMT-22546 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.
In response to this:
A bit of background: when installing TNA/TNF clusters using the assisted service, one of the master nodes acts as the bootstrap. So during the installation there will only be one master node, but we need two in order to configure keepalived. We cannot wait until the bootstrap finishes and becomes a master, because then no node would hold the API VIP. To circumvent that, we temporarily add a dummy IP to the list of nodes. After the bootstrap becomes a master node, its IP replaces the dummy IP in the list.
What does this PR do: right now the dummy IP is always 0.0.0.0, but that doesn't work for clusters using IPv6. This PR fixes that: if the VIP is an IPv4 address the dummy IP will be 0.0.0.0; otherwise it will be ::.
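The family-matching logic described above can be sketched as a small Go helper (the function name `chooseDummyIP` is hypothetical and for illustration only; the actual change in runtimecfg may be structured differently):

```go
package main

import (
	"fmt"
	"net"
)

// chooseDummyIP picks a placeholder node address that matches the address
// family of the VIP: the IPv4 unspecified address for IPv4 VIPs, and the
// IPv6 unspecified address otherwise. This is a sketch of the idea in the
// PR description, not the actual implementation.
func chooseDummyIP(vip string) string {
	ip := net.ParseIP(vip)
	// To4() returns nil for addresses that cannot be represented as IPv4.
	if ip != nil && ip.To4() == nil {
		return "::"
	}
	return "0.0.0.0"
}

func main() {
	fmt.Println(chooseDummyIP("192.168.111.5"))   // 0.0.0.0
	fmt.Println(chooseDummyIP("fd2e:6f44:5dd8::5")) // ::
}
```

Matching the family matters because keepalived configures VRRP instances per address family, so an IPv4 placeholder in an otherwise IPv6 node list is invalid.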
After this is merged, can it also be backported all the way to 4.20?
@giladravid16 yes it can, but you are responsible for Jira hygiene. You need a bug opened with the Target Version field set to 4.22.0; only then can we go ahead with the backport.
/test ?
@mkowalski: The following commands are available to trigger required jobs:
/test e2e-metal-ipi-ovn-ipv6
/test gofmt
/test govet
/test images
/test okd-scos-images
/test security
/test unit
/test verify-deps
The following commands are available to trigger optional jobs:
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv4
/test e2e-openstack
/test okd-scos-e2e-aws-ovn
Use /test all to run the following jobs that were automatically triggered:
pull-ci-openshift-baremetal-runtimecfg-main-e2e-metal-ipi-ovn-ipv6
pull-ci-openshift-baremetal-runtimecfg-main-gofmt
pull-ci-openshift-baremetal-runtimecfg-main-govet
pull-ci-openshift-baremetal-runtimecfg-main-images
pull-ci-openshift-baremetal-runtimecfg-main-okd-scos-images
pull-ci-openshift-baremetal-runtimecfg-main-security
pull-ci-openshift-baremetal-runtimecfg-main-unit
pull-ci-openshift-baremetal-runtimecfg-main-verify-deps
In response to this:
/test ?
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/payload ?
@mkowalski: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.
/payload periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-dualstack
@mkowalski: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-dualstack
@mkowalski: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-dualstack
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e8b9efc0-ebab-11f0-816d-1074d99e701d-0
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ipi-ovn-dualstack-techpreview
@mkowalski: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ipi-ovn-dualstack-techpreview
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/08c0f110-ebac-11f0-830b-52dbe5cfa958-0
/approve /lgtm
/hold Waiting for payload jobs to succeed
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6
@mkowalski: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5a664740-ebac-11f0-8322-31ccbdeeac4f-0
/lgtm cancel
@giladravid16, even with your patch the e2e-agent-ovn-two-node-arbiter-ipv6 failed. Please look at https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-baremetal-runtimecfg-369-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6/2008835257893130240 and figure out what went wrong.
Did you manually test this patch and it worked? Or is it just an attempt to fix?
@mkowalski I tested it with Assisted's CI in https://github.com/openshift/release/pull/72884. I used a custom release image of OCP 4.20 and this PR.
Do you mean the ci/rehearse/openshift/assisted-service/master/edge-e2e-metal-assisted-kube-api-tna-4-19 test or some other one? If some other, can I please get a link to the passing Prow job? I am trying to find a run that used IPv6 and succeeded.
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6
@mkowalski: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/56d17600-ebd3-11f0-8308-6f803e4d321f-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ovn-two-node-arbiter-ipv6
@mkowalski: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ovn-two-node-arbiter-ipv6
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/76af1d60-ebd3-11f0-9c0c-89ae09dab6f5-0
@mkowalski yes, that's the job.
The e2e-agent-ovn-two-node-arbiter-ipv6 job failed before the installation started; it failed during preparing-for-installation.
The reason is that the arbiter node was unable to pull an image, even though the masters were able to.
I'm pretty sure the issue is that the arbiter node doesn't have enough RAM: during this phase each host's filesystem should be half of its RAM.
The arbiter has 8 GB of RAM, so its filesystem is 4 GB, and the image it fails to pull is 2 GB.
The jobs where I am sure we do IPv6 are
- https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-baremetal-runtimecfg-369-nightly-4.21-e2e-agent-ovn-two-node-arbiter-ipv6/2008905598048931840
- https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-baremetal-runtimecfg-369-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6/2008905376468045824
but they do not seem to pass with your PR.
I don't see how ci/rehearse/openshift/assisted-service/master/edge-e2e-metal-assisted-kube-api-tna-4-19 would be an IPv6 job; you need to help me understand this. I have looked at the logs from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/72884/rehearse-72884-pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted-kube-api-tna-4-19/2003771232951996416/artifacts/e2e-metal-assisted-kube-api-tna-4-19/assisted-common-gather/artifacts but the nodes there have IPv4 addresses.
The job ci/rehearse/openshift/assisted-service/master/edge-e2e-metal-assisted-kube-api-tna-4-19 installs two clusters: one uses IPv4 and the other IPv6.
The files that belong to the IPv6 cluster have assisted-spoke-cluster-f62795d5 in their names.
For example, here's a node in the cluster, and the agent cluster install (where you can see the VIPs).
And as I said in my previous comment, the jobs you ran fail before the installation starts. You can see it in the job's logs: the arbiter can't pull quay-proxy.ci.openshift.org/openshift/ci@sha256:aea3543b56f95f21fd574aff73c2ae7baffca24a77a7f75c26617be2e424a678, and I think it's because it doesn't have enough space for it. You can compare it to the periodic job's logs, where the installation does start but gets stuck on waiting-for-bootkube.
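The sizing argument above (live filesystem is half of RAM, and the image must fit alongside what is already on it) can be sketched as a back-of-the-envelope check. The `rootfsFitsImage` helper and the assumption that roughly half the tmpfs is already occupied by the live environment are this sketch's own, not something measured from the job:

```go
package main

import "fmt"

// rootfsFitsImage models the sizing argument: during preparing-for-installation
// each host runs from a filesystem sized at half its RAM. As a hypothetical
// headroom assumption, treat half of that filesystem as already occupied by
// the live environment, so the image must fit in the remaining quarter of RAM.
func rootfsFitsImage(ramGB, imageGB float64) bool {
	fsGB := ramGB / 2        // live filesystem is half of RAM
	freeGB := fsGB / 2       // assumed headroom after live-environment contents
	return imageGB < freeGB
}

func main() {
	// Arbiter: 8 GB RAM gives a 4 GB filesystem; a 2 GB image does not fit
	// once the existing contents are accounted for.
	fmt.Println(rootfsFitsImage(8, 2))
	// A host with 16 GB RAM has room to spare for the same image.
	fmt.Println(rootfsFitsImage(16, 2))
}
```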
/approve /lgtm /verified by @giladravid16