NO-ISSUE: Fix TNA and TNF dummy ip for ipv6

Open giladravid16 opened this issue 5 months ago • 3 comments

giladravid16 avatar Sep 08 '25 07:09 giladravid16

@giladravid16: This pull request explicitly references no jira issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Sep 08 '25 07:09 openshift-ci-robot

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

openshift-ci[bot] avatar Sep 08 '25 07:09 openshift-ci[bot]

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Dec 07 '25 09:12 openshift-bot

@giladravid16: This pull request references MGMT-22546 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.

In response to this:

A bit of background: when installing TNA/TNF clusters using the assisted service, one of the master nodes acts as the bootstrap. During the installation there is therefore only one master node, but we need two in order to configure keepalived. We cannot wait until the bootstrap finishes and becomes a master, because then no node would hold the API VIP. To work around this we temporarily add a dummy IP to the list of nodes. After the bootstrap becomes a master node, its IP replaces the dummy IP in the list.

What does this PR do: right now the dummy IP is always 0.0.0.0, but that doesn't work for clusters using IPv6. This PR makes the dummy IP match the VIP's address family: if the VIP is an IPv4 address the dummy IP stays 0.0.0.0; otherwise it is ::.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Dec 24 '25 13:12 openshift-ci-robot

After this is merged, can it also be backported all the way to 4.20?

giladravid16 avatar Dec 24 '25 16:12 giladravid16

@giladravid16 yes it can, but you are responsible for Jira hygiene. You need a bug opened with the Target Version field set to 4.22.0; only then can we proceed with the backport.

mkowalski avatar Jan 07 '26 09:01 mkowalski

/test ?

mkowalski avatar Jan 07 '26 09:01 mkowalski

@mkowalski: The following commands are available to trigger required jobs:

/test e2e-metal-ipi-ovn-ipv6
/test gofmt
/test govet
/test images
/test okd-scos-images
/test security
/test unit
/test verify-deps

The following commands are available to trigger optional jobs:

/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv4
/test e2e-openstack
/test okd-scos-e2e-aws-ovn

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-baremetal-runtimecfg-main-e2e-metal-ipi-ovn-ipv6
pull-ci-openshift-baremetal-runtimecfg-main-gofmt
pull-ci-openshift-baremetal-runtimecfg-main-govet
pull-ci-openshift-baremetal-runtimecfg-main-images
pull-ci-openshift-baremetal-runtimecfg-main-okd-scos-images
pull-ci-openshift-baremetal-runtimecfg-main-security
pull-ci-openshift-baremetal-runtimecfg-main-unit
pull-ci-openshift-baremetal-runtimecfg-main-verify-deps

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Jan 07 '26 09:01 openshift-ci[bot]

/payload ?

mkowalski avatar Jan 07 '26 09:01 mkowalski

@mkowalski: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

openshift-ci[bot] avatar Jan 07 '26 09:01 openshift-ci[bot]

/payload periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-dualstack

mkowalski avatar Jan 07 '26 09:01 mkowalski

@mkowalski: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

openshift-ci[bot] avatar Jan 07 '26 09:01 openshift-ci[bot]

/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-dualstack

mkowalski avatar Jan 07 '26 09:01 mkowalski

@mkowalski: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-dualstack

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e8b9efc0-ebab-11f0-816d-1074d99e701d-0

openshift-ci[bot] avatar Jan 07 '26 09:01 openshift-ci[bot]

/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ipi-ovn-dualstack-techpreview

mkowalski avatar Jan 07 '26 09:01 mkowalski

@mkowalski: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.22-e2e-metal-ipi-ovn-dualstack-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/08c0f110-ebac-11f0-830b-52dbe5cfa958-0

openshift-ci[bot] avatar Jan 07 '26 09:01 openshift-ci[bot]

/approve /lgtm

/hold Waiting for payload jobs to succeed

mkowalski avatar Jan 07 '26 09:01 mkowalski

/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6

mkowalski avatar Jan 07 '26 09:01 mkowalski

@mkowalski: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5a664740-ebac-11f0-8322-31ccbdeeac4f-0

openshift-ci[bot] avatar Jan 07 '26 09:01 openshift-ci[bot]

/lgtm cancel

@giladravid16, even with your patch the e2e-agent-ovn-two-node-arbiter-ipv6 job failed. Please look at https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-baremetal-runtimecfg-369-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6/2008835257893130240 and figure out what went wrong.

Did you manually test this patch and confirm it worked? Or is this just an attempt at a fix?

mkowalski avatar Jan 07 '26 12:01 mkowalski

@mkowalski I tested it with Assisted's CI in https://github.com/openshift/release/pull/72884. I used a custom release image of OCP 4.20 and this PR.

giladravid16 avatar Jan 07 '26 13:01 giladravid16

I tested it with Assisted's CI in openshift/release#72884

Do you mean the ci/rehearse/openshift/assisted-service/master/edge-e2e-metal-assisted-kube-api-tna-4-19 test or some other? If some other, can I please get a link to the passing Prow job? I am trying to see something that was IPv6 and succeeded.

/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6

mkowalski avatar Jan 07 '26 14:01 mkowalski

@mkowalski: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/56d17600-ebd3-11f0-8308-6f803e4d321f-0

openshift-ci[bot] avatar Jan 07 '26 14:01 openshift-ci[bot]

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ovn-two-node-arbiter-ipv6

mkowalski avatar Jan 07 '26 14:01 mkowalski

@mkowalski: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ovn-two-node-arbiter-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/76af1d60-ebd3-11f0-9c0c-89ae09dab6f5-0

openshift-ci[bot] avatar Jan 07 '26 14:01 openshift-ci[bot]

@mkowalski yes, that's the job. The e2e-agent-ovn-two-node-arbiter-ipv6 job failed before the installation started, during preparing-for-installation. The reason is that the arbiter node was unable to pull an image, even though the masters were able to. I'm pretty sure the issue is that the arbiter node doesn't have enough RAM: during this phase each host's filesystem is sized at half of its RAM. The arbiter has 8GB of RAM, so its filesystem is 4GB, and the image it fails to pull is 2GB.
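The sizing argument above can be sketched as follows; the factor-of-two pull overhead (space for both the compressed layers and the unpacked copy) is an assumption for illustration, not something stated in the thread:

```go
package main

import "fmt"

// pullFits is a rough, hypothetical model of the comment above:
// during preparing-for-installation each host's filesystem is sized
// at half of its RAM, and a pull is assumed to need space for both
// the compressed layers and the unpacked image.
func pullFits(ramGiB, imageGiB float64) bool {
	fsGiB := ramGiB / 2    // filesystem is half of RAM
	needed := 2 * imageGiB // assumed compressed + unpacked
	return needed < fsGiB
}

func main() {
	// Arbiter from the thread: 8 GiB RAM -> 4 GiB filesystem, 2 GiB image.
	fmt.Println(pullFits(8, 2)) // false: the pull does not fit
}
```

Under this assumed model the 2GB image needs about 4GB of scratch space, which is exactly the arbiter's filesystem capacity, consistent with the pull failing there while succeeding on the larger masters.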

giladravid16 avatar Jan 07 '26 14:01 giladravid16

The jobs where I am sure we do IPv6 are

  1. https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-baremetal-runtimecfg-369-nightly-4.21-e2e-agent-ovn-two-node-arbiter-ipv6/2008905598048931840
  2. https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-baremetal-runtimecfg-369-nightly-4.22-e2e-agent-ovn-two-node-arbiter-ipv6/2008905376468045824

but they do not seem to pass with your PR.

I don't see how ci/rehearse/openshift/assisted-service/master/edge-e2e-metal-assisted-kube-api-tna-4-19 would be an IPv6 job; you need to help me understand this. I have looked at the logs from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/72884/rehearse-72884-pull-ci-openshift-assisted-service-master-edge-e2e-metal-assisted-kube-api-tna-4-19/2003771232951996416/artifacts/e2e-metal-assisted-kube-api-tna-4-19/assisted-common-gather/artifacts but the nodes there have IPv4 addresses.

mkowalski avatar Jan 07 '26 15:01 mkowalski

The job ci/rehearse/openshift/assisted-service/master/edge-e2e-metal-assisted-kube-api-tna-4-19 installs 2 clusters: one uses IPv4 and the other IPv6. The files belonging to the IPv6 cluster have assisted-spoke-cluster-f62795d5 in their names. For example, here's a node in the cluster, and the agent cluster install (where you can see the VIPs).

And as I said in my previous comment, the jobs you ran fail before the installation starts. You can see it in the job's logs: the arbiter can't pull quay-proxy.ci.openshift.org/openshift/ci@sha256:aea3543b56f95f21fd574aff73c2ae7baffca24a77a7f75c26617be2e424a678, and I think it's because it doesn't have enough space for it. Compare it to the periodic job's logs, where the installation does start but gets stuck on waiting-for-bootkube.

giladravid16 avatar Jan 08 '26 07:01 giladravid16

/approved /lgtm /verified by @giladravid16

mkowalski avatar Jan 08 '26 15:01 mkowalski