os-autoinst-distri-opensuse icon indicating copy to clipboard operation
os-autoinst-distri-opensuse copied to clipboard

ipxe_install: Make the set_bootscript_hdd() delay configurable

Open mdoucha opened this issue 1 year ago • 9 comments

Some workers need more than 2 minutes to reach PXE bootloader after power-on. Add a variable PXE_BOOT_TIME to control the bootscript switch delay for UEFI machines so that it can be configured on each worker.

This is the correct solution for #17769

  • Related ticket: N/A
  • Needles: N/A
  • Verification runs:
    • without PXE_BOOT_TIME: https://openqa.suse.de/tests/12223015
    • with PXE_BOOT_TIME: https://openqa.suse.de/tests/12223016
    • without AUTOYAST: https://openqa.suse.de/tests/12223020

mdoucha avatar Sep 15 '23 10:09 mdoucha

Both verification runs worked as expected, note the difference in ipxe_install run time. The failure in autoyast/console is expected until #17721 gets merged.

mdoucha avatar Sep 15 '23 11:09 mdoucha

Please allow me to share my opinion and help a bit :)
I followed the discussion in Slack, and thank you very much that we indicate the root cause is that set_bootscript_hdd is only specified to Supermicro machine, but hurt for Dell, HP, Lenovo machines that we also have . Since we are hardly to collecting and setting the exact booting time for each machine, IMO, it sounds more efficiency and effective to specify for the machine brand type(Supermicro), rather than to each machine.

@czerw I guess set_bootscript_hdd was added for kernel testing need on supermicro machine, do you have any opinion or suggestion that we can help here to make the solution be much easier for each team for each different machine?

Another solution is https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17769/files by @Dawei-Pang

cachen avatar Sep 18 '23 04:09 cachen

Please allow me to share my opinion and help a bit :) I followed the discussion in Slack, and thank you very much that we indicate the root cause is that set_bootscript_hdd is only specified to Supermicro machine, but hurt for Dell, HP, Lenovo machines that we also have . Since we are hardly to collecting and setting the exact booting time for each machine, IMO, it sounds more efficiency and effective to specify for the machine brand type(Supermicro), rather than to each machine.

I never said that Supermicro is the only vendor with this problem. I have no idea how many other vendors may be affected.

Measuring boot time is as easy as running one job on the machine with video capture enabled (NOVIDEO=0) and then reading timestamps from the video.

mdoucha avatar Sep 18 '23 08:09 mdoucha

@Dawei-Pang @cachen I'm sorry I don't understand what is the issue. Fixed sleep is in the code for 3 years already since https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/11273 and nobody complained. This PR just improves existing code to have delay configurable. If the variable is not defined, it will just use same behavior like before. I will also reply in https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/17769/f

czerw avatar Sep 18 '23 08:09 czerw

Please allow me to share my opinion and help a bit :) I followed the discussion in Slack, and thank you very much that we indicate the root cause is that set_bootscript_hdd is only specified to Supermicro machine, but hurt for Dell, HP, Lenovo machines that we also have . Since we are hardly to collecting and setting the exact booting time for each machine, IMO, it sounds more efficiency and effective to specify for the machine brand type(Supermicro), rather than to each machine.

I never said that Supermicro is the only vendor with this problem. I have no idea how many other vendors may be affected.

Measuring boot time is as easy as running one job on the machine with video capture enabled (NOVIDEO=0) and then reading timestamps from the video.

From the discussion in the slack, I suppose you verified the PR#17583 on Supermicro machine, if the PR#17583 verified on other brand/model machine, please let me know. Thanks!

If the PR#17583 only verified or workaround for supermicro machine, please add a checker to limit for supermicro only, because the code without set_bootscript_hdd if get_var('IPXE_UEFI'); can run successfully without any issue for a long time.

Dawei-Pang avatar Sep 18 '23 12:09 Dawei-Pang

@Dawei-Pang @cachen I'm sorry I don't understand what is the issue. Fixed sleep is in the code for 3 years already since #11273 and nobody complained. This PR just improves existing code to have delay configurable. If the variable is not defined, it will just use same behavior like before. I will also reply in #17769

Thanks, at beginning, I think this is timing issue, after some research, yes, the sleep code is running for a long time. The issue comes from PR#17583, adding set_bootscript_hdd if get_var('IPXE_UEFI'); is root cause.

Dawei-Pang avatar Sep 18 '23 12:09 Dawei-Pang

Please allow me to share my opinion and help a bit :) I followed the discussion in Slack, and thank you very much that we indicate the root cause is that set_bootscript_hdd is only specified to Supermicro machine, but hurt for Dell, HP, Lenovo machines that we also have . Since we are hardly to collecting and setting the exact booting time for each machine, IMO, it sounds more efficiency and effective to specify for the machine brand type(Supermicro), rather than to each machine.

I never said that Supermicro is the only vendor with this problem. I have no idea how many other vendors may be affected.

@mdoucha Hello Martin, no worry I didn't say you said that, but how I concluded from yours and Dawei's observation when read the discussion in slack, and I believe now you are aware from Dawei, besides Supermicro, that are other vendor's are affected.

"One more time: SuperMicro UEFI machines ignore the "force PXE boot" flag." "They have to be configured to ALWAYS use the following BIOS boot order:" "I can confirm the SLE installer change the UEFI boot order to installed OS as default behaviour on HPE, DELL and LENOVO/IBM system except for other action"

Measuring boot time is as easy as running one job on the machine with video capture enabled (NOVIDEO=0) and then reading timestamps from the video.

Think about if the code can be much easier to adapting for dozens of machines with different vendor. And if the solution in pr#17769 bring difficult or challenge for kernel team to adapting your test on your machines.


@czerw Hope Dawei has explained to you in his reply, that the main issue and root cause is not the timing, but was introduced by PR#17583, adding set_bootscript_hdd if get_var('IPXE_UEFI'); so whether revert PR#17583, or how to improving the PR is the topic.


I will stop here to do any micro-management to influence the technical decision :-) The last suggestion that Vit and I from TL's view is, perhaps it will be helpful to have a debugging session that Dawei and Martin together, to learn from each other what is the challenge and what is the exactly expected, and of course other relevant engineers can be invited if need.

cachen avatar Sep 19 '23 09:09 cachen

I've changed the set_bootscript_hdd() logic to use a special flag that should be added only to the workers/jobs that need it. The original IPXE_UEFI condition was wrong.

mdoucha avatar Sep 19 '23 11:09 mdoucha

Verification runs updated.

mdoucha avatar Sep 19 '23 11:09 mdoucha

@Dawei-Pang @frankenmichl could we have a look on this one please and maybe have it merged?

czerw avatar Apr 19 '24 08:04 czerw

Please allow me to involve @Julie-CAO to see if the changes break her testing.

Dawei-Pang avatar Apr 22 '24 07:04 Dawei-Pang

Please allow me to involve @Julie-CAO to see if the changes break her testing.

I will verify on VIRT-AUTO tests, wait a few fours.

Julie-CAO avatar Apr 22 '24 08:04 Julie-CAO

This PR is quite old so I've just rebased it to the latest master.

mdoucha avatar Apr 22 '24 09:04 mdoucha

Please allow me to involve @Julie-CAO to see if the changes break her testing.

I will verify on VIRT-AUTO tests, wait a few fours.

I had two successful runs, I think the PR is ok. @alice-suse , are you also ok with this change? sle15sp6_kvm_on_non_uefi_machine on_UEFI_kermit_on_OSD

Julie-CAO avatar Apr 23 '24 01:04 Julie-CAO

LGTM, and verified for usb install, works too.

alice-suse avatar Apr 23 '24 03:04 alice-suse

LGTM, thank you all!

Dawei-Pang avatar Apr 30 '24 06:04 Dawei-Pang

Thanks all, I think we can merge. Just one thing came to my mind, shouldn't we mention variable also in variables.md ?

czerw avatar Apr 30 '24 07:04 czerw

Thanks all, I think we can merge. Just one thing came to my mind, shouldn't we mention variable also in variables.md ?

Added variable documentation to variables.md and rebased to latest master. No other changes.

mdoucha avatar Apr 30 '24 13:04 mdoucha