tp-qemu

netperf: implements dynamic NUMA binding

Open mcasquer opened this issue 1 year ago • 8 comments

The test always took the last NUMA node to bind the VM's memory. On some systems the last NUMA node may have no memory and/or CPUs assigned, so this patch updates the test to take the first valid node instead.

Signed-off-by: mcasquer [email protected] ID: 3321
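As a rough illustration of the "first valid node" idea described above, here is a minimal sketch. It is not the code from this patch and does not use the avocado-vt API: `NodeInfo` and `first_valid_node` are hypothetical names, and "valid" is assumed to mean that the node has both memory and CPUs.

```python
from typing import Dict, List, NamedTuple, Optional


class NodeInfo(NamedTuple):
    """Hypothetical per-node summary; not an avocado-vt type."""
    mem_total_kb: int
    cpus: List[int]


def first_valid_node(nodes: Dict[int, NodeInfo]) -> Optional[int]:
    """Return the lowest node ID that has both memory and CPUs, else None."""
    for node_id in sorted(nodes):
        info = nodes[node_id]
        if info.mem_total_kb > 0 and info.cpus:
            return node_id
    return None


# Made-up host layout: node 0 has CPUs but no memory, node 1 is usable,
# and node 2 (the last node) has neither memory nor CPUs.
host = {
    0: NodeInfo(mem_total_kb=0, cpus=[0, 1]),
    1: NodeInfo(mem_total_kb=16_000_000, cpus=[2, 3]),
    2: NodeInfo(mem_total_kb=0, cpus=[]),
}
print(first_valid_node(host))
```

Always picking the last node would select node 2 in this layout, which has nothing to bind to. Returning `None` when no node qualifies leaves it to the caller to cancel the test rather than start the VM with an unusable binding.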

mcasquer avatar Jan 19 '25 19:01 mcasquer

The test cases didn't pass, but the errors are not related to this patch:

 (1/3) Host_RHEL.m9.u5.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.q35: STARTED
 (1/3) Host_RHEL.m9.u5.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.q35: ERROR: local variable 'client_pub_ip' referenced before assignment (384.46 s)
 (2/3) Host_RHEL.m9.u5.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.best_registry_setting.q35: STARTED
 (2/3) Host_RHEL.m9.u5.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.best_registry_setting.q35: ERROR: local variable 'client_pub_ip' referenced before assignment (376.33 s)
 (3/3) Host_RHEL.m9.u5.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.cygwin.q35: STARTED
 (3/3) Host_RHEL.m9.u5.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.cygwin.q35: ERROR: Timeout expired while waiting for shell command to complete: 'C:\\rhcygwin\\Cygwin.bat -i /Cygwin-Terminal.ico -'    (output: 'The system cannot find the path specified.\n\nC:\\>') (272.61 s)
RESULTS    : PASS 0 | ERROR 3 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0

mcasquer avatar Jan 19 '25 20:01 mcasquer

@heywji could you review this PR? Thanks !

mcasquer avatar Jan 19 '25 20:01 mcasquer

LGTM. Thanks to Mario for his efforts and help.

Hello, other reviewers.

Let me explain some background here. The netperf config used in my netkvm test loop is netperf_stress_test.cfg. One day, though, I typed the test case name as 'netperf', and some errors were reported. After talking with @mcasquer, we confirmed they were caused by the NUMA node memory issue.

This is a genuine improvement for the NUMA issue, even though it is not directly connected to my netkvm test loop.

heywji avatar Jan 20 '25 06:01 heywji

@zhencliu @PaulYuuu please, could you review this PR and Wenkang's comment? Thanks!

mcasquer avatar Jan 20 '25 06:01 mcasquer

@heywji please, whenever possible, could you test the latest patch changes? I saw some QEMU and avocado processes already running on your host... thanks!

mcasquer avatar Jan 24 '25 06:01 mcasquer

LGTM

 (1/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: STARTED
 (1/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: PASS (1596.48 s)
 (2/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.with_jumbo.host_guest.q35: STARTED
 (2/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.with_jumbo.host_guest.q35: CANCEL: The node: 7 used for VM pinning is not valid (17.43 s)
 (3/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.with_jumbo.host_guest.best_registry_setting.q35: STARTED
 (3/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.with_jumbo.host_guest.best_registry_setting.q35: CANCEL: The node: 7 used for VM pinning is not valid (17.13 s)
 (4/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.with_jumbo.host_guest.cygwin.q35: STARTED
 (4/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.with_jumbo.host_guest.cygwin.q35: CANCEL: The node: 7 used for VM pinning is not valid (17.19 s)
 (5/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.default.host_guest.q35: STARTED
 (5/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.default.host_guest.q35: CANCEL: The node: 7 used for VM pinning is not valid (17.22 s)
 (6/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.default.host_guest.best_registry_setting.q35: STARTED
 (6/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.default.host_guest.best_registry_setting.q35: CANCEL: The node: 7 used for VM pinning is not valid (17.12 s)
 (7/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.default.host_guest.cygwin.q35: STARTED
 (7/7) Host_RHEL.m9.u5.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.i386.io-github-autotest-qemu.netperf.default.host_guest.cygwin.q35: CANCEL: The node: 7 used for VM pinning is not valid (17.06 s)
RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 6

heywji avatar Jan 24 '25 15:01 heywji

@zhencliu @PaulYuuu please, could you review this PR again? Thanks!

mcasquer avatar Feb 11 '25 09:02 mcasquer

Hi Mario, it looks like there are still 2 pending inline comments from my side, especially regarding the preprocess: your test passed because you don't need a second disk and image1 had already been created, IMO. But it may be safer to call the preprocess of both VMs and images when not_preprocess = yes. What do you think?

zhencliu avatar Feb 11 '25 09:02 zhencliu

@zhencliu code updated. @heywji please, could you give it another try?

mcasquer avatar Feb 26 '25 09:02 mcasquer

@mcasquer Yes, I am testing it. I will post the patch's results when the run is done.

heywji avatar Mar 04 '25 05:03 heywji

LGTM

heywji avatar Mar 05 '25 14:03 heywji

@zhencliu any more comments on this PR?

mcasquer avatar Mar 06 '25 06:03 mcasquer

Test results with Win10 VM

 (1/4) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: STARTED
 (1/4) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads.q35: PASS (2526.82 s)
 (2/4) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.q35: STARTED
 (2/4) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.q35: ERROR: cannot access local variable 'client_pub_ip' where it is not associated with a value (365.73 s)
 (3/4) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.best_registry_setting.q35: STARTED
 (3/4) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.best_registry_setting.q35: ERROR: cannot access local variable 'client_pub_ip' where it is not associated with a value (390.45 s)
 (4/4) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.cygwin.q35: STARTED
 (4/4) Host_RHEL.m10.u0.ovmf.qcow2.virtio_scsi.up.virtio_net.Guest.Win10.x86_64.io-github-autotest-qemu.netperf.with_jumbo.host_guest.cygwin.q35: ERROR: Timeout expired while waiting for shell command to complete: 'C:\\rhcygwin\\Cygwin.bat -i /Cygwin-Terminal.ico -'    (output: 'The system cannot find the path specified.\n\nC:\\>') (263.49 s)
RESULTS    : PASS 1 | ERROR 3 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0

As discussed with @zhencliu and @heywji, these failures are not related to this patch, since running the tests without it leads to the same results. The log does show, however, that the VM is no longer forced to boot bound to the last NUMA node:

[stdlog] 2025-03-12 05:12:50,677 avocado.virttest.qemu_vm qemu_vm          L3839 INFO | Running qemu command (reformatted):
[stdlog] MALLOC_PERTURB_=1 numactl \
[stdlog]     -m 0  /usr/libexec/qemu-kvm \

mcasquer avatar Mar 12 '25 09:03 mcasquer

Thanks for the information. client_pub_ip is defined inside an elif code block, so if the test never enters that elif, client_pub_ip is not defined. You can push another patch to fix it :-)
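To illustrate the pattern, here is a minimal sketch. It is not the actual netperf test code: the function names, the `client` values, and the IP address are hypothetical. Both error messages in the logs above ("referenced before assignment" and "cannot access local variable ... where it is not associated with a value") are the same `UnboundLocalError`, worded differently by different Python versions.

```python
def buggy(client: str) -> str:
    """The name is only bound inside the elif branch."""
    if client == "localhost":
        pass  # nothing assigns client_pub_ip on this path
    elif client == "remote":
        client_pub_ip = "192.0.2.10"  # documentation-range address, made up
    return client_pub_ip  # UnboundLocalError when the elif was skipped


def fixed(client: str) -> str:
    """Bind the name up front and fail with a clear message instead."""
    client_pub_ip = None
    if client == "remote":
        client_pub_ip = "192.0.2.10"
    if client_pub_ip is None:
        raise ValueError(f"no public IP configured for client {client!r}")
    return client_pub_ip


try:
    buggy("localhost")
except UnboundLocalError as exc:
    print("buggy:", exc)

print("fixed:", fixed("remote"))
```

Initializing the variable before the branches is one possible fix. Whether the real test should raise, cancel, or fall back to a default address when no branch applies is a decision for the follow-up patch.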

zhencliu avatar Mar 12 '25 10:03 zhencliu