
About VASP run problem

Open p00380563 opened this issue 3 years ago • 2 comments

I get an error when I run VASP with -np 22; with -np 21 it runs fine. I am sure I allocated enough CPU cores to McKernel. Can someone tell me the reason?

Hardware: ARM server, 128 cores. Software: CentOS 7.6 + Open MPI 4.0.5 + McKernel 1.7.

The error is as follows:

[root@localhost VASP_bench_pt]# mpirun -np 22 --allow-run-as-root -x OMP_NUM_THREADS=1 /root/sysroot/bin/mcexec -n 22 ../../bin/vasp_std

There are not enough slots available in the system to satisfy the 22
slots that were requested by the application:

/root/sysroot/bin/mcexec

Either request fewer slots for your application, or make more slots
available for use.

The McKernel information:

[root@localhost sysroot]# ./sbin/ihkosctl 0 query cpu
4-110
[root@localhost sysroot]# ./sbin/ihkosctl 0 query mem
52428800000@0,52428800000@1,52428800000@2,52428800000@3
[root@localhost sysroot]# ./sbin/ihkosctl 0 kmsg
[ 0]: boot_param_size: 65536
[ 0]: %: GICv3
[ 0]: setup_arm64 done.
IHK/McKernel started.
[ 0]: ns_per_tsc: 10000
[ 0]: KCommand Line: hidos dump_level=24 time_sharing
[ 0]: Physical memory: 0x2080310000 - 0x2cb5000000, 52425588736 bytes, 799951 pages available @ NUMA: 0
[ 0]: Physical memory: 0x4000000000 - 0x4c35000000, 52428800000 bytes, 800000 pages available @ NUMA: 1
[ 0]: Physical memory: 0x202000000000 - 0x202c35000000, 52428800000 bytes, 800000 pages available @ NUMA: 2
[ 0]: Physical memory: 0x204000000000 - 0x204c35000000, 52428800000 bytes, 800000 pages available @ NUMA: 3
[ 0]: NUMA: 0, Linux NUMA: 0, type: 1, available bytes: 52425588736, pages: 799951
[ 0]: NUMA: 1, Linux NUMA: 1, type: 1, available bytes: 52428800000, pages: 800000
[ 0]: NUMA: 2, Linux NUMA: 2, type: 1, available bytes: 52428800000, pages: 800000
[ 0]: NUMA: 3, Linux NUMA: 3, type: 1, available bytes: 52428800000, pages: 800000
[ 0]: NUMA 0 distances: 0 (10), 1 (16), 2 (32), 3 (33),
[ 0]: NUMA 1 distances: 1 (10), 0 (16), 2 (25), 3 (32),
[ 0]: NUMA 2 distances: 2 (10), 3 (16), 1 (25), 0 (32),
[ 0]: NUMA 3 distances: 3 (10), 2 (16), 1 (32), 0 (33),
[ 0]: Trampoline area: 0x0
[ 0]: # of cpus : 107
[ 0]: locals = ffff802080380000
[ 0]: BSP: 0 (HW ID: 4 @ NUMA 0)
[ 0]: BSP: booted 106 AP CPUs
[ 0]: Master channel init acked.
[ 0]: Using Linux work IRQ for IKC IPI.
[ 0]: Enable Host mapping vDSO.
IHK/McKernel booted.
[ 32]: schedule: WARNING can't schedule() while no preemption, cnt: 1
[ 32]: schedule: WARNING can't schedule() while no preemption, cnt: 1
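
One possible reading of these numbers, not confirmed anywhere in this thread: when no hostfile is given, Open MPI defaults its slot count to the processor cores it detects on the Linux side, and cores handed to McKernel are presumably no longer visible there. The sketch below only does the arithmetic. It assumes CPU IDs run from 0 to 127 (from the "128 cores" description) and that every core not listed by `ihkosctl 0 query cpu` stays with Linux.

```python
# Assumption: CPU IDs are 0..127, based on the "128 cores" hardware description.
TOTAL_CPUS = 128

# CPUs reported by `ihkosctl 0 query cpu`: the inclusive range 4-110.
mckernel_cpus = set(range(4, 110 + 1))

# Assumption: every CPU not assigned to McKernel remains online under Linux.
linux_cpus = sorted(set(range(TOTAL_CPUS)) - mckernel_cpus)

print(len(mckernel_cpus))  # 107, matching "# of cpus : 107" in the kmsg output
print(len(linux_cpus))     # 21
```

If that reading holds, 21 remaining Linux cores would match the observation that -np 21 works and -np 22 does not. Open MPI's `--oversubscribe` option, or a hostfile with an explicit `slots=` entry, are the usual ways to raise that limit; neither was tried in this thread.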

p00380563 avatar Jul 02 '21 09:07 p00380563

Hi, why are you booting on 107 CPUs? If you insist on running 22 ranks it would be better to boot McKernel using a multiple of 22 cores, e.g., 88? For example, you could try mcreboot -c 40-127

In general we prefer to run on a round number of CPU cores (preferably a power of 2). Also, it's better to leave a few cores from each NUMA node for Linux and to make sure that the McKernel cores are balanced across NUMA domains.
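
The two checks described above (core count divisible by the rank count, cores balanced across NUMA nodes) can be sketched as follows. The helper names `parse_cpu_list` and `cores_per_numa` are hypothetical, and the NUMA mapping is an assumption: 4 nodes of 32 contiguously numbered cores each, which this thread does not confirm. Check the real layout with `lscpu` or `numactl --hardware` before relying on it.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "40-127" or "4-7,36-39" into sorted CPU IDs."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        # A single ID like "5" has no upper bound, so reuse the lower one.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def cores_per_numa(cpus, cores_per_node=32, nodes=4):
    """Count CPUs per NUMA node, ASSUMING node = cpu_id // cores_per_node."""
    counts = [0] * nodes
    for cpu in cpus:
        counts[cpu // cores_per_node] += 1
    return counts


ranks = 22
cpus = parse_cpu_list("40-127")
print(len(cpus), len(cpus) % ranks == 0)  # 88 True
print(cores_per_numa(cpus))               # [0, 24, 32, 32]
```

Under the assumed layout, 40-127 is a multiple of 22 but leaves node 0 unused and node 1 with fewer cores than nodes 2 and 3. A list such as `10-31,42-63,74-95,106-127` would put 22 cores on each node while leaving 10 per node for Linux.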

bgerofi avatar Jul 02 '21 12:07 bgerofi

Hi bgerofi, following your advice I tried booting McKernel on 4 cores of NUMA node 0. The McKernel information:

[root@localhost sysroot]# ./sbin/mcreboot.sh -c 12-15 -m 50000m@0
[root@localhost sysroot]# ./sbin/ihkosctl 0 kmsg
[ 0]: boot_param_size: 65536
[ 0]: %: GICv3
[ 0]: setup_arm64 done.
IHK/McKernel started.
[ 0]: ns_per_tsc: 10000
[ 0]: KCommand Line: hidos dump_level=24 time_sharing
[ 0]: Physical memory: 0x2080300000 - 0x2cb5000000, 52425654272 bytes, 799952 pages available @ NUMA: 0
[ 0]: NUMA: 0, Linux NUMA: 0, type: 1, available bytes: 52425654272, pages: 799952
[ 0]: NUMA 0 distances: 0 (10),
[ 0]: Trampoline area: 0x0
[ 0]: # of cpus : 4
[ 0]: locals = ffff802080340000
[ 0]: BSP: 0 (HW ID: 12 @ NUMA 0)
[ 0]: BSP: booted 3 AP CPUs
[ 0]: Master channel init acked.
[ 0]: Using Linux work IRQ for IKC IPI.
[ 0]: Enable Host mapping vDSO.
IHK/McKernel booted.

Then I tested HPL, but there is no output at all; I think the CPUs hang.

[root@localhost Linux_Arm]# mpirun -np 4 --allow-run-as-root /root/sysroot/bin/mcexec -n 4 ./xhpl

****************************************************************************
* hwloc 2.0.2rc1-git has encountered what looks like an error from the operating system.
*
* Group0 (cpuset 0xffff0fff) intersects with Package (P#36 cpuset 0xffffffff,0xffff0fff nodeset 0x00000003) without inclusion!
* Error occurred in topology.c line 1384
*
* The following FAQ entry in the hwloc documentation may help:
*   What should I do when hwloc reports "operating system" warnings?
* Otherwise please report this error message to the hwloc user's mailing list,
* along with the files generated by the hwloc-gather-topology script.
****************************************************************************

I tried the mcstat command, but the output did not change across three runs:

[root@localhost sysroot]# ./bin/mcstat
------- memory (GB) ------- ------- tsc ------ --- thread ---
total current max system user current max
48.825 0.147 0.147 39 3 12 12
cpuacct_usage_percpu[0] = 5935640
cpuacct_usage_percpu[1] = 5942580
cpuacct_usage_percpu[2] = 5823800
cpuacct_usage_percpu[3] = 5974470
cpuacct_usage_percpu[4] = 0
cpuacct_usage_percpu[5] = 0
cpuacct_usage_percpu[6] = 0
cpuacct_usage_percpu[7] = 0
cpuacct_usage_percpu[8] = 0
cpuacct_usage_percpu[9] = 0
cpuacct_usage_percpu[10] = 0
cpuacct_usage_percpu[11] = 0
[root@localhost sysroot]# ./bin/mcstat
------- memory (GB) ------- ------- tsc ------ --- thread ---
total current max system user current max
48.825 0.147 0.147 39 3 12 12
cpuacct_usage_percpu[0] = 5935640
cpuacct_usage_percpu[1] = 5942580
cpuacct_usage_percpu[2] = 5823800
cpuacct_usage_percpu[3] = 5974470
cpuacct_usage_percpu[4] = 0
cpuacct_usage_percpu[5] = 0
cpuacct_usage_percpu[6] = 0
cpuacct_usage_percpu[7] = 0
cpuacct_usage_percpu[8] = 0
cpuacct_usage_percpu[9] = 0
cpuacct_usage_percpu[10] = 0
cpuacct_usage_percpu[11] = 0
[root@localhost sysroot]# ./bin/mcstat
------- memory (GB) ------- ------- tsc ------ --- thread ---
total current max system user current max
48.825 0.147 0.147 39 3 12 12
cpuacct_usage_percpu[0] = 5935640
cpuacct_usage_percpu[1] = 5942580
cpuacct_usage_percpu[2] = 5823800
cpuacct_usage_percpu[3] = 5974470
cpuacct_usage_percpu[4] = 0
cpuacct_usage_percpu[5] = 0
cpuacct_usage_percpu[6] = 0
cpuacct_usage_percpu[7] = 0
cpuacct_usage_percpu[8] = 0
cpuacct_usage_percpu[9] = 0
cpuacct_usage_percpu[10] = 0
cpuacct_usage_percpu[11] = 0

I don't know what happened. Maybe something in my configuration is wrong?

Then I stopped McKernel:

[root@localhost sysroot]# ./sbin/mcstop+release.sh
error: destroying OS instance 0
error: destroying OS instance 0
error: destroying OS instance 0
error: destroying OS instance 0
error: destroying OS instance 0
error: destroying LWK instance 0 failed
[root@localhost sysroot]#

p00380563 avatar Jul 06 '21 06:07 p00380563