[DLIO_PROFILER ERROR]: signal caught 6 if benchmark is run with RDMA mount and data_loader: dali
mount: IP:/ifs on /mnt/1/ifs type nfs (rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,forcerdirplus,proto=rdma,nconnect=24,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=IP,mountvers=3,mountproto=tcp,local_lock=none,addr=IP)
h# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
h# ./benchmark.sh run --hosts HOST --workload resnet50 --accelerator-type h100 --num-accelerators 2 --results-dir resultsdir-$(date +"%d-%m-%Y") --param dataset.num_files_train=1200 --param dataset.data_folder=/mnt/1/ifs/data/rosnet50_05_04_2024_x02
[INFO] 2024-04-08T16:08:03.406678 Profiling DLIO /root/aan/storage/resultsdir-08-04-2024/trace-0-of-2.pfw [/root/aan/storage/dlio_benchmark/dlio_benchmark/utils/config.py:189]
[INFO] 2024-04-08T16:08:03.407010 Running DLIO with 2 process(es) [/root/aan/storage/dlio_benchmark/dlio_benchmark/main.py:98]
[INFO] 2024-04-08T16:08:03.635388 Max steps per epoch: 1876 = 1251 * 1200 / 400 / 2 (samples per file * num files / batch size / comm size) [/root/aan/storage/dlio_benchmark/dlio_benchmark/main.py:322]
[INFO] 2024-04-08T16:08:07.053113 Starting epoch 1: 1876 steps expected [/root/aan/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:128]
[INFO] 2024-04-08T16:08:07.053269 Starting block 1 [/root/aan/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:198]
A process has executed an operation involving a call to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in a condition that could result in memory corruption or other system errors.
For the libfabric EFA provider to work safely when fork() is called, you will need to set the following environment variable: RDMAV_FORK_SAFE
However, setting this environment variable can result in signficant performance impact to your application due to increased cost of memory registration.
You may want to check with your application vendor to see if an application-level alternative (of not using fork) exists.
Your job will now abort.
Your job will now abort.
[DLIO_PROFILER ERROR]: signal caught 6
/usr/local/lib/python3.10/dist-packages/dlio_profiler_py.cpython-310-x86_64-linux-gnu.so(+0x30325) [0x7f8db87ca325]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f8e0f8fd520]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c) [0x7f8e0f9519fc]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16) [0x7f8e0f8fd476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3) [0x7f8e0f8e37f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e) [0x7f8db8a76b4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeaf48) [0x7f8e0f9a5f48]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71) [0x7f8e0f9a5711]
python3(+0x287a6e) [0x56136209ba6e]
python3(+0x157a3e) [0x561361f6ba3e]
python3(_PyEval_EvalFrameDefault+0x614a) [0x561361f5ccfa]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyObject_FastCallDictTstate+0xc4) [0x561361f63c14]
python3(+0x164a64) [0x561361f78a64]
python3(_PyObject_MakeTpCall+0x1fc) [0x561361f64a1c]
python3(_PyEval_EvalFrameDefault+0x64e6) [0x561361f5d096]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x614a) [0x561361f5ccfa]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyObject_FastCallDictTstate+0xc4) [0x561361f63c14]
python3(+0x164b05) [0x561361f78b05]
python3(_PyObject_MakeTpCall+0x1fc) [0x561361f64a1c]
python3(_PyEval_EvalFrameDefault+0x64e6) [0x561361f5d096]
python3(+0x1687f1) [0x561361f7c7f1]
python3(_PyEval_EvalFrameDefault+0x614a) [0x561361f5ccfa]
python3(+0x1687f1) [0x561361f7c7f1]
python3(_PyEval_EvalFrameDefault+0x198c) [0x561361f5853c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(_PyEval_EvalFrameDefault+0x8ac) [0x561361f5745c]
python3(_PyFunction_Vectorcall+0x7c) [0x561361f6e9fc]
python3(PyObject_Call+0x122) [0x561361f7d492]
python3(_PyEval_EvalFrameDefault+0x2a27) [0x561361f595d7]
python3(+0x1687f1) [0x561361f7c7f1]
python3(_PyEval_EvalFrameDefault+0x198c) [0x561361f5853c]
^C[mpiexec@mpl078d] Sending Ctrl-C to processes as requested
[mpiexec@mpl078d] Press Ctrl-C again to force abort
[DLIO_PROFILER ERROR]: signal caught 2
[DLIO_PROFILER ERROR]: signal caught 2
^CCtrl-C caught... cleaning up processes
With RDMAV_FORK_SAFE=1 set, the benchmark runs without raising the exception, but no load is generated.
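
For anyone trying to reproduce this, one thing worth ruling out is whether the variable actually reaches the child processes. The following is a minimal, generic Python sketch of that check. It is not taken from DLIO or DALI, and the helper names fork_safe_env and child_sees_flag are made up for illustration. It only confirms that a child process started with the modified environment sees RDMAV_FORK_SAFE=1; it does not explain why no load is generated.

```python
import os
import subprocess
import sys


def fork_safe_env(base=None):
    """Return a copy of the environment with RDMAV_FORK_SAFE=1 added.

    Hypothetical helper for illustration; not part of DLIO or DALI.
    """
    env = dict(os.environ if base is None else base)
    env["RDMAV_FORK_SAFE"] = "1"
    return env


def child_sees_flag(env):
    """Start a child Python process and return the value it sees."""
    probe = "import os; print(os.environ.get('RDMAV_FORK_SAFE', 'unset'))"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Prints the value of RDMAV_FORK_SAFE as observed by the child process.
    print(child_sees_flag(fork_safe_env()))
```

When the benchmark is started through mpiexec, the variable presumably also has to be present in the environment of every rank. How that is done depends on the MPI launcher in use and is not shown here.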