
fork error when running on Kubernetes

uprush opened this issue on Dec 12, 2023 · 2 comments

Hi,

The benchmark failed with the following error when running on Kubernetes. I was able to work around it by setting the environment variable RDMAV_FORK_SAFE=0, but I am not sure whether there is any performance impact or other issues.

root@mlperf-storage:/mlperf/storage# ./benchmark.sh run --workload unet3d --num-accelerators 8 --results-dir /mnt/fb1/unet3d_results --param dataset.data_folder=/mnt/fb1/unet3d_data --param dataset.num_subfolders_train=16 --param dataset.num_files_train=4687
[INFO] 2023-12-12T07:16:13.865342 Running DLIO with 8 process(es) [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:104]
[INFO] 2023-12-12T07:16:13.865599 Reading workload YAML config file '/mlperf/storage/storage-conf/workload/unet3d.yaml' [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:106]
[INFO] 2023-12-12T07:16:13.979505 Max steps per epoch: 146 = 1 * 4687 / 4 / 8 (samples per file * num files / batch size / comm size) [/mlperf/storage/dlio_benchmark/src/dlio_benchmark.py:274]
[INFO] 2023-12-12T07:16:13.979733 Starting epoch 1: 146 steps expected [/mlperf/storage/dlio_benchmark/src/utils/statscounter.py:129]
[INFO] 2023-12-12T07:16:13.980126 Prefetch size is 0; a default prefetch factor of 2 will be set to Torch DataLoader. [/mlperf/storage/dlio_benchmark/src/reader/torch_data_loader_reader.py:123]
[INFO] 2023-12-12T07:16:13.980436 Starting block 1 [/mlperf/storage/dlio_benchmark/src/utils/statscounter.py:195]
A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
          RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.

Your job will now abort.

python3:1099 terminated with signal 6 at PC=7f9034457a7c SP=7ffcd0aa49c0.  Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f9034457a7c]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f9034403476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f90343e97f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e)[0x7f8ea631eb4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeafb8)[0x7f90344abfb8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71)[0x7f90344ab781]
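For reference, the variable has to be in the process environment before libfabric initializes RDMA memory registration, so it would normally be exported in the shell (or set at the very top of a Python launcher) before anything RDMA-related is imported. A minimal sketch, assuming a Python entry point:

```python
import os

# Must be set before libfabric/EFA initializes. "1" requests fork-safe
# memory registration (rdma-core marks registered pages with
# madvise(MADV_DONTFORK)), at some extra registration cost.
os.environ["RDMAV_FORK_SAFE"] = "1"
```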

uprush avatar Dec 12 '23 10:12 uprush

Use RDMAV_FORK_SAFE=1 ./benchmark.sh run --workload unet3d --num-accelerators 8 --results-dir /mnt/fb1/unet3d_results --param dataset.data_folder=/mnt/fb1/unet3d_data --param dataset.num_subfolders_train=16 --param dataset.num_files_train=4687
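Since the benchmark is running on Kubernetes, the same variable can also be set declaratively in the pod spec instead of on the command line. A minimal sketch (the container name and image below are placeholders, not from this setup):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mlperf-storage
spec:
  containers:
    - name: mlperf-storage
      image: mlperf-storage:latest   # placeholder image
      env:
        - name: RDMAV_FORK_SAFE
          value: "1"
```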

johnugeorge avatar Feb 20 '24 12:02 johnugeorge

Any further information on how to run without setting RDMAV_FORK_SAFE=1? There is apparently a performance penalty when running in this state.
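The libfabric message itself suggests the application-level alternative: avoid fork(). Since the failure happens when the Torch DataLoader starts its worker processes, one candidate is starting those workers with the spawn method (PyTorch's DataLoader accepts a `multiprocessing_context` argument); whether DLIO exposes that knob is an assumption to verify. A stdlib-only sketch of the mechanism:

```python
import multiprocessing as mp

def square(x):
    # Stand-in for per-worker data-loading work
    return x * x

def run_pool():
    # "spawn" launches each worker as a fresh interpreter instead of
    # fork()ing the parent, so no RDMA state is inherited and the
    # libfabric EFA fork-safety check is never triggered.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        return pool.map(square, [1, 2, 3])

if __name__ == "__main__":
    print(run_pool())
```

The trade-off is that spawned workers re-import the main module and re-build their state, so worker startup is slower than with fork, but steady-state throughput should be unaffected.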

marktheunissen avatar Oct 04 '24 02:10 marktheunissen