improved-diffusion
Training error on Ubuntu 22.04
OS: Ubuntu 22.04
GPU: RTX 3090
Python: 3.10
mpi4py: 3.5.1
train_bash.sh:
#!/bin/bash
MODEL_FLAGS="--image_size 32 --num_channels 128 --num_res_blocks 3 --learn_sigma True --dropout 0.3 --class_cond True "
DIFFUSION_FLAGS="--diffusion_steps 4000 --noise_schedule cosine"
TRAIN_FLAGS="--lr 1e-4 --batch_size 128"
# Train
python scripts/image_train.py --data_dir ../data $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
I encountered the following problem while training a diffusion model on the CIFAR-10 dataset. Has anyone else run into this, and how can it be solved?
A process has executed an operation involving a call
to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.
For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
RDMAV_FORK_SAFE
However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.
You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.
Your job will now abort.
python:699378 terminated with signal 6 at PC=7f8740a96a7c SP=7ffe9c925f50. Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f8740a96a7c]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f8740a42476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f8740a287f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e)[0x7f873a321b4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeafb8)[0x7f8740aeafb8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71)[0x7f8740aea781]
python(+0x27257e)[0x55d9043c957e]
python(+0x185c51)[0x55d9042dcc51]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyObject_FastCallDictTstate+0x569)[0x55d9043049b9]
python(_PyObject_Call_Prepend+0x6a)[0x55d904304d6a]
python(+0x1adf26)[0x55d904304f26]
python(_PyObject_MakeTpCall+0x2f5)[0x55d9042a0ab5]
python(_PyEval_EvalFrameDefault+0x47bc)[0x55d90432addc]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(+0x1a58f6)[0x55d9042fc8f6]
python(_PyObject_FastCallDictTstate+0x30b)[0x55d90430475b]
python(_PyObject_Call_Prepend+0x6a)[0x55d904304d6a]
python(+0x1adf26)[0x55d904304f26]
python(_PyObject_MakeTpCall+0x2f5)[0x55d9042a0ab5]
python(_PyEval_EvalFrameDefault+0x47bc)[0x55d90432addc]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(+0x18fbc7)[0x55d9042e6bc7]
python(+0x18fc3d)[0x55d9042e6c3d]
python(+0x23d931)[0x55d904394931]
python(PyObject_GetIter+0x16)[0x55d9042a1b76]
python(_PyEval_EvalFrameDefault+0x66a7)[0x55d90432ccc7]
python(+0x240615)[0x55d904397615]
python(+0x191f43)[0x55d9042e8f43]
python(+0x185d61)[0x55d9042dcd61]
python(_PyEval_EvalFrameDefault+0x4b6)[0x55d904326ad6]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4b6)[0x55d904326ad6]
python(+0x1a579a)[0x55d9042fc79a]
python(_PyEval_EvalCodeWithName+0x4b)[0x55d9042fd14b]
python(PyEval_EvalCodeEx+0x44)[0x55d9042fd194]
python(PyEval_EvalCode+0x1c)[0x55d9042fd1bc]
python(+0x2525cd)[0x55d9043a95cd]
python(+0x276196)[0x55d9043cd196]
python(+0x120091)[0x55d904277091]
python(PyRun_SimpleFileExFlags+0x1c1)[0x55d9043d3ee1]
python(Py_RunMain+0x398)[0x55d9043d45b8]
python(Py_BytesMain+0x39)[0x55d9043d4729]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f8740a29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f8740a29e40]
python(+0x203667)[0x55d90435a667]
I have the same problem.
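One thing that may be worth trying is the environment variable the message itself names, RDMAV_FORK_SAFE. The following is only a minimal sketch: the helper name run_fork_safe is hypothetical, and the value 1 is an assumption, since the log names the variable but not a value. As the message warns, enabling it can slow down memory registration. The backtrace reaches fork() through PyObject_GetIter, which would be consistent with a data loader starting a worker process, but that is an inference and not something the log states.

```shell
#!/bin/bash
# Hypothetical helper: run any command with RDMAV_FORK_SAFE set only for that command.
# Assumption: the value 1 enables fork-safe mode; the error text does not give a value.
run_fork_safe() {
    RDMAV_FORK_SAFE=1 "$@"
}

# Usage with the flags from train_bash.sh (left commented so the sketch has no side effects):
# run_fork_safe python scripts/image_train.py --data_dir ../data $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
```

Setting the variable per command, instead of exporting it for the whole shell session, keeps the possible performance cost limited to the training run.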