Training error on Ubuntu 22.04

Open youyuanyi opened this issue 2 years ago • 1 comments

OS: Ubuntu 22.04
GPU: RTX 3090
Python: 3.10
mpi4py: 3.5.1

train_bash.sh:

#!/bin/bash

MODEL_FLAGS="--image_size 32 --num_channels 128 --num_res_blocks 3 --learn_sigma True --dropout 0.3 --class_cond True "
DIFFUSION_FLAGS="--diffusion_steps 4000 --noise_schedule cosine"
TRAIN_FLAGS="--lr 1e-4 --batch_size 128"

# Train
python scripts/image_train.py --data_dir ../data $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I ran into the following error while training a diffusion model on the CIFAR-10 dataset. Has anyone else encountered this, and how can it be solved?

A process has executed an operation involving a call
to the fork() system call to create a child process.

As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.

For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
          RDMAV_FORK_SAFE

However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.

You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.

Your job will now abort.

python:699378 terminated with signal 6 at PC=7f8740a96a7c SP=7ffe9c925f50.  Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f8740a96a7c]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f8740a42476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f8740a287f3]
/lib/x86_64-linux-gnu/libfabric.so.1(+0x76b4e)[0x7f873a321b4e]
/lib/x86_64-linux-gnu/libc.so.6(+0xeafb8)[0x7f8740aeafb8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_fork+0x71)[0x7f8740aea781]
python(+0x27257e)[0x55d9043c957e]
python(+0x185c51)[0x55d9042dcc51]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyObject_FastCallDictTstate+0x569)[0x55d9043049b9]
python(_PyObject_Call_Prepend+0x6a)[0x55d904304d6a]
python(+0x1adf26)[0x55d904304f26]
python(_PyObject_MakeTpCall+0x2f5)[0x55d9042a0ab5]
python(_PyEval_EvalFrameDefault+0x47bc)[0x55d90432addc]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4910)[0x55d90432af30]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(+0x1a58f6)[0x55d9042fc8f6]
python(_PyObject_FastCallDictTstate+0x30b)[0x55d90430475b]
python(_PyObject_Call_Prepend+0x6a)[0x55d904304d6a]
python(+0x1adf26)[0x55d904304f26]
python(_PyObject_MakeTpCall+0x2f5)[0x55d9042a0ab5]
python(_PyEval_EvalFrameDefault+0x47bc)[0x55d90432addc]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(+0x18fbc7)[0x55d9042e6bc7]
python(+0x18fc3d)[0x55d9042e6c3d]
python(+0x23d931)[0x55d904394931]
python(PyObject_GetIter+0x16)[0x55d9042a1b76]
python(_PyEval_EvalFrameDefault+0x66a7)[0x55d90432ccc7]
python(+0x240615)[0x55d904397615]
python(+0x191f43)[0x55d9042e8f43]
python(+0x185d61)[0x55d9042dcd61]
python(_PyEval_EvalFrameDefault+0x4b6)[0x55d904326ad6]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x868)[0x55d904326e88]
python(_PyFunction_Vectorcall+0x383)[0x55d9042fd553]
python(_PyEval_EvalFrameDefault+0x4b6)[0x55d904326ad6]
python(+0x1a579a)[0x55d9042fc79a]
python(_PyEval_EvalCodeWithName+0x4b)[0x55d9042fd14b]
python(PyEval_EvalCodeEx+0x44)[0x55d9042fd194]
python(PyEval_EvalCode+0x1c)[0x55d9042fd1bc]
python(+0x2525cd)[0x55d9043a95cd]
python(+0x276196)[0x55d9043cd196]
python(+0x120091)[0x55d904277091]
python(PyRun_SimpleFileExFlags+0x1c1)[0x55d9043d3ee1]
python(Py_RunMain+0x398)[0x55d9043d45b8]
python(Py_BytesMain+0x39)[0x55d9043d4729]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f8740a29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f8740a29e40]
python(+0x203667)[0x55d90435a667]

youyuanyi, Nov 01 '23 01:11

I have the same problem.

DailyVy, Mar 19 '24 06:03
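
For anyone landing here: the abort message itself names RDMAV_FORK_SAFE as the variable to set. Below is a minimal, untested sketch of doing that at the top of train_bash.sh. The value 1 is an assumption (the message only names the variable, not a value), and the message also warns that this may slow down memory registration, so treat it as something to try rather than a confirmed fix.

```shell
#!/bin/bash
# Sketch: export the variable named in the libfabric abort message
# before launching training, so that Python and any forked children inherit it.
# Assumption: the value 1 is enough; the message does not specify a value.
export RDMAV_FORK_SAFE=1

# Confirm that a child process actually sees the variable.
bash -c 'echo "RDMAV_FORK_SAFE=${RDMAV_FORK_SAFE:-unset}"'

# Then launch training exactly as in train_bash.sh, for example:
# python scripts/image_train.py --data_dir ../data $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
```

The other route the message mentions is avoiding fork() at the application level. The backtrace shows the fork happening while Python creates an iterator, which might be the data loader starting a worker process, but that is a guess from the trace and not something confirmed in this thread.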