segmentation fault while launching the worker

Open: xuexiaxie opened this issue 1 year ago • 1 comment

When I started the worker for distributed training with the environment configuration shown further down, it crashed with a segmentation fault. The output is as follows:

BytePS launching worker
enable NUMA finetune...
Command: numactl --physcpubind 0-5,24-29 python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 10

[20:24:40] src/postoffice.cc:63: Creating Van: zmq. group_size=1
[20:24:40] src/./zmq_van.h:66: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[20:24:40] src/./zmq_van.h:71: BYTEPS_ZMQ_NTHREADS set to 4
[[20:24:40] src/van.cc:581: Bind to [role=worker, ip=192.168.108.230, port=56583, is_recovery=0, aux_id=-1, num_ports=1]20:24:40] src/./zmq_van.h:351: Start ZMQ recv thread
[20:24:40] src/./zmq_van.h:159: Zmq connecting to node [role=scheduler, id=1, ip=192.168.108.228, port=1234, is_recovery=0, aux_id=-1, num_ports=1]. My node is [role=worker, ip=192.168.108.230, port=56583, is_recovery=0, aux_id=-1, num_ports=1]
[20:24:40] src/van.cc:673: zeromq 32767 sent: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ [role=worker, ip=192.168.108.230, port=56583, is_recovery=0, aux_id=-1, num_ports=1] } }. NOT DATA MSG!
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "/usr/local/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 658, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 281, in <module>
    launch_bps()
  File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 267, in launch_bps
    join_threads(t)
  File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 230, in join_threads
    threads[idx].join()
  File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 40, in join
    raise self.exc
  File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 31, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/dist-packages/byteps-0.2.5-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 199, in worker
    stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'numactl --physcpubind 0-5,24-29 python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 10' returned non-zero exit status 139.
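
For readers unfamiliar with the exit status in the last line: because bpslaunch runs the command through a shell (shell=True in the traceback), a status above 128 conventionally means "terminated by signal (status - 128)". A minimal sketch of that decoding follows; the helper name describe_exit_status is hypothetical and not part of BytePS.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Translate a shell-style exit status into a signal name when it encodes one."""
    # POSIX shells report "killed by signal N" as exit status 128 + N.
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            return f"exit code {status} (no matching signal)"
    return f"exit code {status}"


# 139 - 128 = 11, which is SIGSEGV on Linux.
print(describe_exit_status(139))
```

So the Python traceback only reports that the child process died from a segmentation fault; the crash itself happened in native code inside the benchmark process, not in bpslaunch.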

My environment variables and command:

export DMLC_ROLE=worker
export DMLC_PS_ROOT_URI=192.168.108.228
export DMLC_PS_ROOT_PORT=1234
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=1
export DMLC_INTERFACE=eno1
export NVIDIA_VISIBLE_DEVICES=2
export BYTEPS_FORCE_DISTRIBUTED=1
export PS_VERBOSE=2
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model vgg16 --num-iters 10

xuexiaxie, May 13 '23 11:05

Check what the envs and command look like on all of your other nodes.

yxwdsb, May 15 '23 09:05
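
One way to carry out the check suggested in the comment above is to collect the environment from each node and compare the settings that are expected to agree. The sketch below is only an illustration under two assumptions: that DMLC_PS_ROOT_URI, DMLC_PS_ROOT_PORT and DMLC_NUM_WORKER should be identical on every node, and that the scheduler values shown are made up for the example (only the worker values come from this issue). The function name find_mismatches is hypothetical.

```python
from typing import Dict, Optional

# Assumption: these settings should be the same on every node of the job.
SHARED_KEYS = ("DMLC_PS_ROOT_URI", "DMLC_PS_ROOT_PORT", "DMLC_NUM_WORKER")


def find_mismatches(
    node_envs: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Return, per shared key, the value seen on each node when the nodes disagree."""
    mismatches = {}
    for key in SHARED_KEYS:
        # None marks a node where the variable is not set at all.
        values = {node: env.get(key) for node, env in node_envs.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches


nodes = {
    # Values taken from the worker environment posted in this issue.
    "worker-0": {
        "DMLC_ROLE": "worker",
        "DMLC_PS_ROOT_URI": "192.168.108.228",
        "DMLC_PS_ROOT_PORT": "1234",
        "DMLC_NUM_WORKER": "1",
    },
    # Hypothetical scheduler environment, invented to show a mismatch.
    "scheduler": {
        "DMLC_ROLE": "scheduler",
        "DMLC_PS_ROOT_URI": "192.168.108.228",
        "DMLC_PS_ROOT_PORT": "1235",
        "DMLC_NUM_WORKER": "1",
    },
}

for key, values in find_mismatches(nodes).items():
    print(f"{key} differs across nodes: {values}")
```

In practice the per-node dictionaries could be filled from the output of env on each machine; an empty result only means these particular settings agree, not that the configuration is complete.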