Error at `byteps.shutdown`
Describe the bug
test.py
import byteps.torch; byteps.torch.init();
import time; time.sleep(10);
byteps.torch.shutdown();
export NVIDIA_VISIBLE_DEVICES=0;
export DMLC_NUM_WORKER=1;
export DMLC_WORKER_ID=0;
export DMLC_ROLE=worker;
export BYTEPS_LOG_LEVEL=DEBUG;
python3 /usr/local/byteps/launcher/launch.py python3 test.py
Error log
BytePS launching worker
[2019-12-12 23:49:07.106117: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-12-12 23:49:07.106239: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_0
[2019-12-12 23:49:07.106283: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_0
[2019-12-12 23:49:07.106346: D byteps/common/communicator.cc:121] This is ROOT device, rank=0, all sockets create successfully
[2019-12-12 23:49:07.106356: D byteps/common/global.cc:118] Partition bound set to 4096000 bytes, aligned to 4096000 bytes
[2019-12-12 23:49:07.106364: D byteps/common/global.cc:150] Number of worker=1, launching non-distributed job
[2019-12-12 23:49:07.106403: D byteps/common/communicator.cc:164] Listening on socket 0
[2019-12-12 23:49:07.595886: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2019-12-12 23:49:07.595923: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 1
[2019-12-12 23:49:07.595930: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2019-12-12 23:49:07.596012: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl0
[2019-12-12 23:49:07.596045: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl0
[2019-12-12 23:49:07.596084: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=0, all sockets create successfully
[2019-12-12 23:49:07.596092: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0
[2019-12-12 23:49:07.596130: D byteps/common/communicator.cc:164] Listening on socket 0
[2019-12-12 23:49:07.827777: D byteps/common/nccl_manager.cc:104] root nccl_id is 144134618889977858
[2019-12-12 23:49:08.266185: D byteps/common/global.cc:225] Create schedule queue 0
[2019-12-12 23:49:08.266208: D byteps/common/global.cc:225] Create schedule queue 1
[2019-12-12 23:49:08.266215: D byteps/common/global.cc:225] Create schedule queue 2
[2019-12-12 23:49:08.266223: D byteps/common/global.cc:225] Create schedule queue 3
[2019-12-12 23:49:08.266229: D byteps/common/global.cc:225] Create schedule queue 4
[2019-12-12 23:49:08.266236: D byteps/common/global.cc:225] Create schedule queue 5
[2019-12-12 23:49:08.266242: D byteps/common/global.cc:225] Create schedule queue 6
[2019-12-12 23:49:08.266249: D byteps/common/global.cc:225] Create schedule queue 7
[2019-12-12 23:49:08.266255: D byteps/common/global.cc:225] Create schedule queue 8
[2019-12-12 23:49:08.266261: D byteps/common/global.cc:225] Create schedule queue 9
[2019-12-12 23:49:08.266268: D byteps/common/global.cc:233] Inited rank=0 local_rank=0 size=1 local_size=1 worker_id=0
[2019-12-12 23:49:08.266350: D byteps/common/global.cc:265] Started 2 background threads. rank=0
[2019-12-12 23:49:18.276738: D byteps/common/global.cc:281] Shutdown BytePS: start to clean the resources (rank=0)
[2019-12-12 23:49:18.276960: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2019-12-12 23:49:19.103782: D byteps/common/communicator.cc:203] listen thread joined (rank=0)
[2019-12-12 23:49:19.595785: D byteps/common/communicator.cc:203] listen thread joined (rank=0)
[2019-12-12 23:49:19.595845: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=0)
[2019-12-12 23:49:19.595881: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=0)
[2019-12-12 23:49:19.595890: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2019-12-12 23:49:19.595899: D byteps/common/global.cc:334] Shutdown BytePS: all BytePS resources has been cleaned (rank=0)
[2019-12-12 23:49:19.595907: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
[2019-12-12 23:49:19.596112: D byteps/common/global.cc:281] Shutdown BytePS: start to clean the resources (rank=0)
terminate called after throwing an instance of 'std::system_error'
what(): Invalid argument
Aborted (core dumped)
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/byteps/launcher/launch.py", line 47, in worker
subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 test.py' returned non-zero exit status 134
Environment (please complete the following information):
- OS: Linux
- GCC version: 4.9
- CUDA and NCCL version:
- Framework (TF, PyTorch, MXNet): PyTorch and MXNet
Additional context
This error is caused by how the API is used. BytePS uses the atexit module in BytePSBasics (the wrapper class for the basic BytePS API). The atexit module defines functions to register and unregister cleanup functions. See BytePSBasics.init() in byteps/common/__init__.py for the usage.
Therefore, the user does not need to call bps.shutdown(). If they do, the resources are cleaned up twice, and the user is blocked during the second cleanup.
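To illustrate the double cleanup described above, here is a minimal sketch that does not use BytePS at all. `FakeBasics` is a hypothetical stand-in class, and its `init`/`shutdown` only mimic the registration pattern; the real BytePS code may differ in detail.

```python
import atexit


class FakeBasics:
    """Hypothetical stand-in for a wrapper class; NOT the BytePS code."""

    def __init__(self):
        self.shutdown_calls = 0

    def init(self):
        # Register cleanup so it runs automatically at interpreter exit.
        atexit.register(self.shutdown)

    def shutdown(self):
        # A real implementation would release sockets, threads, etc. here.
        self.shutdown_calls += 1
        print("shutdown call #%d" % self.shutdown_calls)


bps = FakeBasics()
bps.init()
bps.shutdown()  # explicit call by the user: first cleanup
# When the interpreter exits, atexit invokes bps.shutdown() again,
# so the cleanup code runs a second time.
```

Running this as a script prints one line for the explicit call and another at interpreter exit, which matches the two "start to clean the resources" entries in the log.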
@bobzhuyb @ymjiang maybe `shutdown` should not be exposed as a user-facing public API?
@eric-haibin-lin I prefer to make `shutdown` more robust so that calling it a second time has no effect, just like `init`, which can be called multiple times but in fact runs only once. We will see whether it's straightforward to do.
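One way the idempotent behaviour proposed above could look is sketched below. This is an assumption-laden illustration, not the BytePS implementation: `GuardedBasics`, `_initialized`, and `cleanup_runs` are hypothetical names, and the actual fix may live in the C++ core instead of the Python wrapper.

```python
import atexit
import threading


class GuardedBasics:
    """Hypothetical wrapper whose init/shutdown are safe to call repeatedly."""

    def __init__(self):
        self._lock = threading.Lock()
        self._initialized = False
        self.cleanup_runs = 0  # counts how often real cleanup work ran

    def init(self):
        with self._lock:
            if self._initialized:
                return  # repeated init is a no-op
            self._initialized = True
        # Registered only once, because of the guard above.
        atexit.register(self.shutdown)

    def shutdown(self):
        with self._lock:
            if not self._initialized:
                return  # already shut down, or never initialized
            self._initialized = False
        # Placeholder for releasing real resources exactly once.
        self.cleanup_runs += 1


bps = GuardedBasics()
bps.init()
bps.init()      # no effect
bps.shutdown()  # performs the cleanup
bps.shutdown()  # no effect; the atexit call at exit is a no-op as well
```

With a guard like this, an explicit `shutdown()` followed by the atexit-triggered one would clean up only once.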