Error at `byteps.shutdown`

Open eric-haibin-lin opened this issue 4 years ago • 3 comments

Describe the bug

test.py

import byteps.torch; byteps.torch.init()
import time; time.sleep(10)
byteps.torch.shutdown()

export NVIDIA_VISIBLE_DEVICES=0
export DMLC_NUM_WORKER=1
export DMLC_WORKER_ID=0
export DMLC_ROLE=worker
export BYTEPS_LOG_LEVEL=DEBUG
python3 /usr/local/byteps/launcher/launch.py python3 test.py

Error log

BytePS launching worker
[2019-12-12 23:49:07.106117: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-12-12 23:49:07.106239: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_0
[2019-12-12 23:49:07.106283: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_0
[2019-12-12 23:49:07.106346: D byteps/common/communicator.cc:121] This is ROOT device, rank=0, all sockets create successfully
[2019-12-12 23:49:07.106356: D byteps/common/global.cc:118] Partition bound set to 4096000 bytes, aligned to 4096000 bytes
[2019-12-12 23:49:07.106364: D byteps/common/global.cc:150] Number of worker=1, launching non-distributed job
[2019-12-12 23:49:07.106403: D byteps/common/communicator.cc:164] Listening on socket 0
[2019-12-12 23:49:07.595886: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2019-12-12 23:49:07.595923: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 1
[2019-12-12 23:49:07.595930: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2019-12-12 23:49:07.596012: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_send_nccl0
[2019-12-12 23:49:07.596045: D byteps/common/communicator.cc:157] Init socket at /tmp/socket_recv_nccl0
[2019-12-12 23:49:07.596084: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=0, all sockets create successfully
[2019-12-12 23:49:07.596092: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0
[2019-12-12 23:49:07.596130: D byteps/common/communicator.cc:164] Listening on socket 0
[2019-12-12 23:49:07.827777: D byteps/common/nccl_manager.cc:104] root nccl_id is 144134618889977858
[2019-12-12 23:49:08.266185: D byteps/common/global.cc:225] Create schedule queue 0
[2019-12-12 23:49:08.266208: D byteps/common/global.cc:225] Create schedule queue 1
[2019-12-12 23:49:08.266215: D byteps/common/global.cc:225] Create schedule queue 2
[2019-12-12 23:49:08.266223: D byteps/common/global.cc:225] Create schedule queue 3
[2019-12-12 23:49:08.266229: D byteps/common/global.cc:225] Create schedule queue 4
[2019-12-12 23:49:08.266236: D byteps/common/global.cc:225] Create schedule queue 5
[2019-12-12 23:49:08.266242: D byteps/common/global.cc:225] Create schedule queue 6
[2019-12-12 23:49:08.266249: D byteps/common/global.cc:225] Create schedule queue 7
[2019-12-12 23:49:08.266255: D byteps/common/global.cc:225] Create schedule queue 8
[2019-12-12 23:49:08.266261: D byteps/common/global.cc:225] Create schedule queue 9
[2019-12-12 23:49:08.266268: D byteps/common/global.cc:233] Inited rank=0 local_rank=0 size=1 local_size=1 worker_id=0
[2019-12-12 23:49:08.266350: D byteps/common/global.cc:265] Started 2 background threads. rank=0
[2019-12-12 23:49:18.276738: D byteps/common/global.cc:281] Shutdown BytePS: start to clean the resources (rank=0)
[2019-12-12 23:49:18.276960: D byteps/common/shared_memory.h:45] Clear shared memory: all BytePS shared memory released/unregistered.
[2019-12-12 23:49:19.103782: D byteps/common/communicator.cc:203] listen thread joined (rank=0)
[2019-12-12 23:49:19.595785: D byteps/common/communicator.cc:203] listen thread joined (rank=0)
[2019-12-12 23:49:19.595845: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=0)
[2019-12-12 23:49:19.595881: D byteps/common/communicator.h:108] Clear BytePSCommSocket (rank=0)
[2019-12-12 23:49:19.595890: D byteps/common/nccl_manager.h:63] Clear NcclManager
[2019-12-12 23:49:19.595899: D byteps/common/global.cc:334] Shutdown BytePS: all BytePS resources has been cleaned (rank=0)
[2019-12-12 23:49:19.595907: D byteps/common/operations.cc:80] BytePS has been completely shutdown now
[2019-12-12 23:49:19.596112: D byteps/common/global.cc:281] Shutdown BytePS: start to clean the resources (rank=0)
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
Aborted (core dumped)
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/byteps/launcher/launch.py", line 47, in worker
    subprocess.check_call(command, env=my_env, stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 test.py' returned non-zero exit status 134

Environment (please complete the following information):

  • OS: Linux
  • GCC version: 4.9
  • CUDA and NCCL version:
  • Framework (TF, PyTorch, MXNet): PyTorch and MXNet

eric-haibin-lin avatar Dec 12 '19 23:12 eric-haibin-lin

This error is caused by the user calling shutdown explicitly. BytePS uses the atexit module in BytePSBasics (the wrapper class for the basic BytePS API). The atexit module defines functions to register and unregister cleanup functions. See BytePSBasics.__init__() in byteps/common/__init__.py for the usage.

Therefore, the user does not need to call bps.shutdown(). (If the user does, the resources are cleaned up twice, and the user gets blocked on the second cleanup.)
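
To illustrate the double cleanup, here is a minimal sketch in plain Python. It is not the BytePS code: FakeBasics and cleanup_count are hypothetical names, and the only real API used is the standard atexit module. The wrapper registers its own shutdown with atexit, so an explicit call from user code means the same cleanup runs again when the interpreter exits.

```python
import atexit


class FakeBasics:
    """Hypothetical stand-in for a wrapper that registers its own cleanup."""

    def __init__(self):
        self.cleanup_count = 0
        # The wrapper asks the interpreter to call shutdown() at exit.
        atexit.register(self.shutdown)

    def shutdown(self):
        # A real library would release sockets, threads, etc. here.
        self.cleanup_count += 1
        print("cleanup run number", self.cleanup_count)


basics = FakeBasics()

# Explicit call from user code: first cleanup.
basics.shutdown()

# Nothing else is needed to trigger the second cleanup: when the script
# ends, atexit invokes basics.shutdown() again on already-released state.
```

In this toy version the second run is harmless because it only increments a counter. In a native library the second run may touch resources that were already released.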

ChaokunChang avatar Jan 06 '20 10:01 ChaokunChang

@bobzhuyb @ymjiang maybe shutdown should not be exposed as user facing public APIs?

eric-haibin-lin avatar Jan 15 '20 19:01 eric-haibin-lin

@eric-haibin-lin I prefer to make shutdown more robust, so that calling it a second time has no effect. That would match init, which can be called multiple times but actually runs only once. We will see whether it's straightforward to do.
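
A rough sketch of that idea in plain Python, assuming a simple guarded flag is enough. Runtime, _initialized and release_count are hypothetical names, and this is not how BytePS implements it, only an illustration of init and shutdown that can each be called repeatedly but do their work once.

```python
import threading


class Runtime:
    """Hypothetical runtime whose init() and shutdown() are idempotent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._initialized = False
        self.init_count = 0      # how many times resources were acquired
        self.release_count = 0   # how many times resources were released

    def init(self):
        with self._lock:
            if self._initialized:
                return  # already running: repeated init is a no-op
            self._initialized = True
            self.init_count += 1

    def shutdown(self):
        with self._lock:
            if not self._initialized:
                return  # never started or already shut down: no-op
            self._initialized = False
            self.release_count += 1


rt = Runtime()
rt.init()
rt.init()      # ignored
rt.shutdown()
rt.shutdown()  # ignored, so a later atexit hook would also be safe
```

The lock only matters if shutdown can be reached from more than one thread, for example from user code and from an exit hook at the same time.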

bobzhuyb avatar Jan 15 '20 19:01 bobzhuyb