
[BUG]: ZeRO3 offload uses all GPU memory

Open · MikeChenfu opened this issue on Mar 17, 2023 · 0 comments

🐛 Describe the bug

Hello, I am training an OPT model on A100 GPUs. Training uses 76 GB of GPU memory when I use the auto placement mode with gpu_margin_mem_ratio set to 0, while the cpu mode only takes about 15 GB. In my understanding, both modes should use the same amount of GPU memory.
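
This is roughly how I check the per-GPU usage for the two modes (a minimal sketch, not my exact training code; report_gpu_memory is just an illustrative helper):

```python
import torch

def report_gpu_memory(tag: str) -> None:
    """Print current and peak GPU memory of the local device, in GB."""
    dev = torch.cuda.current_device()
    allocated = torch.cuda.memory_allocated(dev) / 1024**3
    reserved = torch.cuda.memory_reserved(dev) / 1024**3
    peak = torch.cuda.max_memory_allocated(dev) / 1024**3
    print(f"[{tag}] allocated={allocated:.1f} GB  reserved={reserved:.1f} GB  peak={peak:.1f} GB")

# Call e.g. report_gpu_memory("after step") inside the training loop
# to compare the auto and cpu placement modes.
```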

I also get connection errors when I run on two nodes with auto mode and a non-zero gpu_margin_mem_ratio such as 0.2. The same configuration works fine on a single node, but the gpu_margin_mem_ratio value does not seem to control GPU memory usage.
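
For reference, the model and optimizer wrapping looks roughly like the sketch below (simplified to a toy model instead of OPT; the GeminiDDP / ZeroOptimizer names follow the Colossal-AI Gemini examples from around this release, so the exact import paths and arguments may differ in 0.2.7):

```python
import torch.nn as nn

import colossalai
from colossalai.nn.optimizer import HybridAdam
from colossalai.nn.parallel import GeminiDDP   # import path may differ across versions
from colossalai.utils import get_current_device
from colossalai.zero import ZeroOptimizer      # import path may differ across versions

colossalai.launch_from_torch(config={})

# Tiny stand-in model; the real script builds OPT with transformers.
model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))

# placement_policy is what I switch between "auto" and "cpu".
model = GeminiDDP(model,
                  device=get_current_device(),
                  placement_policy="auto",   # or "cpu"
                  pin_memory=True)

# As I understand it, gpu_margin_mem_ratio=0.0 should keep optimizer states off the
# GPU when placement_policy="auto"; I tried both 0.0 and 0.2 here.
optimizer = ZeroOptimizer(HybridAdam(model.parameters(), lr=1e-5),
                          model,
                          gpu_margin_mem_ratio=0.0)
```

The connection errors below are from the two-node run with placement_policy="auto" and gpu_margin_mem_ratio=0.2: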

WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'n176-050-078.byted.org_3158_0' has failed to send a keep-alive heartbeat to the rendezvous 'colossalai-default-job' due to an error of type RendezvousConnectionError.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3463 closing signal SIGTERM
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'n176-050-078.byted.org_3158_0' has failed to send a keep-alive heartbeat to the rendezvous 'colossalai-default-job' due to an error of type RendezvousConnectionError.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'n176-050-078.byted.org_3158_0' has failed to send a keep-alive heartbeat to the rendezvous 'colossalai-default-job' due to an error of type RendezvousConnectionError.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'n176-050-078.byted.org_3158_0' has failed to send a keep-alive heartbeat to the rendezvous 'colossalai-default-job' due to an error of type RendezvousConnectionError.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'n176-050-078.byted.org_3158_0' has failed to shutdown the rendezvous 'colossalai-default-job' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/agent/server/api.py", line 881, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1079, in num_nodes_waiting
    self._state_holder.sync()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 408, in sync
    get_response = self._backend.get_state()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 117, in _call_store
    ) from exc
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
[E ProcessGroupNCCL.cpp:737] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807843 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807843 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 31538) of binary: /usr/bin/python3
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 309.7293393611908 seconds
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/agent/server/api.py", line 911, in _exit_barrier
    barrier_timeout=self._exit_barrier_timeout,
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
train.py FAILED
------------------------------------------------------

Environment

#### Installation Report ####

------------ Environment ------------
Colossal-AI version: 0.2.7
PyTorch version: 1.12.1
System CUDA version: 11.3
CUDA version required by PyTorch: 11.3

Note:
1. The table above checks the versions of the libraries/tools in the current environment
2. If the System CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
3. If the CUDA version required by PyTorch is N/A, you probably did not install a CUDA-compatible PyTorch. This value is given by torch.version.cuda, and you can go to https://pytorch.org/get-started/locally/ to download the correct version.

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: x
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:
1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A

Note:
1. The table above checks the version compatibility of the libraries/tools in the current environment
   - PyTorch version match: whether the PyTorch version in the current environment is compatible with the PyTorch version used for AOT compilation
   - System and PyTorch CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version required by PyTorch
   - System and Colossal-AI CUDA version match: whether the CUDA version in the current environment is compatible with the CUDA version used for AOT compilation
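
The report above is the output of `colossalai check -i`.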


MikeChenfu · Mar 17 '23 05:03