[BUG]: Failed to run example
🐛 Describe the bug
Hi team, I installed ColossalAI and wanted to run through the OPT example in examples/language/opt, but when I ran bash ./run_gemini.sh it failed with:
NotImplementedError: python bindings to nullptr storage (e.g., from torch.Tensor._make_wrapper_subclass) are currently unsafe and thus disabled.
I'm not sure whether this is related to TensorNVMe, but I do have it installed.
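For what it's worth, the same NotImplementedError can be reproduced outside ColossalAI with a plain wrapper-subclass tensor. This is only my own minimal sketch (assuming torch 1.13 behaviour), not code from the example:

```python
import torch

class NoStorageTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, size):
        # _make_wrapper_subclass creates a tensor with metadata only;
        # the underlying storage pointer is null.
        return torch.Tensor._make_wrapper_subclass(cls, size, dtype=torch.float32)

t = NoStorageTensor((4,))
try:
    t.storage()  # same kind of call as param.storage() in the traceback below
except NotImplementedError as e:
    print(e)     # "python bindings to nullptr storage ... are currently unsafe and thus disabled"
```

The error message suggests the Gemini-managed parameter behaves like such a storage-less tensor at the point where the optimizer touches it.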
The full output is below:
+ export BS=16
+ BS=16
+ export MEMCAP=0
+ MEMCAP=0
+ export MODEL=125m
+ MODEL=125m
+ export GPUNUM=1
+ GPUNUM=1
+ mkdir -p ./logs
+ export MODLE_PATH=facebook/opt-125m
+ MODLE_PATH=facebook/opt-125m
+ torchrun --nproc_per_node 1 --master_port 19198 train_gemini_opt.py --mem_cap 0 --model_name_or_path facebook/opt-125m --batch_size 16
+ tee ./logs/colo_125m_bs_16_cap_0_gpu_1.log
/root/anaconda3/lib/python3.9/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
registered at aten/src/ATen/RegisterSchema.cpp:6
dispatch key: Meta
previous kernel: registered at ../aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
new kernel: registered at /dev/null:228 (Triggered internally at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
self.m.impl(name, dispatch_key, fn)
[02/18/23 20:24:27] INFO colossalai - colossalai - INFO:
/root/anaconda3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[02/18/23 20:24:28] INFO colossalai - colossalai - INFO:
/root/anaconda3/lib/python3.9/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /root/anaconda3/lib/python3.9/site-packages/colossalai/initialize.py:116
launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline
parallel size: 1, tensor parallel size: 1
[02/18/23 20:24:29] INFO colossalai - colossalai - INFO: /root/project/exp/ColossalAI/examples/language/opt/train_gemini_opt.py:155
main
INFO colossalai - colossalai - INFO: Model config has been created
INFO colossalai - colossalai - INFO: /root/project/exp/ColossalAI/examples/language/opt/train_gemini_opt.py:170
main
INFO colossalai - colossalai - INFO: Finetune a pre-trained model
searching chunk configuration is completed in 0.22 s.
used number: 119.44 MB, wasted number: 1.50 MB
total wasted percentage is 1.24%
=========================================================================================
No pre-built kernel is found, build and load the cpu_adam kernel during runtime now
=========================================================================================
Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.13_cu11.7/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.13709568977355957 seconds
=========================================================================================
No pre-built kernel is found, build and load the fused_optim kernel during runtime now
=========================================================================================
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.13_cu11.7/build.ninja...
Building extension module fused_optim...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_optim...
Time to load fused_optim op: 0.12455606460571289 seconds
Traceback (most recent call last):
File "/root/project/exp/ColossalAI/examples/language/opt/train_gemini_opt.py", line 211, in <module>
main()
File "/root/project/exp/ColossalAI/examples/language/opt/train_gemini_opt.py", line 199, in main
optimizer.step()
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/nn/optimizer/zero_optimizer.py", line 228, in step
ret = self.optim.step(div_scale=combined_scale, *args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 115, in step
self._post_state_init(p)
File "/root/anaconda3/lib/python3.9/site-packages/colossalai/nn/optimizer/nvme_optimizer.py", line 59, in _post_state_init
numel = param.storage().size()
File "/root/anaconda3/lib/python3.9/site-packages/torch/_tensor.py", line 260, in storage
return torch.TypedStorage(wrap_storage=self._storage(), dtype=self.dtype)
NotImplementedError: python bindings to nullptr storage (e.g., from torch.Tensor._make_wrapper_subclass) are currently unsafe and thus disabled. See https://github.com/pytorch/pytorch/issues/61669 for more details
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 29179) of binary: /root/anaconda3/bin/python
Traceback (most recent call last):
File "/root/anaconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_gemini_opt.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-02-18_20:24:40
host : Artorias.localdomain
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 29179)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
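For context, the line that actually fails is numel = param.storage().size() in nvme_optimizer.py. That call needs a real backing storage, whereas param.numel() is derived from the shape alone; the two are not interchangeable in general, since a view's storage can be larger than the view itself. A small illustration on a plain tensor (my own example, not ColossalAI code):

```python
import torch

full = torch.empty(8, 8)
view = full[:4]                # a view that shares full's storage
print(view.numel())            # 32 -> element count derived from the view's shape
print(view.storage().size())   # 64 -> size of the whole underlying storage
# The second form is what nvme_optimizer.py uses, and it is exactly the call
# that raises NotImplementedError when the parameter has no real storage.
```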
Environment
- WSL on Windows 10 (Ubuntu 20.04)
- Single GPU RTX 3060
- Torch 1.13.1
- CUDA 11.7
- Python 3.9.13
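If it helps, these versions can be double-checked from Python as follows (just an illustrative snippet, not part of the example scripts):

```python
import torch
import colossalai

print(torch.__version__)               # reported above as 1.13.1
print(torch.version.cuda)              # reported above as 11.7
print(torch.cuda.get_device_name(0))   # reported above as an RTX 3060
# The installed ColossalAI build, if the attribute is present in this release.
print(getattr(colossalai, "__version__", "unknown"))
```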
I got the same error, and my environment is also the same.
Hi @SilenceGTX @zhangsanfeng86, we have updated a lot. Could you please try the latest code? https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples

This issue was closed due to inactivity. Thanks.