🐛 Describe the bug
Simply running examples/language/opt/run_clm.py will reproduce the error.
The program crashed without printing any error information.
After I replaced placement_policy with 'cuda', it works fine.
```python
placement_policy = 'cuda'
chunk_manager = ChunkManager(chunk_size,
                             process_group=pg,
                             enable_distributed_storage=True,
                             init_device=GeminiManager.get_default_device(placement_policy))
gemini_manager = GeminiManager(placement_policy, chunk_manager)
model = ZeroDDP(model, gemini_manager)
logger.info(f'{model.__class__.__name__} has been created', ranks=[0])
```
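For anyone trying to reproduce this outside the script, here is a minimal sketch of how the snippet above is typically assembled, with the working 'cuda' policy selected. The import paths, the ProcessGroup() construction, the launch_from_torch call, and the chunk_size value are my assumptions based on colossalai 0.1.8 and may differ from what run_clm.py actually does; `model` and `logger` come from earlier in the script.

```python
import colossalai
from colossalai.gemini import ChunkManager, GeminiManager   # assumed import path for 0.1.8
from colossalai.nn.parallel import ZeroDDP                  # assumed import path for 0.1.8
from colossalai.tensor import ProcessGroup                  # assumed import path for 0.1.8

colossalai.launch_from_torch(config={})   # torchrun supplies the rendezvous env vars

pg = ProcessGroup()                       # default group spanning all ranks (assumption)
chunk_size = 32 * 1024 ** 2               # illustrative value, not taken from the script

# 'cuda' works for the reporter; 'cpu' reproduces the crash shown below.
placement_policy = 'cuda'
chunk_manager = ChunkManager(chunk_size,
                             process_group=pg,
                             enable_distributed_storage=True,
                             init_device=GeminiManager.get_default_device(placement_policy))
gemini_manager = GeminiManager(placement_policy, chunk_manager)
model = ZeroDDP(model, gemini_manager)    # `model`: the HF OPT model built earlier in run_clm.py
```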
Environment
colossalai 0.1.8+torch1.12cu11.3
I also tried placement_policy = 'cpu'; it also crashed. The error stack is listed below:
```
  0%| | 0/444 [00:00<?, ?it/s]use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
Traceback (most recent call last):
  File "run_clm.py", line 575, in <module>
    main()
  File "run_clm.py", line 528, in main
    optimizer.backward(loss)
  File "/home/lcfjr/codes/ColossalAI/colossalai/zero/zero_optimizer.py", line 151, in backward
    self.module.backward(loss)
  File "/home/lcfjr/codes/ColossalAI/colossalai/nn/parallel/data_parallel.py", line 246, in backward
    loss.backward()
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/_tensor.py", line 388, in backward
    return handle_torch_function(
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/overrides.py", line 1498, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/home/lcfjr/codes/ColossalAI/colossalai/tensor/colo_tensor.py", line 171, in __torch_function__
    ret = func(*args, **kwargs)
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 130, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 674, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 315, in forward
    hidden_states = self.self_attn_layer_norm(hidden_states)
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return F.layer_norm(
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
  0%| | 0/444 [00:06<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2895986) of binary: /home/lcfjr/miniconda3/envs/dev/bin/python3
Traceback (most recent call last):
  File "/home/lcfjr/miniconda3/envs/dev/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
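For what it's worth, that RuntimeError is the generic message PyTorch raises when an op reads a tensor whose backing storage has been freed or not yet (re)allocated, which is the kind of thing Gemini's chunk manager does when it offloads or releases chunks. It suggests the layer-norm weight's chunk was not materialized on the compute device when the checkpointed block was recomputed during backward. Below is a minimal, ColossalAI-free sketch that reproduces the same message on torch 1.12; the tensor names and shapes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn.functional as F

w = torch.randn(8)        # stands in for the LayerNorm weight
b = torch.randn(8)
w.storage().resize_(0)    # free the backing storage, like an offloaded / unmaterialized chunk

x = torch.randn(2, 8)
F.layer_norm(x, (8,), weight=w, bias=b, eps=1e-5)
# RuntimeError: The tensor has a non-zero number of elements, but its data
# is not allocated yet. Caffe2 uses a lazy allocation, so you will need to
# call mutable_data() or raw_mutable_data() to actually allocate memory.
```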
I encountered the same problem. Is there a solution?

> After I replaced placement_policy with 'cuda', it works fine.

I got the same error; it was fixed after these changes.