ColossalAI
[BUG]: examples/language/gpt/experiments/pipeline_parallel
🐛 Describe the bug
I can't run this example successfully with the default settings. Traceback:

```
Traceback (most recent call last):
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/distributed/rpc/rref_proxy.py", line 11, in _local_invoke
    return getattr(rref.local_value(), func_name)(*args, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/pipeline/rpc/_pipeline_base.py", line 230, in sync_global_worker_rrefs
    self._initialize_partition()
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/pipeline/rpc/_pipeline_base.py", line 185, in _initialize_partition
    self.module_partition: nn.Module = partition_fn(*partition_args).to(device)
  File "/home/guozitao/tools/ColossalAI-main/ColossalAI-main/examples/language/gpt/experiments/pipeline_parallel/train_gpt_pp.py", line 74, in partition
    module = create_partition_module(pp_rank, stage_num, model, data_kwargs)
  File "/home/guozitao/tools/ColossalAI-main/ColossalAI-main/examples/language/gpt/experiments/pipeline_parallel/train_gpt_pp.py", line 61, in create_partition_module
    graph = tracer.trace(root=model, meta_args=meta_args)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/fx/tracer/tracer.py", line 397, in trace
    self.graph = super().trace(root, concrete_args=concrete_args)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/fx/_symbolic_trace.py", line 587, in trace
    self.create_node('output', 'output', (self.create_arg(fn(*args)),), {},
  File "/home/guozitao/tools/ColossalAI-main/ColossalAI-main/examples/language/gpt/experiments/pipeline_parallel/model_zoo.py", line 29, in forward
    return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/fx/_symbolic_trace.py", line 577, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/fx/tracer/tracer.py", line 195, in call_module
    return forward(*args, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/fx/_symbolic_trace.py", line 573, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/fx/_symbolic_trace.py", line 577, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/fx/tracer/tracer.py", line 195, in call_module
    return forward(*args, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/fx/_symbolic_trace.py", line 573, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in forward
    outputs = block(
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/fx/_symbolic_trace.py", line 577, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/fx/tracer/tracer.py", line 195, in call_module
    return forward(*args, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/fx/_symbolic_trace.py", line 573, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 388, in forward
    attn_outputs = self.attn(
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/fx/_symbolic_trace.py", line 577, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/fx/tracer/tracer.py", line 195, in call_module
    return forward(*args, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/fx/_symbolic_trace.py", line 573, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 329, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 184, in _attn
    attn_weights = attn_weights / torch.full(
  File "/home/guozitao/anaconda3/envs/colossalai/lib/python3.9/site-packages/colossalai/fx/tracer/tracer.py", line 511, in wrapper
    return target(*args, **kwargs)
TypeError: full() received an invalid combination of arguments - got (list, ColoProxy, device=ColoAttribute, dtype=ColoAttribute), but expected one of:
 * (tuple of ints size, Number fill_value, *, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (tuple of ints size, Number fill_value, *, tuple of names names, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
```
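For context, the frame at `modeling_gpt2.py` line 184 points at the attention-scaling code that newer transformers releases implement with `torch.full`. Below is a minimal sketch of that pattern (paraphrased; the exact code varies by release). It runs fine eagerly, but under symbolic tracing the `size(-1)` result and the `dtype`/`device` attributes become proxy objects, which matches the `(list, ColoProxy, device=ColoAttribute, dtype=ColoAttribute)` argument combination in the TypeError above:

```python
# Paraphrased from transformers' GPT2Attention._attn in newer releases;
# the exact code differs slightly between versions.
import torch

def scale_attn_weights(attn_weights: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    # Eagerly, value.size(-1) ** 0.5 is a plain float, and attn_weights.dtype /
    # attn_weights.device are real torch.dtype / torch.device objects. During
    # tracing they become ColoProxy / ColoAttribute, and the tracer's eager call
    # to torch.full raises the invalid-argument TypeError in the traceback.
    return attn_weights / torch.full(
        [], value.size(-1) ** 0.5, dtype=attn_weights.dtype, device=attn_weights.device
    )
```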
Environment
torch 1.12.0+cu113
colossalai 0.2.5
May I know your transformers version? We are exploring this feature against a fixed HF transformers version (< 4.25.1). Rarely used torch ops like torch.full are not yet supported in the current auto-parallel tracer. I suspect this bug is caused by a newer transformers version.
We will support as many of these ops as possible before merging into the formal ColossalAI toolchain, but for now we have only implemented basic ops such as matmul and add. If you want to try this feature, you can downgrade your transformers version to < 4.25.1, which is fully tested. Thanks for your feedback.
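A minimal sketch of a pre-flight check for that constraint, assuming the < 4.25.1 pin suggested above (the `packaging` import and the check itself are illustrative, not part of the example script):

```python
# Illustrative version guard, assuming the < 4.25.1 constraint suggested above.
import transformers
from packaging import version

if version.parse(transformers.__version__) >= version.parse("4.25.1"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is untested with this example; "
        "downgrade with: pip install 'transformers<4.25.1'"
    )
```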
We have made many updates since then. This issue was closed due to inactivity. Thanks.