[BUG]: RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16

Open balcklive opened this issue 1 year ago • 0 comments

Is there an existing issue for this bug?

  • [X] I have searched the existing issues

🐛 Describe the bug

[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/WorkPlace/scripts/train_sft.py", line 359, in <module>
[rank0]:     train(args)
[rank0]:   File "/workspace/WorkPlace/scripts/train_sft.py", line 288, in train
[rank0]:     trainer.fit(
[rank0]:   File "/workspace/ColossalAI/applications/ColossalChat/coati/trainer/base.py", line 67, in fit
[rank0]:     self._train(epoch)
[rank0]:   File "/workspace/ColossalAI/applications/ColossalChat/coati/trainer/sft.py", line 133, in _train
[rank0]:     outputs = self.model(
[rank0]:               ^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/colossalai/interface/model.py", line 25, in forward
[rank0]:     return self.module(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
[rank0]:     outputs = self.model(
[rank0]:               ^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1048, in forward
[rank0]:     layer_outputs = self._gradient_checkpointing_func(
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_compile.py", line 31, in inner
[rank0]:     return disable_fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 488, in checkpoint
[rank0]:     ret = function(*args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 773, in forward
[rank0]:     hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank0]:                                                           ^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 663, in forward
[rank0]:     query_states = self.q_proj(hidden_states)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 117, in forward
[rank0]:     return F.linear(input, self.weight, self.bias)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/WorkPlace/scripts/train_sft.py", line 359, in <module>
[rank1]:     train(args)
[rank1]:   File "/workspace/WorkPlace/scripts/train_sft.py", line 288, in train
[rank1]:     trainer.fit(
[rank1]:   File "/workspace/ColossalAI/applications/ColossalChat/coati/trainer/base.py", line 67, in fit
[rank1]:     self._train(epoch)
[rank1]:   File "/workspace/ColossalAI/applications/ColossalChat/coati/trainer/sft.py", line 133, in _train
[rank1]:     outputs = self.model(
[rank1]:               ^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/colossalai/interface/model.py", line 25, in forward
[rank1]:     return self.module(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
[rank1]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank1]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
[rank1]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
[rank1]:     outputs = self.model(
[rank1]:               ^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1048, in forward
[rank1]:     layer_outputs = self._gradient_checkpointing_func(
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/_compile.py", line 31, in inner
[rank1]:     return disable_fn(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank1]:     return fn(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 488, in checkpoint
[rank1]:     ret = function(*args, **kwargs)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 773, in forward
[rank1]:     hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank1]:                                                           ^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 663, in forward
[rank1]:     query_states = self.q_proj(hidden_states)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 117, in forward
[rank1]:     return F.linear(input, self.weight, self.bias)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16

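The last frame of the traceback is `F.linear` inside `q_proj`, which suggests that float32 hidden states are being multiplied by a bfloat16 weight. The snippet below is only a minimal sketch of that failure mode in plain PyTorch on CPU. It does not use ColossalAI, LoRA, or Qwen2, and it does not identify where the float32 tensor originates in the training script. The helper names `dtype_summary` and `try_forward` are hypothetical and exist only for this illustration.

```python
from collections import Counter

import torch
from torch import nn


def dtype_summary(module: nn.Module) -> Counter:
    """Count parameters by dtype, to spot a module that mixes precisions."""
    return Counter(p.dtype for p in module.parameters())


def try_forward(layer: nn.Module, x: torch.Tensor):
    """Run the layer; return (output, None) on success or (None, message) on RuntimeError."""
    try:
        return layer(x), None
    except RuntimeError as err:
        return None, str(err)


torch.manual_seed(0)

# Assumption: a small Linear cast to bfloat16 stands in for the real q_proj.
q_proj = nn.Linear(8, 8).to(torch.bfloat16)
# Assumption: float32 activations stand in for the hidden states reaching q_proj.
hidden_states = torch.randn(2, 8)

print("parameter dtypes:", dtype_summary(q_proj))

# float32 input against bfloat16 weights: the matmul rejects the mixed dtypes.
out, err = try_forward(q_proj, hidden_states)
print("mismatched dtypes ->", err)

# Casting the input to the weight dtype lets the same call go through.
out, err = try_forward(q_proj, hidden_states.to(q_proj.weight.dtype))
print("after cast ->", out.dtype, err)
```

Running something like `dtype_summary` on the wrapped model just before `trainer.fit` may help show whether the base weights and any adapter or embedding parameters ended up in different dtypes.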
Environment

Ubuntu 20.04
Script: ColossalAI/applications/ColossalChat/examples/training_scripts/train_sft.sh
Training command:
colossalai run --nproc_per_node $GPU_COUNT --master_port 31312 --hostfile $WORKSPACE_DIR/hostfile $WORKSPACE_DIR/scripts/train_sft.py \
    --pretrain $PRETRAINED_MODEL_PATH \
    --tokenizer_dir $PRETRAINED_TOKENIZER_PATH \
    --save_interval 2000 \
    --dataset ${dataset[@]} \
    --plugin ddp \
    --batch_size 1 \
    --max_epochs 1 \
    --accumulation_steps 1 \
    --lr 5e-5 \
    --max_len 4096 \
    --grad_checkpoint \
    --save_path $SAVE_DIR \
    --config_file $CONFIG_FILE \
    --log_dir $LOG_DIR \
    --lora_config $WORKSPACE_DIR/chat_template/lora_conf.json

balcklive, Dec 25 '24 02:12