
Error because of `all_reduce` on `float` instead of `torch.Tensor`

Open · ain-soph opened this issue 9 months ago · 0 comments

When running the llama2 fine-tuning from the tier-1 notebook with multiple GPUs, execution reaches the following line: https://github.com/LLM-Tuning-Safety/LLMs-Finetuning-Safety/blob/8a3b38f11be1c3829e2b0ed379d3661ebc84e7db/llama2/utils/train_utils.py#L127

`total_loss` turns out to be a `float` instead of a `torch.Tensor` because of L89 and L102: https://github.com/LLM-Tuning-Safety/LLMs-Finetuning-Safety/blob/8a3b38f11be1c3829e2b0ed379d3661ebc84e7db/llama2/utils/train_utils.py#L89 https://github.com/LLM-Tuning-Safety/LLMs-Finetuning-Safety/blob/8a3b38f11be1c3829e2b0ed379d3661ebc84e7db/llama2/utils/train_utils.py#L102
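
For reference, the failure mode boils down to the minimal sketch below (a paraphrase of the pattern described above, not the repo's exact lines; `loss` and the loop are hypothetical stand-ins):

```python
# Minimal repro of the failure mode. Run with:
#   torchrun --nproc_per_node=2 repro.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # gloo so this also repros on CPU

total_loss = 0.0                    # starts life as a plain Python float
for step in range(3):
    loss = torch.rand(())           # stand-in for the per-step training loss
    total_loss += loss.item()       # .item() keeps total_loss a float

# Raises: TypeError: Invalid function argument. Expected parameter `tensor`
# of type torch.Tensor but got <class 'float'> instead.
dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
```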

This leads to the following `TypeError`. Log:

```
[rank0]:   File "/home/user/workspace/LLMs-Finetuning-Safety/llama2/finetuning.py", line 265, in <module>
[rank0]:     fire.Fire(main)
[rank0]:   File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/workspace/LLMs-Finetuning-Safety/llama2/finetuning.py", line 248, in main
[rank0]:     results = train(
[rank0]:               ^^^^^^
[rank0]:   File "/home/user/workspace/LLMs-Finetuning-Safety/llama2/utils/train_utils.py", line 127, in train
[rank0]:     dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
[rank0]:   File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2195, in all_reduce
[rank0]:     _check_single_tensor(tensor, "tensor")
[rank0]:   File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 863, in _check_single_tensor
[rank0]:     raise TypeError(
[rank0]: TypeError: Invalid function argument. Expected parameter `tensor` of type torch.Tensor
[rank0]:              but got <class 'float'> instead.
```
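
A possible workaround (just a sketch, not a tested patch against the repo) is to wrap the accumulated float in a tensor on the local CUDA device before the collective and unwrap it afterwards; `local_rank` here is assumed to be available in `train`:

```python
import torch
import torch.distributed as dist

# Sketch of a fix: promote the Python float to a CUDA tensor so that
# dist.all_reduce (which requires a torch.Tensor, and a CUDA tensor under
# NCCL) accepts it, then convert the reduced value back to a float.
loss_tensor = torch.tensor(total_loss, device=torch.device(f"cuda:{local_rank}"))
dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)
total_loss = loss_tensor.item()
```

Alternatively, keeping `total_loss` a tensor throughout the loop (i.e. accumulating `loss.detach().float()` without any `.item()`-style conversion at L89/L102) would avoid the conversion at the collective entirely.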

ain-soph · May 05 '24 23:05