LLMs-Finetuning-Safety
Error because of `all_reduce` on `float` instead of `torch.Tensor`
When fine-tuning Llama 2 from the tier-1 notebook on multiple GPUs, execution reaches the following line: https://github.com/LLM-Tuning-Safety/LLMs-Finetuning-Safety/blob/8a3b38f11be1c3829e2b0ed379d3661ebc84e7db/llama2/utils/train_utils.py#L127
At that point `total_loss` is a `float` instead of a `torch.Tensor`, because of L89 and L102 in the same file:
https://github.com/LLM-Tuning-Safety/LLMs-Finetuning-Safety/blob/8a3b38f11be1c3829e2b0ed379d3661ebc84e7db/llama2/utils/train_utils.py#L89
https://github.com/LLM-Tuning-Safety/LLMs-Finetuning-Safety/blob/8a3b38f11be1c3829e2b0ed379d3661ebc84e7db/llama2/utils/train_utils.py#L102
As a result, `dist.all_reduce` raises a `TypeError`. Log:
[rank0]: File "/home/user/workspace/LLMs-Finetuning-Safety/llama2/finetuning.py", line 265, in <module>
[rank0]: fire.Fire(main)
[rank0]: File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank0]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank0]: component, remaining_args = _CallAndUpdateTrace(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]: component = fn(*varargs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/workspace/LLMs-Finetuning-Safety/llama2/finetuning.py", line 248, in main
[rank0]: results = train(
[rank0]: ^^^^^^
[rank0]: File "/home/user/workspace/LLMs-Finetuning-Safety/llama2/utils/train_utils.py", line 127, in train
[rank0]: dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
[rank0]: File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2195, in all_reduce
[rank0]: _check_single_tensor(tensor, "tensor")
[rank0]: File "/home/user/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 863, in _check_single_tensor
[rank0]: raise TypeError(
[rank0]: TypeError: Invalid function argument. Expected parameter `tensor` of type torch.Tensor
[rank0]: but got <class 'float'> instead.
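A possible workaround, sketched below, is to wrap `total_loss` in a tensor before it is passed to `dist.all_reduce`. The helper name `all_reduce_loss` and its `device` parameter are hypothetical and not part of the repository; this is a minimal sketch assuming the running loss may arrive either as a Python `float` or as a `torch.Tensor`.

```python
import torch
import torch.distributed as dist


def all_reduce_loss(total_loss, device=None):
    """Sum a running loss across ranks, accepting a float or a tensor.

    Hypothetical helper for illustration; not part of the repository.
    """
    # dist.all_reduce only accepts torch.Tensor, so wrap plain Python numbers.
    # With the NCCL backend, pass the rank's CUDA device as `device`.
    if not isinstance(total_loss, torch.Tensor):
        total_loss = torch.tensor(total_loss, dtype=torch.float32, device=device)
    # Reduce only when a process group is active (i.e. a multi-GPU run).
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
    return total_loss


# Without an initialized process group the value is only converted.
reduced = all_reduce_loss(1.5)
print(type(reduced).__name__, reduced.item())
```

In `train_utils.py` this would correspond to converting `total_loss` just before the call at line 127. An alternative, also untested against the repository, would be to keep `total_loss` as a tensor throughout the loop so that no conversion is needed.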