stanford_alpaca
[Windows]: RuntimeError: Distributed package doesn't have NCCL built in
Hi, I am trying to run train.py on Windows and it fails with the errors below. Please help me solve the problem.
System parameters
CPU: 12th Gen Intel(R) Core(TM) i5-12600KF 3.70 GHz
RAM: 32 GB
CUDA: 11.8
OS: Windows 11 Pro
Python: 3.10.11
Command:
torchrun --nproc_per_node=1 train.py --model_name_or_path "D:\torrents\LLaMA\models\Alpaca_7B.bin" --data_path "D:\torrents\LLaMA\train_data\alpaca_protocol_train_data.json" --bf16 True --output_dir "D:\torrents\LLaMA\models\trained" --num_train_epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' --tf32 True
Error 1:
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-IECM8DM]:29500 (system error: 10049 - ...).
Traceback 1
Traceback (most recent call last):
File "D:\torrents\Stanford_Alpaca\stanford_alpaca\train.py", line 222, in <module>
Error 2
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 920468) of binary: C:\Users\User\AppData\Local\Programs\Python\Python310\python.exe
Traceback 2
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\User\AppData\Local\Programs\Python\Python310\Scripts\torchrun.exe\__main__.py", line 7, in <module>
Similar error in another repository
I found a similar error reported in another repository: https://github.com/XavierXiao/Dreambooth-Stable-Diffusion/issues/65 As far as I understand, it happens because the NCCL backend is not available on Windows.
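To confirm that the installed PyTorch build really lacks NCCL, something like the following check may help. It is only a diagnostic sketch: the helper name `available_backends` is made up for illustration, and it relies on `torch.distributed.is_available()`, `is_nccl_available()` and `is_gloo_available()`.

```python
import sys


def available_backends():
    """Return the distributed backends this PyTorch build reports as usable.

    Hypothetical helper for diagnosis only; it is not part of train.py.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return []  # PyTorch is not installed in this environment
    if not dist.is_available():
        return []  # this build was compiled without distributed support
    found = []
    if dist.is_nccl_available():
        found.append("nccl")
    if dist.is_gloo_available():
        found.append("gloo")
    return found


if __name__ == "__main__":
    # Print the platform next to the backends so the result is easy to report.
    print(sys.platform, available_backends())
```

If "nccl" is missing from the list, the RuntimeError above is expected for any code path that requests the NCCL backend.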
The fix suggested in that issue is to set the environment variable PL_TORCH_DISTRIBUTED_BACKEND=gloo.
Setting the environment variable did not work for me, but the same issue proposes another fix, which is to set it in the code: if sys.platform == "win32": os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
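Written out in full, the in-code workaround from the linked issue might look like the sketch below. The function names `pick_backend` and `configure_backend_env` are hypothetical. As far as I know, PL_TORCH_DISTRIBUTED_BACKEND is a PyTorch Lightning setting (the Dreambooth repository uses Lightning), so it is an open assumption whether train.py in this repository reads it at all. The assignment would have to run before any distributed initialization.

```python
import os
import sys


def pick_backend(platform=sys.platform):
    """Return "gloo" on Windows, where NCCL is unavailable, else "nccl"."""
    return "gloo" if platform == "win32" else "nccl"


def configure_backend_env(env=os.environ, platform=sys.platform):
    """Set PL_TORCH_DISTRIBUTED_BACKEND when gloo is needed.

    Assumption: the training code honors this variable (PyTorch Lightning
    convention); this has not been confirmed for train.py.
    """
    backend = pick_backend(platform)
    if backend == "gloo":
        env["PL_TORCH_DISTRIBUTED_BACKEND"] = backend
    return backend


if __name__ == "__main__":
    # Call this at the very top of the training script, before the trainer
    # or any process group is created.
    print(configure_backend_env())
```

On non-Windows platforms the environment is left untouched, so the same script can be shared across machines.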