
[Windows]: RuntimeError: Distributed package doesn't have NCCL built in

Open · SkibaSAY opened this issue · 0 comments

Hi, I am trying to run train.py on Windows. Please help me solve this problem.

System parameters:

- CPU: 12th Gen Intel(R) Core(TM) i5-12600KF @ 3.70 GHz
- RAM: 32 GB
- CUDA: 11.8
- OS: Windows 11 Pro
- Python: 3.10.11

Command:

```
torchrun --nproc_per_node=1 train.py --model_name_or_path "D:\torrents\LLaMA\models\Alpaca_7B.bin" --data_path "D:\torrents\LLaMA\train_data\alpaca_protocol_train_data.json" --bf16 True --output_dir "D:\torrents\LLaMA\models\trained" --num_train_epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --fsdp "full_shard auto_wrap" --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' --tf32 True
```

Error 1:

```
NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-IECM8DM]:29500 (system error: 10049 - ...).
```

Traceback 1:

```
Traceback (most recent call last):
  File "D:\torrents\Stanford_Alpaca\stanford_alpaca\train.py", line 222, in <module>
    train()
  File "D:\torrents\Stanford_Alpaca\stanford_alpaca\train.py", line 184, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\hf_argparser.py", line 346, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 113, in __init__
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\training_args.py", line 1340, in __post_init__
    and (self.device.type != "cuda")
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\training_args.py", line 1764, in device
    return self._setup_devices
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\utils\generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\training_args.py", line 1695, in _setup_devices
    self.distributed_state = PartialState(backend=self.ddp_backend)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\state.py", line 191, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
```
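For what it's worth, the final raise reproduces outside the Trainer entirely. A minimal sketch of my own (assumes a Windows build of PyTorch; the rendezvous variables are stand-ins for what torchrun would set):

```python
# Minimal repro of the error, independent of train.py.
import os
import torch.distributed as dist

# Stand-in rendezvous settings, normally provided by torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Windows wheels of PyTorch are compiled without NCCL, so this raises:
# RuntimeError: Distributed package doesn't have NCCL built in
dist.init_process_group(backend="nccl", rank=0, world_size=1)
```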

Error 2:

```
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 920468) of binary: C:\Users\User\AppData\Local\Programs\Python\Python310\python.exe
```

Traceback 2:

```
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
```

Similar error in another repository

I found a similar error in another repository: https://github.com/XavierXiao/Dreambooth-Stable-Diffusion/issues/65. As far as I understand, this happens because NCCL is not available on Windows.
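A quick way to confirm which backends a given PyTorch build actually ships with (my own sanity check, not from that thread):

```python
import torch.distributed as dist

print(dist.is_available())       # is distributed support compiled in at all?
print(dist.is_nccl_available())  # False on Windows wheels
print(dist.is_gloo_available())  # True on Windows builds with distributed support
```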

As a solution, they suggest setting an environment variable: PL_TORCH_DISTRIBUTED_BACKEND=gloo.

That did not work for me, but another solution was proposed there: add the following to the code (shown here with the imports it needs):

```python
import os
import sys

# Force the gloo backend on Windows, where NCCL is unavailable.
if sys.platform == "win32":
    os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
```
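A note on the snippet above: as far as I can tell, PL_TORCH_DISTRIBUTED_BACKEND is a PyTorch Lightning variable, while train.py uses the Hugging Face Trainer, which may explain why it has no effect here. Traceback 1 shows the backend coming from `PartialState(backend=self.ddp_backend)`, i.e. from the `ddp_backend` training argument, so passing it on the command line might be the more direct route (an untested assumption on my part):

```
torchrun --nproc_per_node=1 train.py ... --ddp_backend gloo
```

Whether FSDP `full_shard` actually works over gloo on Windows is a separate question; this would only address the backend-selection error.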

SkibaSAY · Jul 11 '23 10:07