
Error when running example.py

Open · reddiamond1234 opened this issue · 1 comment

Hi, I want to run example.py on Windows 11, but I get strange socket-related errors:

```
(llama_adapter) C:\Users\jjovan\llama\ai\LLaMA-Adapter>python -m torch.distributed.run --nproc_per_node 1 example.py --ckpt_dir .\7B --tokenizer_path .\7B\tokenizer.model --adapter_path .\7B
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
Traceback (most recent call last):
  File "example.py", line 119, in <module>
    fire.Fire(main)
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 90, in main
    local_rank, world_size = setup_model_parallel()
  File "example.py", line 35, in setup_model_parallel
    torch.distributed.init_process_group("nccl")
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\distributed_c10d.py", line 998, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30096) of binary: C:\Users\jjovan\.conda\envs\llama_adapter\python.exe
Traceback (most recent call last):
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\run.py", line 798, in <module>
    main()
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2023-04-19_10:13:02
  host      : jjovan.smart-com.si
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30096)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
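For context, the root error is `RuntimeError: Distributed package doesn't have NCCL built in`: NCCL only ships with Linux builds of PyTorch, so the hard-coded `torch.distributed.init_process_group("nccl")` at example.py line 35 cannot work on Windows. A minimal sketch of a workaround I'm considering, assuming the `gloo` backend is acceptable for single-process inference (the `pick_backend` helper and the body of `setup_model_parallel` are my own guesses, not the repo's code):

```python
import os
import sys


def pick_backend(platform: str = sys.platform) -> str:
    # NCCL is bundled only with Linux builds of PyTorch; Windows and
    # macOS builds ship with the gloo backend instead.
    return "nccl" if platform.startswith("linux") else "gloo"


def setup_model_parallel():
    # Hypothetical rewrite of the failing function: pick a
    # platform-appropriate backend instead of hard-coding "nccl".
    import torch  # imported lazily so the helper above is testable alone

    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    torch.distributed.init_process_group(pick_backend())
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    return local_rank, world_size
```

Passing `--master_addr 127.0.0.1` to `torch.distributed.run` might also silence the `kubernetes.docker.internal` socket warnings, though I haven't confirmed that.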

Any idea?

reddiamond1234 · Apr 19, 2023