LLaMA-Adapter
Error when running example.py
Hi, I want to run example.py on Windows 11, but I get weird socket errors:
```shell
(llama_adapter) C:\Users\jjovan\llama\ai\LLaMA-Adapter>python -m torch.distributed.run --nproc_per_node 1 example.py --ckpt_dir .\7B --tokenizer_path .\7B\tokenizer.model --adapter_path .\7B
```
```text
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - unknown error).
Traceback (most recent call last):
  File "example.py", line 119, in <module>
    fire.Fire(main)
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 90, in main
    local_rank, world_size = setup_model_parallel()
  File "example.py", line 35, in setup_model_parallel
    torch.distributed.init_process_group("nccl")
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\distributed_c10d.py", line 895, in init_process_group
    default_pg = _new_process_group_helper(
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\distributed_c10d.py", line 998, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 30096) of binary: C:\Users\jjovan\.conda\envs\llama_adapter\python.exe
Traceback (most recent call last):
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\run.py", line 798, in <module>
    main()
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\jjovan\.conda\envs\llama_adapter\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-19_10:13:02
  host      : jjovan.smart-com.si
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30096)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
Any idea?
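For context, the actual failure is the `RuntimeError: Distributed package doesn't have NCCL built in` — NCCL is Linux-only, so PyTorch's Windows builds don't ship it, and `setup_model_parallel` in example.py calls `torch.distributed.init_process_group("nccl")` unconditionally. A minimal sketch of a workaround, assuming the script otherwise works with the `gloo` backend on a single process (not verified here), is to pick the backend based on availability; `pick_dist_backend` is a hypothetical helper name:

```python
import platform


def pick_dist_backend(nccl_available: bool) -> str:
    """Pick a torch.distributed backend.

    NCCL is only available in Linux builds of PyTorch; Windows builds
    ship the gloo backend instead, so fall back to gloo there (or
    whenever NCCL is reported unavailable).
    """
    if platform.system() == "Windows" or not nccl_available:
        return "gloo"
    return "nccl"


# In example.py's setup_model_parallel, the unconditional call
#     torch.distributed.init_process_group("nccl")
# would then become something like:
#     backend = pick_dist_backend(torch.distributed.is_nccl_available())
#     torch.distributed.init_process_group(backend)
```

The repeated socket warnings are a separate issue: the rendezvous address is resolving to `kubernetes.docker.internal` (likely via a Docker Desktop hosts-file entry). Setting `MASTER_ADDR=127.0.0.1` in the environment before launching may silence them, though they are warnings, not the cause of the crash.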