torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Running into the same error on the 13b and 70b chat models, using an H100 80GB card. The 7b chat model works fine.
Command (13b):
torchrun --nproc_per_node 2 example_chat_completion.py --ckpt_dir llama-2-13b-chat/ --tokenizer_path tokenizer.model --max_seq_len 4096 --max_batch_size 4
Error:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
File "example_chat_completion.py", line 149, in <module>
fire.Fire(main)
File "/home/ubuntu/.local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/ubuntu/.local/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/ubuntu/.local/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "example_chat_completion.py", line 20, in main
generator = Llama.build(
File "/home/ubuntu/llama/llama/generation.py", line 69, in build
torch.cuda.set_device(local_rank)
File "/usr/lib/python3/dist-packages/torch/cuda/__init__.py", line 350, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 74007 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 74008) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/bin/torchrun", line 11, in <module>
load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')()
File "/usr/lib/python3/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 344, in wrapper
return f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/lib/python3/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/lib/python3/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/lib/python3/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-19_16:31:42
host : 209-20-158-162
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 74008)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
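For what it's worth, the `invalid device ordinal` above usually means torchrun started more workers than there are visible GPUs: the 13B checkpoint ships as 2 model-parallel shards (70B as 8), so the example expects `--nproc_per_node` to match, but on a single H100 the rank-1 worker calls `torch.cuda.set_device(1)` and there is no device 1. A minimal pre-flight sketch (nothing Llama-specific, just the device count check):

```python
import torch

# Sketch of a pre-flight check: --nproc_per_node must not exceed the number
# of GPUs that PyTorch can actually see on this machine.
nproc_per_node = 2  # value used in the 13B command above
visible_gpus = torch.cuda.device_count()

print(f"visible CUDA devices: {visible_gpus}")
if nproc_per_node > visible_gpus:
    print(
        f"torchrun would spawn {nproc_per_node} workers, but ranks >= {visible_gpus} "
        "have no GPU to bind to and fail in torch.cuda.set_device() "
        "with 'invalid device ordinal'."
    )
```

As far as I can tell, the reference loader also expects the number of workers to equal the number of checkpoint shards, so running 13B/70B on one card needs either more GPUs or a different (consolidated/quantized) loading path.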
I faced the same issue with 7B: The client socket has failed to connect to [IN31GFRRL143ZWD.ap.wkglobal.com]:29500 (system error: 10049 - unknown error). Do you know how to solve this?
I'm getting this error too. 7B is the only model I've tried so far, as 70B was a little too big for me.
Hi,
I am also experiencing the same issue while using the 7B model in a Jupyter notebook.
Logs attached below for reference.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 151.59 seconds
Traceback (most recent call last):
File "/home/jupyter/llama2/llama/example_chat_completion.py", line 90, in <module>
fire.Fire(main)
File "/opt/conda/envs/llm/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/envs/llm/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/envs/llm/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/jupyter/llama2/llama/example_chat_completion.py", line 73, in main
results = generator.chat_completion(
File "/home/jupyter/llama2/llama/llama/generation.py", line 270, in chat_completion
generation_tokens, generation_logprobs = self.generate(
File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/jupyter/llama2/llama/llama/generation.py", line 122, in generate
assert max_prompt_len <= params.max_seq_len
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2914614) of binary: /opt/conda/envs/llm/bin/python3.10
Traceback (most recent call last):
File "/opt/conda/envs/llm/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-07-24_06:54:13
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2914614)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
(llm) jupyter@umn-20230717-150749:~/llama2/llama$
Any thoughts on how to resolve this issue?
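The `AssertionError` in that log is the `assert max_prompt_len <= params.max_seq_len` check in `generation.py`: at least one dialog tokenizes to more tokens than the `--max_seq_len` you launched with, so either raise `--max_seq_len` (memory permitting) or shorten the prompts. A rough way to measure a prompt before calling `chat_completion` (just a sketch; the tokenizer path and prompt text are placeholders, and the real chat format adds a few extra special tokens on top of this count):

```python
from sentencepiece import SentencePieceProcessor

# Placeholders -- point this at the tokenizer.model you downloaded and the
# text you are about to send to chat_completion.
sp = SentencePieceProcessor(model_file="tokenizer.model")
prompt = "your user/system message here"
max_seq_len = 512  # whatever you pass as --max_seq_len

n_tokens = len(sp.encode(prompt))
print(f"{n_tokens} prompt tokens vs. max_seq_len={max_seq_len}")
if n_tokens >= max_seq_len:
    print("This prompt already fills max_seq_len; generation.py will hit "
          "the assertion (or have no room left to generate).")
```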
I solved this error message. I don't have an AMD or NVIDIA graphics card, so I installed the CPU version from https://github.com/krychu/llama
instead of https://github.com/facebookresearch/llama.
Complete process to install:
- Download the original version of Llama from https://github.com/facebookresearch/llama and extract it to a `llama-main` folder.
- Download the CPU version from https://github.com/krychu/llama, extract it, and replace the files in the `llama-main` folder.
- Run the `download.sh` script in a terminal, passing the URL provided when prompted, to start the download.
- Go to the `llama-main` folder.
- Create a Python 3 env: `python3 -m venv env` and activate it: `source env/bin/activate`.
- Install the CPU version of PyTorch: `python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu`
- Install the llama dependencies: `python3 -m pip install -e .`
- If you have downloaded llama-2-7b, run:
torchrun --nproc_per_node 1 example_text_completion.py \
--ckpt_dir llama-2-7b/ \
--tokenizer_path tokenizer.model \
--max_seq_len 128 --max_batch_size 1 # (instead of 4)
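If you go the CPU-only route, it's worth confirming that the CPU wheel is what actually ended up in the env before launching torchrun; a quick check (nothing specific to the krychu fork):

```python
import torch

# CPU-only wheels from the /whl/cpu index usually report a "+cpu" version
# suffix and ship without a CUDA runtime.
print(torch.__version__)          # e.g. "2.1.0+cpu"
print(torch.version.cuda)         # None for CPU-only builds
print(torch.cuda.is_available())  # False -- the examples then run on the CPU
```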
Same issue on Windows 11 with 32 GB RAM and an RTX 3090 (24 GB VRAM), trying to run 7B. I have already tried different versions of CUDA and PyTorch without improvement. CPU is not an option for me. Any ideas? Here is my error:
(llama2env) PS Y:\231125 LLAMA2\llama-main> torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir ..\llama-2-7b-chat\ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6
[2023-11-27 20:17:04,777] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [TROG2020]:29500 (system error: 10049 - The requested address is not valid in this context.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [TROG2020]:29500 (system error: 10049 - The requested address is not valid in this context.).
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
File "Y:\231125 LLAMA2\llama-main\example_chat_completion.py", line 106, in <module>
fire.Fire(main)
File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\fire\core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "Y:\231125 LLAMA2\llama-main\example_chat_completion.py", line 37, in main
generator = Llama.build(
File "Y:\231125 LLAMA2\llama-main\llama\generation.py", line 116, in build
tokenizer = Tokenizer(model_path=tokenizer_path)
File "Y:\231125 LLAMA2\llama-main\llama\tokenizer.py", line 24, in __init__
assert os.path.isfile(model_path), model_path
AssertionError: tokenizer.model
[2023-11-27 20:17:19,804] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 12048) of binary: Y:\231125 LLAMA2\llama2env\Scripts\python.exe
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "Y:\231125 LLAMA2\llama2env\Scripts\torchrun.exe\__main__.py", line 7, in <module>
File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\run.py", line 806, in main
run(args)
File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\run.py", line 797, in run
elastic_launch(
File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-11-27_20:17:19
host : XXX
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 12048)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
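Buried in that launcher noise, the actual failure is `AssertionError: tokenizer.model` from `assert os.path.isfile(model_path)` in `tokenizer.py`: `--tokenizer_path tokenizer.model` is resolved relative to the directory you launch from, and no such file exists there. The socket warnings about port 29500 are only warnings in this log. A quick way to find the right value to pass (the candidate paths below are guesses based on the command above, adjust as needed):

```python
import os

# Hypothetical candidate locations, based on the --ckpt_dir used above;
# adjust to wherever download.sh actually put tokenizer.model.
candidates = [
    r"tokenizer.model",                     # relative to the launch directory
    r"..\tokenizer.model",                  # one level up, next to the model folders
    r"..\llama-2-7b-chat\tokenizer.model",  # inside the checkpoint directory
]

for path in candidates:
    print(f"{os.path.abspath(path)} -> exists: {os.path.isfile(path)}")
# Pass the path that prints True (an absolute path is safest) to --tokenizer_path.
```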