
worker_process_entrypoint FAILED

Open · ahsaan-habib opened this issue on Jul 24, 2024 · 1 comment

I tried this on my Ubuntu 22.04 machine, but it gives the following error.

E0724 19:33:34.565000 128818126430656 torch/distributed/elastic/multiprocessing/api.py:702] failed (exitcode: -9) local_rank: 0 (pid: 30194) of fn: worker_process_entrypoint (start_method: fork)
E0724 19:33:34.565000 128818126430656 torch/distributed/elastic/multiprocessing/api.py:702] Traceback (most recent call last):
E0724 19:33:34.565000 128818126430656 torch/distributed/elastic/multiprocessing/api.py:702]   File "/home/aleya/Work/Habibi/llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 659, in _poll
E0724 19:33:34.565000 128818126430656 torch/distributed/elastic/multiprocessing/api.py:702]     self._pc.join(-1)
E0724 19:33:34.565000 128818126430656 torch/distributed/elastic/multiprocessing/api.py:702]   File "/home/aleya/Work/Habibi/llama/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 170, in join
E0724 19:33:34.565000 128818126430656 torch/distributed/elastic/multiprocessing/api.py:702]     raise ProcessExitedException(
E0724 19:33:34.565000 128818126430656 torch/distributed/elastic/multiprocessing/api.py:702] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Process ForkProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/Habib/llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/parallel_utils.py", line 175, in launch_dist_group
    elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
  File "/home/Habib/llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/Habib/llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

======================================================
worker_process_entrypoint FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2024-07-24_19:33:34
  host       : Habib-Stealth-15M-A11SDK
  rank       : 0 (local_rank: 0)
  exitcode   : -9 (pid: 30194)
  error_file : <N/A>
  traceback  : Signal 9 (SIGKILL) received by PID 30194
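For context on the log above: exitcode -9 means the worker was killed by SIGKILL from outside the process, which on Linux most often points to the kernel OOM killer terminating the model-loading worker when RAM runs out. Below is a minimal pre-flight sketch, assuming `psutil` is installed; the model-size figure is a placeholder for illustration, not something taken from this report.

```python
# Sketch only: rough pre-flight memory check before starting the inference worker.
# Assumes psutil is installed; the model-size estimate is a placeholder,
# not a figure taken from this issue.
import psutil

# Very rough RAM needed for an 8B-parameter model in bf16 (~2 bytes per parameter).
APPROX_MODEL_BYTES = 8_000_000_000 * 2

avail = psutil.virtual_memory().available
print(f"Available RAM : {avail / 1e9:.1f} GB")
print(f"Estimated need: {APPROX_MODEL_BYTES / 1e9:.1f} GB")

if avail < APPROX_MODEL_BYTES:
    print("Likely not enough free RAM; the kernel OOM killer may SIGKILL the worker.")

# After a crash, an OOM kill can usually be confirmed from the kernel log (may need sudo):
#     dmesg | grep -iE "out of memory|killed process"
```

If the kernel log does show an OOM kill, adding swap or running on a machine with more memory typically resolves this class of failure.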
