Quantization (FP8) causing errors

Open ShadiCopty opened this issue 1 year ago • 2 comments

Hi, I'm running llama stack on an Ubuntu machine with a PA6000 GPU on Paperspace, loading the default model. With quantization turned off I can get the server up and running; when I choose fp8 during the build, llama stack run fails with the following error:

paperspace@pspz9spk7exu:~$ llama stack run lcl --port 5000
Resolved 8 providers in topological order
  Api.models: routing_table
  Api.inference: router
  Api.shields: routing_table
  Api.safety: router
  Api.memory_banks: routing_table
  Api.memory: router
  Api.agents: meta-reference
  Api.telemetry: meta-reference

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1

/home/paperspace/anaconda3/envs/llamastack-lcl/lib/python3.10/site-packages/torch/__init__.py:955: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:432.)
  _C._set_default_tensor_type(t)
E1008 04:49:31.950000 140333789631552 torch/distributed/elastic/multiprocessing/api.py:702] failed (exitcode: -9) local_rank: 0 (pid: 5620) of fn: worker_process_entrypoint (start_method: fork)
E1008 04:49:31.950000 140333789631552 torch/distributed/elastic/multiprocessing/api.py:702] Traceback (most recent call last):
E1008 04:49:31.950000 140333789631552 torch/distributed/elastic/multiprocessing/api.py:702]   File "/home/paperspace/anaconda3/envs/llamastack-lcl/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 659, in _poll
E1008 04:49:31.950000 140333789631552 torch/distributed/elastic/multiprocessing/api.py:702]     self._pc.join(-1)
E1008 04:49:31.950000 140333789631552 torch/distributed/elastic/multiprocessing/api.py:702]   File "/home/paperspace/anaconda3/envs/llamastack-lcl/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 170, in join
E1008 04:49:31.950000 140333789631552 torch/distributed/elastic/multiprocessing/api.py:702]     raise ProcessExitedException(
E1008 04:49:31.950000 140333789631552 torch/distributed/elastic/multiprocessing/api.py:702] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Process ForkProcess-1:
Traceback (most recent call last):
  File "/home/paperspace/anaconda3/envs/llamastack-lcl/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/paperspace/anaconda3/envs/llamastack-lcl/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/paperspace/anaconda3/envs/llamastack-lcl/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/parallel_utils.py", line 175, in launch_dist_group
    elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
  File "/home/paperspace/anaconda3/envs/llamastack-lcl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/paperspace/anaconda3/envs/llamastack-lcl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
worker_process_entrypoint FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-08_04:49:31
  host      : pspz9spk7exu
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 5620)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 5620
=====================================================

ShadiCopty commented Oct 08 '24 14:10

Can you include the run.yaml? Also, which model were you trying to use?

raghotham commented Oct 10 '24 06:10

Absolutely. Model: Llama3.1-8B-Instruct

run.yaml.zip

Removing the fp8 setting gets this stack to work. Let me know if you need more info about the system.
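
For reference, the only difference between the failing and the working setup is the quantization block under the inference provider's config in run.yaml, roughly like this (a sketch from memory; the surrounding layout and exact field names may not match the attached file exactly):

    config:
      model: Llama3.1-8B-Instruct
      # ... other provider settings unchanged ...
      quantization:   # removing this block is the change that makes the server start
        type: fp8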

ShadiCopty commented Oct 10 '24 14:10

fp8 issues have been fixed. @ShadiCopty can you retry when you get a chance? Please re-open if not fixed.
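
If you installed from PyPI, something like this should pull in the fix before retrying (just a suggestion; if you built a conda environment via llama stack build, rebuilding that environment is the safer route):

    pip install -U llama-stack
    pip show llama-stack   # confirm the upgraded version is the one in your environment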

ashwinb commented Oct 22 '24 21:10

/home/paperspace/anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/__init__.py:1145: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:432.)
  _C._set_default_tensor_type(t)
E1023 03:48:27.577000 3035 anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:732] failed (exitcode: -9) local_rank: 0 (pid: 3038) of fn: worker_process_entrypoint (start_method: fork)
E1023 03:48:27.577000 3035 anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:732] Traceback (most recent call last):
E1023 03:48:27.577000 3035 anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/paperspace/anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 687, in _poll
E1023 03:48:27.577000 3035 anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     self._pc.join(-1)
E1023 03:48:27.577000 3035 anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/paperspace/anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
E1023 03:48:27.577000 3035 anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     raise ProcessExitedException(
E1023 03:48:27.577000 3035 anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:732] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Process ForkProcess-1:
Traceback (most recent call last):
  File "/home/paperspace/anaconda3/envs/llamastack-mylocal/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/paperspace/anaconda3/envs/llamastack-mylocal/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/paperspace/anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/inference/parallel_utils.py", line 281, in launch_dist_group
    elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
  File "/home/paperspace/anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/paperspace/anaconda3/envs/llamastack-mylocal/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
worker_process_entrypoint FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-23_03:48:27
  host      : pspz9spk7exu
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 3038)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 3038
=====================================================

ShadiCopty commented Oct 23 '24 03:10

@ashwinb Still failing. I removed all of the old installation to be sure, and I'm using the meta-reference-quantized implementation with fp8.
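
One more observation from the trace: exitcode -9 means the worker was SIGKILLed from outside Python rather than raising an exception, which on a machine like this usually points at the kernel OOM killer firing while the fp8 path loads and converts the weights. I haven't confirmed that here; a generic way to check the kernel log around the crash time would be:

    sudo dmesg -T | grep -i -E "out of memory|killed process"
    # or, on systemd-based systems:
    journalctl -k --since "1 hour ago" | grep -i "killed process"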

ShadiCopty commented Oct 23 '24 03:10