llama-stack-apps
Error: Failed to initialize the TMA descriptor 801
Good day everyone, I am trying to run the llama agentic system on an RTX 4090 with FP8 quantization for the inference model and meta-llama/Llama-Guard-3-8B-INT8 for the guard. With a sufficiently small max_seq_len everything fits into 24 GB of VRAM and I can start the inference server and the chat app. However, as soon as I send a message in the chat I get the following error: "Error: Failed to initialize the TMA descriptor 801".
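For reference, here is a minimal standalone script I put together to try to reproduce the failing FBGEMM call outside of the server. The op name torch.ops.fbgemm.f8f8bf16_rowwise and the matrix shapes are taken from the traceback and TMA descriptor dump below; the import path and the exact argument layout (row-wise FP8 scales for activations and weights) are my assumptions based on the fbgemm_gpu gen_ai examples, so treat it as a sketch rather than the exact toolchain code path:

# Sketch: try to reproduce the failing FP8 rowwise GEMM outside the server.
# Assumptions: the gen_ai import registers the op, and f8f8bf16_rowwise takes
# (XQ, WQ, x_scale, w_scale) with per-row float32 scales.
import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401  (registers torch.ops.fbgemm.* FP8 ops)

print("compute capability:", torch.cuda.get_device_capability())  # RTX 4090 reports (8, 9)

M, K, N = 53, 4096, 14336  # shapes taken from the TMA descriptor dump below
x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
w = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)

def quantize_rowwise(t):
    # Naive per-row FP8 quantization, just for the repro (not necessarily the
    # scheme llama_toolchain uses internally).
    scale = t.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / torch.finfo(torch.float8_e4m3fn).max
    return (t / scale).to(torch.float8_e4m3fn), scale.squeeze(1).float()

xq, x_scale = quantize_rowwise(x)
wq, w_scale = quantize_rowwise(w)

# On my machine this is the call that fails with "cutlass cannot initialize".
y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)
print(y.shape, y.dtype)

The full log from the inference server follows: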
(venv) trainer@pc-aiml:~/.llama$ llama inference start --disable-ipv6
/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/utils.py:43: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
initialize(config_path=relative_path)
Loading config from : /home/trainer/.llama/configs/inference.yaml
Yaml config:
------------------------
inference_config:
  impl_config:
    impl_type: inline
    checkpoint_config:
      checkpoint:
        checkpoint_type: pytorch
        checkpoint_dir: /home/trainer/.llama/checkpoints/Meta-Llama-3.1-8B-Instruct/original
        tokenizer_path: /home/trainer/.llama/checkpoints/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model
        model_parallel_size: 1
        quantization_format: bf16
    quantization:
      type: fp8
    torch_seed: null
    max_seq_len: 2048
    max_batch_size: 1
------------------------
Listening on 0.0.0.0:5000
INFO: Started server process [20033]
INFO: Waiting for application startup.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/__init__.py:955: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:432.)
_C._set_default_tensor_type(t)
Using efficient FP8 operators in FBGEMM.
Quantizing fp8 weights from bf16...
Loaded in 7.05 seconds
Finished model load YES READY
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO: 127.0.0.1:55838 - "POST /inference/chat_completion HTTP/1.1" 200 OK
TMA Desc Addr: 0x7ffdd6221440
format 0
dim 3
gmem_address 0x7eb74f4bde00
globalDim (4096,53,1,1,1)
globalStrides (1,4096,0,0,0)
boxDim (128,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 801
TMA Desc Addr: 0x7ffdd6221440
format 0
dim 3
gmem_address 0x7eb3ea000000
globalDim (4096,14336,1,1,1)
globalStrides (1,4096,0,0,0)
boxDim (128,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 801
TMA Desc Addr: 0x7ffdd6221440
format 9
dim 3
gmem_address 0x7eb3e9c00000
globalDim (14336,53,1,1,1)
globalStrides (2,28672,0,0,0)
boxDim (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 2
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 801
TMA Desc Addr: 0x7ffdd6221440
format 9
dim 3
gmem_address 0x7eb3e9c00000
globalDim (14336,53,1,1,1)
globalStrides (2,28672,0,0,0)
boxDim (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 2
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 801
[debug] got exception cutlass cannot initialize
Traceback (most recent call last):
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/parallel_utils.py", line 80, in retrieve_requests
for obj in out:
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/generation.py", line 287, in chat_completion
yield from self.generate(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
response = gen.send(None)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/generation.py", line 205, in generate
logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_models/llama3_1/api/model.py", line 321, in forward
h = layer(h, start_pos, freqs_cis, mask)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_models/llama3_1/api/model.py", line 268, in forward
out = h + self.feed_forward(self.ffn_norm(h))
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/loader.py", line 43, in swiglu_wrapper
out = ffn_swiglu(x, self.w1.weight, self.w3.weight, self.w2.weight)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/fp8_impls.py", line 62, in ffn_swiglu
return ffn_swiglu_fp8_dynamic(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/fp8_impls.py", line 165, in ffn_swiglu_fp8_dynamic
x1 = fc_fp8_dynamic(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/quantization/fp8_impls.py", line 146, in fc_fp8_dynamic
y = torch.ops.fbgemm.f8f8bf16_rowwise(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/_ops.py", line 1061, in __call__
return self_._op(*args, **(kwargs or {}))
RuntimeError: cutlass cannot initialize
[debug] got exception cutlass cannot initialize
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 265, in __call__
await wrap(partial(self.listen_for_disconnect, receive))
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 261, in wrap
await func()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7ebc8462f0d0
During handling of the above exception, another exception occurred:
+ Exception Group Traceback (most recent call last):
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
| result = await app( # type: ignore[func-returns-value]
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
| return await self.app(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
| await super().__call__(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
| raise exc
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
| await self.app(scope, receive, _send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
| await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
| raise exc
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| await app(scope, receive, sender)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
| await self.middleware_stack(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
| await route.handle(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
| await self.app(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
| await wrap_app_handling_exceptions(app, request)(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
| raise exc
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
| await app(scope, receive, sender)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/routing.py", line 75, in app
| await response(scope, receive, send)
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 258, in __call__
| async with anyio.create_task_group() as task_group:
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
| raise BaseExceptionGroup(
| exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 261, in wrap
| await func()
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/starlette/responses.py", line 250, in stream_response
| async for chunk in self.body_iterator:
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/server.py", line 84, in sse_generator
| async for event in event_gen:
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/server.py", line 94, in event_gen
| async for event in InferenceApiInstance.chat_completion(exec_request):
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/inference.py", line 58, in chat_completion
| for token_result in self.generator.chat_completion(
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/model_parallel.py", line 104, in chat_completion
| yield from gen
| File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/parallel_utils.py", line 255, in run_inference
| raise obj
| RuntimeError: cutlass cannot initialize
+------------------------------------
^CW0729 16:50:42.785000 139352429928448 torch/distributed/elastic/agent/server/api.py:688] Received Signals.SIGINT death signal, shutting down workers
W0729 16:50:42.785000 139352429928448 torch/distributed/elastic/multiprocessing/api.py:734] Closing process 20066 via signal SIGINT
Exception ignored in: <function Context.__del__ at 0x7ebc85d2e950>
Traceback (most recent call last):
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/zmq/sugar/context.py", line 142, in __del__
self.destroy()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/zmq/sugar/context.py", line 324, in destroy
self.term()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/zmq/sugar/context.py", line 266, in term
super().term()
File "_zmq.py", line 545, in zmq.backend.cython._zmq.Context.term
File "_zmq.py", line 141, in zmq.backend.cython._zmq._check_rc
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 20066 got signal: 2
INFO: Shutting down
Process ForkProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/parallel_utils.py", line 175, in launch_dist_group
elastic_launch(launch_config, entrypoint=worker_process_entrypoint)(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
result = agent.run()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
result = self._invoke_run(role)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 835, in _invoke_run
time.sleep(monitor_interval)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 79, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 20064 got signal: 2
INFO: Waiting for application shutdown.
shutting down
INFO: Application shutdown complete.
INFO: Finished server process [20033]
SIGINT or CTRL-C detected. Exiting gracefully (2, <frame at 0x7ebc8644bc40, file '/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/server.py', line 328, code capture_signals>)
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/trainer/.llama/venv/bin/llama", line 8, in <module>
sys.exit(main())
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/cli/llama.py", line 54, in main
parser.run(args)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/cli/llama.py", line 48, in run
args.func(args)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/cli/inference/start.py", line 53, in _run_inference_start_cmd
inference_server_init(
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/llama_toolchain/inference/server.py", line 115, in main
uvicorn.run(app, host=listen_host, port=port)
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/main.py", line 577, in run
server.run()
File "/home/trainer/.llama/venv/lib/python3.10/site-packages/uvicorn/server.py", line 65, in run
return asyncio.run(self.serve(sockets=sockets))
File "/usr/lib/python3.10/asyncio/runners.py", line 48, in run
loop.run_until_complete(loop.shutdown_asyncgens())
File "uvloop/loop.pyx", line 1515, in uvloop.loop.Loop.run_until_complete
RuntimeError: Event loop stopped before Future completed.
I would appreciate any help and suggestions. Thank you in advance.