jetson-containers
AGX Xavier dustynv/local_llm:r35.3.1 error
Hardware information: AGX Xavier, L4T 35.4.0, JetPack 5.1.2; I cloned the R35.4.1 branch.
I ran the commands below in the dustynv/local_llm:r35.3.1 container and got the following error:
root@agx-xavier:/data/models/mlc/dist/models# python3 -m local_llm --api=mlc --model=Llama-2-7b-chat-hf
/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py:123: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
12:35:57 | INFO | loading Llama-2-7b-chat-hf with MLC
12:35:57 | INFO | running MLC quantization:
python3 -m mlc_llm.build --model /data/models/mlc/dist/models/Llama-2-7b-chat-hf --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist
Using path "/data/models/mlc/dist/models/Llama-2-7b-chat-hf" for model "Llama-2-7b-chat-hf"
Target configured: cuda -keys=cuda,gpu -arch=sm_72 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_72 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Get old param: 0% | 0/197 [00:00<?, ?tensors/s]
Start computing and quantizing weights... This may take a while. | 0/327 [00:00<?, ?tensors/s]
Get old param: 98% | 194/197 [01:38<00:00, 4.06tensors/s]
Finish computing and quantizing weights. | 326/327 [01:38<00:00, 10.18tensors/s]
Total param size: 3.1569595336914062 GB
Start storing to cache /data/models/mlc/dist/Llama-2-7b-chat-hf-q4f16_ft/params
[0327/0327] saving param_326
All finished, 99 total shards committed, record saved to /data/models/mlc/dist/Llama-2-7b-chat-hf-q4f16_ft/params/ndarray-cache.json | 327/327 [01:50<00:00, 10.18tensors/s]
Finish exporting chat config to /data/models/mlc/dist/Llama-2-7b-chat-hf-q4f16_ft/params/mlc-chat-config.json
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/mlc_llm/build.py", line 47, in
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/Llama-2-7b-chat-hf --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 4096 --artifact-path /data/models/mlc/dist' returned non-zero exit status 1.
root@agx-xavier:/data/models/mlc/dist/models# nvidia-smi
bash: nvidia-smi: command not found
root@agx-xavier:/data/models/mlc/dist/models#
AssertionError: sm72 not supported yet.
@UserName-wang I believe MLC only supports SM_80 and newer (i.e., Orin on Jetson) due to the kernel optimizations used
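For reference, a quick way to confirm the GPU's compute capability from inside the container (assuming PyTorch is installed, as it is in these images) is:

python3 -c "import torch; print(torch.cuda.get_device_capability())"

On Xavier this prints (7, 2), i.e. sm_72, which is below the SM_80 class that the reply above says MLC's kernels require.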
@dusty-nv, thank you for your reply. Do you have any suggestions for users who need to run LLM applications on AGX Xavier?
@UserName-wang on Xavier I would use the llama.cpp container instead; it gets the second-best performance and supports quantization
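For example, a rough sketch of that route (the GGUF filename and binary location below are illustrative placeholders, not values taken from this thread):

jetson-containers run $(autotag llama_cpp)

and then, inside the container, run llama.cpp against a 4-bit GGUF model, e.g.:

./main -m /data/models/llama-2-7b-chat.Q4_K_M.gguf -ngl 999 -p "Once upon a time" -n 64

Here -ngl offloads all layers to the GPU and -n caps the number of generated tokens; adjust the model path to wherever your GGUF weights are mounted.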