nim-anywhere
Does not run on Titan RTX - demands Bfloat16
Received this message when starting the LLM.
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA TITAN RTX GPU has compute capability 7.5. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
Is there a way to
- Set the desired precision? (See the sketch after this list for what the underlying vLLM engine accepts.)
- Select a different model that works for my Titan RTX? What models could be swapped in for
LLM_NIM_0_MODEL=meta/llama3-8b-instruct
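For context on the --dtype=half suggestion in the error message: this is only a sketch of what the underlying vLLM engine accepts when you control its arguments directly (assuming plain vLLM and the public Hugging Face model id, not the NIM profile). Whether the NIM container passes such a flag through is exactly the question.

```python
# Sketch only: plain vLLM (not the NIM wrapper) lets you force float16 instead of
# bfloat16. The NIM container assembles these engine args itself, so this is just
# what the --dtype=half suggestion in the error message maps to.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed HF id; the NIM uses its own profile
    dtype="half",  # float16; avoids the bfloat16 check that fails on compute capability 7.5
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```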
The model is run with
INFO 07-20 23:02:05.815 ngc_injector.py:146] Profile metadata: tp: 1
INFO 07-20 23:02:05.815 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 07-20 23:02:05.815 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 07-20 23:02:05.815 ngc_injector.py:146] Profile metadata: llm_engine: vllm
INFO 07-20 23:02:05.815 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 07-20 23:02:08.174 ngc_injector.py:172] Model workspace is now ready. It took 2.359 seconds
INFO 07-20 23:02:08.180 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-0_0f1rb6', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-0_0f1rb6', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
Is there any way to change the type? I tried running this with other models that were described as able to run with float16, but the startup always seems to choose bfloat16. Ref: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md
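The failure comes from vLLM's dtype guard (_check_if_gpu_supports_dtype in the traceback below). A minimal stand-alone version of that check, assuming PyTorch with a CUDA device available:

```python
# Rough stand-alone version of the guard in vllm/worker/worker.py that raises the
# ValueError: bfloat16 needs compute capability >= 8.0 (Ampere or newer).
import torch

major, minor = torch.cuda.get_device_capability()  # Titan RTX (Turing) reports (7, 5)
if (major, minor) < (8, 0):
    print(f"Compute capability {major}.{minor}: bfloat16 unsupported here, use float16")
else:
    print(f"Compute capability {major}.{minor}: bfloat16 is supported")
```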
More logs
INFO 07-23 03:34:51 selector.py:65] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 07-23 03:34:51 selector.py:33] Using XFormers backend.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 365, in from_engine_args
engine = cls(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 323, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 148, in __init__
self.model_executor = executor_class(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
self._init_executor()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 22, in _init_executor
self._init_non_spec_worker()
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 50, in _init_non_spec_worker
self.driver_worker.init_device()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 100, in init_device
_check_if_gpu_supports_dtype(self.model_config.dtype)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 321, in _check_if_gpu_supports_dtype
raise ValueError(
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA TITAN RTX GPU has compute capability 7.5. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
[FAILED]
This happens when starting the container.
Thanks for reporting!
This seems to be a bug either with the NIM or the documentation. I'll raise this issue internally and see if support for chips older than Ampere is intended.
A bug has been filed with the NIM team to make a decision on whether this should be a supported path.
@freemansoft I used a V100 GPU and encountered the same error as you. What you can do is go to your $LOCAL_NIM_CACHE folder and modify the torch_dtype to float16 instead of bfloat16 in the config.json file.
For example, the config.json file is located in my $LOCAL_NIM_CACHE/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf directory.
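Something like this works (a sketch only; the path is from my cache layout, adjust it to yours):

```python
# Sketch: flip torch_dtype in the downloaded snapshot's config.json from bfloat16
# to float16. The path below assumes my cache layout; adjust it to your setup.
import json, os

cfg_path = os.path.expandvars(
    "$LOCAL_NIM_CACHE/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf/config.json"
)
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["torch_dtype"] = "float16"  # was "bfloat16"
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```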
That worked for loading the model 👍 I have not tested it yet. I don't know if there will be any behavior issues from the reduced precision of float16 vs bfloat16. The model definition is here: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
On my remote Ubuntu system....
- The actual variable name is NGC_HOME as defined in variables.env. $NGC_HOME was ~/.cache/nvidia/nvidia-nims
- The model config.json was in $NGC_HOME/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf/config.json (a quick verification sketch follows this list)
- This expanded to ~/.cache/nvidia/nvidia-nims/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf/config.json
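A quick check that the edit took, assuming the expanded path above, before restarting the container:

```python
# Sanity check: confirm the snapshot's config.json now advertises float16
# (path expanded from NGC_HOME on my machine; adjust to yours).
import json, os

cfg = os.path.expanduser(
    "~/.cache/nvidia/nvidia-nims/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf/config.json"
)
with open(cfg) as f:
    print(json.load(f).get("torch_dtype"))  # expect "float16" after the edit
```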
After loading the model
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA TITAN RTX Off | 00000000:05:00.0 Off | N/A |
| 41% 31C P8 5W / 280W | 20214MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1491 G /usr/lib/xorg/Xorg 69MiB |
| 0 N/A N/A 1904 G /usr/bin/gnome-shell 59MiB |
| 0 N/A N/A 71037 C python3 20072MiB |
+-----------------------------------------------------------------------------------------+
The latest version of NIM Anywhere (Aug 6 or Aug 7) updated some packages, so this now works on my Ubuntu machine with the type coercion from bfloat16 to float16 per the OP's technique. My simple queries ran fine, even though type coercion is generally a bad idea because of precision differences.
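For anyone curious why the coercion is risky, a quick comparison of the two formats (assumes PyTorch is installed; this is illustrative, not part of the NIM):

```python
# Why coercing bfloat16 weights to float16 is not a free swap: float16 has more
# mantissa precision but a much narrower exponent range, so large values can overflow.
import torch

for dt in (torch.bfloat16, torch.float16):
    info = torch.finfo(dt)
    print(dt, "max:", info.max, "smallest normal:", info.tiny, "eps:", info.eps)
# bfloat16 max is ~3.4e38 while float16 tops out at 65504; anything that overflows
# float16 becomes inf, which is the usual source of behavior differences.
```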