
Does not run on Titan RTX - demands Bfloat16

Open · freemansoft opened this issue on Jul 20 '24

Received this message when starting the LLM.

ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA TITAN RTX GPU has compute capability 7.5. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
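
For context, the capability gap the error refers to can be double-checked outside the NIM. A minimal sketch, assuming PyTorch is installed on the host (the Titan RTX should report 7.5):

# Quick check: bfloat16 needs compute capability >= 8.0 (Ampere or newer).
import torch

major, minor = torch.cuda.get_device_capability(0)  # Titan RTX -> (7, 5)
print(f"compute capability {major}.{minor}")
print("bfloat16 supported" if (major, minor) >= (8, 0) else "bfloat16 not supported, float16 only")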

Is there a way to

  1. Set the desired precision?
  2. Select a different model that works on my Titan RTX? What models could be swapped in for LLM_NIM_0_MODEL=meta/llama3-8b-instruct?

The model is run with

INFO 07-20 23:02:05.815 ngc_injector.py:146] Profile metadata: tp: 1
INFO 07-20 23:02:05.815 ngc_injector.py:146] Profile metadata: feat_lora: false
INFO 07-20 23:02:05.815 ngc_injector.py:146] Profile metadata: precision: fp16
INFO 07-20 23:02:05.815 ngc_injector.py:146] Profile metadata: llm_engine: vllm
INFO 07-20 23:02:05.815 ngc_injector.py:166] Preparing model workspace. This step might download additional files to run the model.
INFO 07-20 23:02:08.174 ngc_injector.py:172] Model workspace is now ready. It took 2.359 seconds
INFO 07-20 23:02:08.180 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-0_0f1rb6', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-0_0f1rb6', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)

Is there any way to change the type? I tried running this with other models that were described as able to run with float16, but startup always seems to choose bfloat16. Ref: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md
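
For reference, standalone vLLM does accept an explicit dtype; whether the NIM container exposes that flag is exactly the open question here. A minimal sketch with plain vLLM (not the NIM), assuming the model weights are reachable via Hugging Face:

# Plain vLLM outside the NIM: request float16 explicitly so a pre-Ampere GPU
# (compute capability < 8.0) is not forced onto bfloat16.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="float16")  # "half" also works
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)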

freemansoft commented on Jul 20 '24

More logs


INFO 07-23 03:34:51 selector.py:65] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 07-23 03:34:51 selector.py:33] Using XFormers backend.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 498, in <module>
    engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 412, in from_engine_args
    engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 365, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 323, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 148, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 22, in _init_executor
    self._init_non_spec_worker()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 50, in _init_non_spec_worker
    self.driver_worker.init_device()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 100, in init_device
    _check_if_gpu_supports_dtype(self.model_config.dtype)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 321, in _check_if_gpu_supports_dtype
    raise ValueError(
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA TITAN RTX GPU has compute capability 7.5. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half.
[FAILED]
Starting the container

freemansoft commented on Jul 23 '24

Thanks for reporting!

This seems to be a bug either with the NIM or the documentation. I'll raise this issue internally and see if support for chips older than Ampere is intended.

rmkraus commented on Jul 29 '24

A bug has been filed with the NIM team to make a decision on whether this should be a supported path.

rmkraus commented on Jul 30 '24

@freemansoft I used a V100 GPU and encountered the same error as you. What you can do is go to your $LOCAL_NIM_CACHE folder and modify the torch_dtype to float16 instead of bfloat16 in the config.json file.

For example, the config.json file is located in my $LOCAL_NIM_CACHE/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf directory.
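
A minimal sketch of that edit, assuming the snapshot path above (the exact location can differ between NIM versions and cache layouts):

# Sketch: flip torch_dtype in the cached model config so a pre-Ampere GPU loads the model in float16.
# Adjust the path to wherever your config.json actually lives.
import json, os

cfg = os.path.expandvars(
    "$LOCAL_NIM_CACHE/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf/config.json"
)
with open(cfg) as f:
    data = json.load(f)
data["torch_dtype"] = "float16"  # was "bfloat16"
with open(cfg, "w") as f:
    json.dump(data, f, indent=2)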

JIA-HONG-CHU commented on Aug 02 '24

That worked for loading the model 👍 Have not tested it yet. Don't know if there will be any behavior issues with the reduced precision of float16 vs bfloat16. The model definition is here: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

On my remote Ubuntu system:

  1. The actual variable name is NGC_HOME, as defined in variables.env
  2. $NGC_HOME was ~/.cache/nvidia/nvidia-nims
  3. The model config.json was at $NGC_HOME/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf/config.json (a locate sketch follows this list)
  4. This expanded to ~/.cache/nvidia/nvidia-nims/ngc/hub/models--nim--meta--llama3-8b-instruct/snapshots/hf/config.json
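
Since the exact snapshot directory can vary between NIM versions, a hedged sketch for locating the file under NGC_HOME (the variable name and default path are from this setup and may differ on yours):

# Sketch: find cached config.json files under the NIM cache instead of hard-coding the snapshot path.
import os
from pathlib import Path

cache = Path(os.environ.get("NGC_HOME", os.path.expanduser("~/.cache/nvidia/nvidia-nims")))
for cfg in cache.glob("ngc/hub/models--*/snapshots/*/config.json"):
    print(cfg)  # edit torch_dtype in the matching file as described above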

After loading the model

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA TITAN RTX               Off |   00000000:05:00.0 Off |                  N/A |
| 41%   31C    P8              5W /  280W |   20214MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1491      G   /usr/lib/xorg/Xorg                             69MiB |
|    0   N/A  N/A      1904      G   /usr/bin/gnome-shell                           59MiB |
|    0   N/A  N/A     71037      C   python3                                     20072MiB |
+-----------------------------------------------------------------------------------------+

freemansoft commented on Aug 02 '24

The latest version of NIM Anywhere (Aug 6 or Aug 7) updated some packages, so this now works on my Ubuntu machine with the dtype coercion from bfloat16 to float16 per the technique described above. My simple queries ran fine, even though type coercion is generally a bad idea because of the precision differences.

freemansoft commented on Aug 08 '24