Can't seem to get Tabby to run on GPU
Describe the bug
(This might be me doing something wrong, and not a bug!)
I can't seem to get Tabby to run on my GPU (Radeon RX 6600 XT), neither with ROCm (which I believe is unsupported for my device) nor with Vulkan, which I believe should be supported.
Whenever I run Tabby (tabby serve --model DeepseekCoder-1.3B --chat-model Qwen2-1.5B-Instruct --device vulkan) and ask something in the Web UI, I check my CPU and GPU usage (using btop and amdgpu_top respectively) and see my CPU usage spiking with almost no effect on the GPU. (This is the case both when running v0.15.0-rc.2 and when compiling v0.15.0-rc.3 myself.)
If I try to use v0.14.0 (same options), I instead get this:
2024-08-08T11:19:34.328360Z WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: warning: see main README.md for information on enabling GPU BLAS support
Information about your version
See above.
Information about your GPU
From vulkaninfo:
Devices:
========
GPU0:
apiVersion = 1.3.270
driverVersion = 2.0.294
vendorID = 0x1002
deviceID = 0x73ff
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = AMD Radeon RX 6600 XT
driverID = DRIVER_ID_AMD_PROPRIETARY
driverName = AMD proprietary driver
driverInfo = (AMD proprietary shader compiler)
conformanceVersion = 1.3.3.1
deviceUUID = 00000000-2800-0000-0000-000000000000
driverUUID = 414d442d-4c49-4e55-582d-445256000000
GPU1:
apiVersion = 1.3.278
driverVersion = 24.1.5
vendorID = 0x1002
deviceID = 0x73ff
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = AMD Radeon RX 6600 XT (RADV NAVI23)
driverID = DRIVER_ID_MESA_RADV
driverName = radv
driverInfo = Mesa 24.1.5
conformanceVersion = 1.3.0.0
deviceUUID = 00000000-2800-0000-0000-000000000000
driverUUID = 414d442d-4d45-5341-2d44-525600000000
From rocminfo (if it helps):
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 5 3600 6-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 5 3600 6-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3600
BDFID: 0
Internal Node ID: 0
Compute Unit: 12
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32763224(0x1f3ed58) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32763224(0x1f3ed58) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32763224(0x1f3ed58) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1032
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 6600 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 2048(0x800) KB
L3: 32768(0x8000) KB
Chip ID: 29695(0x73ff)
ASIC Revision: 0(0x0)
Cacheline Size: 128(0x80)
Max Clock Freq. (MHz): 2900
BDFID: 10240
Internal Node ID: 1
Compute Unit: 32
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 118
SDMA engine uCode:: 76
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1032
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Additional context
Full terminal output when starting Tabby (v0.15.0-rc.3, compiled myself):
✓ $ cargo run --features vulkan --release serve --model DeepseekCoder-1.3B --chat-model Qwen2-1.5B-Instruct --device vulkan --parallelism 2
warning: function `tracing_context` is never used
--> ee/tabby-webserver/src/hub.rs:15:4
|
15 | fn tracing_context() -> tarpc::context::Context {
| ^^^^^^^^^^^^^^^
|
= note: `#[warn(dead_code)]` on by default
warning: `tabby-webserver` (lib) generated 1 warning
warning: function `chat_completions_utoipa` is never used
--> crates/tabby/src/routes/chat.rs:29:14
|
29 | pub async fn chat_completions_utoipa(_request: Json<serde_json::Value>) -> Statu...
| ^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(dead_code)]` on by default
warning: `tabby` (bin "tabby") generated 1 warning
Finished release [optimized] target(s) in 0.31s
Running `target/release/tabby serve --model DeepseekCoder-1.3B --chat-model Qwen2-1.5B-Instruct --device vulkan --parallelism 2`
2024-08-08T11:27:47.846355Z DEBUG tabby_common::config: crates/tabby-common/src/config.rs:35: Config file /home/USER/.tabby/config.toml not found, apply default configuration
2024-08-08T11:27:48.439867Z DEBUG tabby::serve: crates/tabby/src/serve.rs:411: Starting server, this might take a few minutes...
2024-08-08T11:27:48.494017Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:124: Waiting for llama-server <embedding> to start...
2024-08-08T11:27:48.634321Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:132: llama-server <embedding> started successfully
2024-08-08T11:27:48.649300Z DEBUG tabby_common::config: crates/tabby-common/src/config.rs:35: Config file /home/USER/.tabby/config.toml not found, apply default configuration
2024-08-08T11:27:48.649713Z DEBUG tabby::services::tantivy: crates/tabby/src/services/tantivy.rs:33: Index is ready, enabling search...
2024-08-08T11:27:48.948439Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:124: Waiting for llama-server <chat> to start...
2024-08-08T11:27:49.463618Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:132: llama-server <chat> started successfully
2024-08-08T11:27:49.659853Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:124: Waiting for llama-server <completion> to start...
2024-08-08T11:27:50.422822Z DEBUG llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:132: llama-server <completion> started successfully
████████╗ █████╗ ██████╗ ██████╗ ██╗ ██╗
╚══██╔══╝██╔══██╗██╔══██╗██╔══██╗╚██╗ ██╔╝
██║ ███████║██████╔╝██████╔╝ ╚████╔╝
██║ ██╔══██║██╔══██╗██╔══██╗ ╚██╔╝
██║ ██║ ██║██████╔╝██████╔╝ ██║
╚═╝ ╚═╝ ╚═╝╚═════╝ ╚═════╝ ╚═╝
📄 Version 0.15.0-rc.3
🚀 Listening at 0.0.0.0:8080
I got that too, but only with the prebuilt Tabby binary; when I compile 0.14.0 myself, it works.
By the way, ROCm does work on the RX 6600.
Tabby only supports a single GPU. To utilize multiple GPUs, you can start multiple Tabby instances and set CUDA_VISIBLE_DEVICES (for CUDA) or HIP_VISIBLE_DEVICES (for ROCm) accordingly. If a similar GPU is supported by ROCm, you can set the HSA_OVERRIDE_GFX_VERSION variable to its version: for example, 10.3.0 for RDNA2 and 11.0.0 for RDNA3.
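Concretely, that would look something like this (a sketch only; the device indices and ports are hypothetical, and tabby's --port flag is assumed):
# one Tabby instance per GPU, each on its own port
HIP_VISIBLE_DEVICES=0 tabby serve --model DeepseekCoder-1.3B --device rocm --port 8080
HIP_VISIBLE_DEVICES=1 tabby serve --model DeepseekCoder-1.3B --device rocm --port 8081
# the RX 6600 XT is gfx1032 (RDNA2), so spoof the supported RDNA2 target:
HSA_OVERRIDE_GFX_VERSION=10.3.0 tabby serve --model DeepseekCoder-1.3B --device rocm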
Right! I never tried to build v0.14.0 myself. I will see if I have time to try that later. If so, I will come back with the results.
I realized I didn't compile for ROCm before, only for Vulkan. When I try to compile with --features=rocm, I get:
error: failed to run custom build command for `llama-cpp-server v0.14.0 (/tmp/tabby-src/crates/llama-cpp-server)`
Caused by:
process didn't exit successfully: `/tmp/tabby-src/target/release/build/llama-cpp-server-1b3738a0281592c2/build-script-build` (exit status: 101)
--- stdout
CMAKE_TOOLCHAIN_FILE_x86_64-unknown-linux-gnu = None
CMAKE_TOOLCHAIN_FILE_x86_64_unknown_linux_gnu = None
HOST_CMAKE_TOOLCHAIN_FILE = None
CMAKE_TOOLCHAIN_FILE = None
CMAKE_GENERATOR_x86_64-unknown-linux-gnu = None
CMAKE_GENERATOR_x86_64_unknown_linux_gnu = None
HOST_CMAKE_GENERATOR = None
CMAKE_GENERATOR = None
CMAKE_PREFIX_PATH_x86_64-unknown-linux-gnu = None
CMAKE_PREFIX_PATH_x86_64_unknown_linux_gnu = None
HOST_CMAKE_PREFIX_PATH = None
CMAKE_PREFIX_PATH = None
CMAKE_x86_64-unknown-linux-gnu = None
CMAKE_x86_64_unknown_linux_gnu = None
HOST_CMAKE = None
CMAKE = None
running: cd "/tmp/tabby-src/target/release/build/llama-cpp-server-40e8fadd415341a4/out/build" && CMAKE_PREFIX_PATH="" "cmake" "/tmp/tabby-src/crates/llama-cpp-server/./llama.cpp" "-DLLAMA_NATIVE=OFF" "-DBUILD_SHARED_LIBS=OFF" "-DINS_ENB=ON" "-DLLAMA_HIPBLAS=ON" "-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang" "-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++" "-DAMDGPU_TARGETS=gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1031;gfx1100;gfx1101;gfx1102;gfx1103" "-DCMAKE_INSTALL_PREFIX=/tmp/tabby-src/target/release/build/llama-cpp-server-40e8fadd415341a4/out" "-DCMAKE_C_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_CXX_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_ASM_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_ASM_COMPILER=/usr/bin/cc" "-DCMAKE_BUILD_TYPE=Release"
-- The C compiler identification is unknown
-- The CXX compiler identification is unknown
-- Configuring incomplete, errors occurred!
--- stderr
CMake Error at CMakeLists.txt:2 (project):
The CMAKE_C_COMPILER:
/opt/rocm/llvm/bin/clang
is not a full path to an existing compiler tool.
Tell CMake where to find the compiler by setting either the environment
variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
the compiler, or to the compiler name if it is in the PATH.
CMake Error at CMakeLists.txt:2 (project):
The CMAKE_CXX_COMPILER:
/opt/rocm/llvm/bin/clang++
is not a full path to an existing compiler tool.
Tell CMake where to find the compiler by setting either the environment
variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
to the compiler, or to the compiler name if it is in the PATH.
thread 'main' panicked at /home/USER/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cmake-0.1.50/src/lib.rs:1098:5:
command did not execute successfully, got: exit status: 1
build script failed, must exit now
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I use Gentoo and I should (as far as I know) have ROCm installed, but maybe I am missing something... I'll have to look into this more.
(This happens on both v0.14.0 and v0.15.0-rc.3...)
A bit of a hacky solution that at least gets Tabby to compile:
sudo mkdir /opt/rocm
sudo ln -sv /usr/lib/llvm/18 /opt/rocm/llvm
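This works because the build script hardcodes -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang (visible in the cmake invocation above), so setting CC/CXX would not override it; the symlink just makes that hardcoded path resolve. A quick sanity check that the path now points at a real compiler:
/opt/rocm/llvm/bin/clang --version
/opt/rocm/llvm/bin/clang++ --version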
However, aside from the chat seeming slower to generate responses, there's no difference. It still seems to run on the CPU when using:
HSA_OVERRIDE_GFX_VERSION=10.3.0 cargo run --features rocm --release serve --model DeepseekCoder-1.3B --chat-model Qwen2-1.5B-Instruct --device rocm
I also tried adding HCC_AMDGPU_TARGET=gfx1032 to make it target my GPU rather than my CPU. I don't know if that is how that environment variable is supposed to be used, but it did not work.
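(For reference, this is roughly how I'm checking GPU utilization while a chat request is in flight; rocm-smi is assumed to be available alongside amdgpu_top:)
watch -n1 rocm-smi --showuse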
The fix for rocm is merged, the vulkan fix is probably similar (I haven't tested it): https://github.com/TabbyML/tabby/issues/2810#issuecomment-2283356626
Does this fix also work on Windows? I'm running Tabby 0.18.0, and using --device rocm still ends up running the models on my CPU, not the GPU.
We are not distributing Windows binaries for ROCm at the moment, so it won't work.
I recommend using the Vulkan backend on Windows if you have a non-NVIDIA GPU.
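For example, with the Vulkan Windows build, something along these lines should work (the executable name is assumed to match the release archive):
tabby.exe serve --model DeepseekCoder-1.3B --chat-model Qwen2-1.5B-Instruct --device vulkan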