
Crash on AMD graphics card on Windows

Open tempstudio opened this issue 1 year ago • 28 comments

Describe the bug

Crash with abort when trying to use the AMD graphics card in the editor. Model is mistral-7b-instruct-v0.2.Q4_K_M.gguf.

ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.30 MiB
d3d12: upload buffer was full! Waited for COPY queue for 1.118 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.902 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.897 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.896 ms.
d3d12: upload buffer was full! Waited for COPY queue for 0.901 ms.
[Licensing::Client] Successfully resolved entitlement details
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4095.05 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..............................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.24 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 296.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
[1722650470] warming up the model with an empty run
ggml_cuda_compute_forward: RMS_NORM failed
CUDA error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml-cuda.cu:13061
err
Asset Pipeline Refresh (id=5fe1348313ec9e4439edb8aa2e9d608c): Total: 0.010 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)
Asset Pipeline Refresh (id=a398558039bd1ba4a8f2fc04f6154810): Total: 0.007 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)

Steps to reproduce

No response

LLMUnity version

2.0.3

Operating System

Windows

tempstudio avatar Aug 03 '24 02:08 tempstudio

It seems to be an open llama.cpp issue (issue 1, issue 2)

amakropoulos avatar Aug 13 '24 15:08 amakropoulos

@tempstudio could you check if the issue remains with the latest release (v2.2.0)?

amakropoulos avatar Aug 27 '24 12:08 amakropoulos

I see the same issue with 2.2.1

` (Filename: ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs Line: 137)

INFO [ init] build info | tid="27560" timestamp=1725497899 build=3623 commit="436787f1" INFO [ init] system info | tid="27560" timestamp=1725497899 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from E:/.../Assets/StreamingAssets/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2 llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 15 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama Loaded scene 'Temp/__Backupscenes/0.backup' Deserialize: 5.726 ms Integration: 341.064 ms Integration of assets: 0.002 ms Thread Wait Time: 0.004 ms Total Operation Time: 346.796 ms llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... 
llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_K: 193 tensors llama_model_loader: - type q6_K: 33 tensors llm_load_vocab: special tokens cache size = 3 INFO [ init] build info | tid="27560" timestamp=1725497899 build=3623 commit="436787f1" INFO [ init] system info | tid="27560" timestamp=1725497899 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " UnityEngine.StackTraceUtility:ExtractStackTrace () UnityEngine.DebugLogHandler:LogFormat (UnityEngine.LogType,UnityEngine.Object,string,object[]) UnityEngine.Logger:Log (UnityEngine.LogType,object) UnityEngine.Debug:LogWarning (object) LLMUnity.LLMUnitySetup:LogWarning (string) (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs:143) LLMUnity.StreamWrapper:Update () (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMLib.cs:66) LLMUnity.LLM:Update () (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLM.cs:483)

(Filename: ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs Line: 143)

llm_load_vocab: token to piece cache size = 0.1637 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 4.07 GiB (4.83 BPW) llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2 llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_print_meta: max token length = 48 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no llm_load_tensors: ggml ctx size = 0.27 MiB d3d12: upload buffer was full! Waited for COPY queue for 1.133 ms. d3d12: upload buffer was full! Waited for COPY queue for 0.901 ms. d3d12: upload buffer was full! Waited for COPY queue for 0.895 ms. d3d12: upload buffer was full! Waited for COPY queue for 0.905 ms. d3d12: upload buffer was full! Waited for COPY queue for 0.899 ms. llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: ROCm0 buffer size = 4095.05 MiB llm_load_tensors: CPU buffer size = 70.31 MiB ...........[Licensing::Client] Successfully resolved entitlement details ................................................................................... 
llama_new_context_with_model: n_ctx = 4096 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB llama_new_context_with_model: ROCm_Host output buffer size = 0.24 MiB llama_new_context_with_model: ROCm0 compute buffer size = 296.00 MiB llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 ggml_cuda_compute_forward: RMS_NORM failed CUDA error: invalid device function current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:16369 err D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:14155: CUDA error Asset Pipeline Refresh (id=2eaefcb7421ebc541b64109c390c5c15): Total: 0.008 seconds - Initiated by RefreshV2(NoUpdateAssetOptions) `

tempstudio avatar Sep 05 '24 02:09 tempstudio

Thank you for testing! I can't implement support for this card myself because the issue is down to llama.cpp. I'll see if I can wrap around the error, however, so that Unity doesn't crash and you can use the GPU with Vulkan instead. I'll send you a build to try later 🙏

amakropoulos avatar Sep 05 '24 05:09 amakropoulos

Could you try the new build by changing the LlamaLib version here from v1.1.10 to v1.1.10-dev? You will also need to delete the undreamai-v1.1.10-llamacpp folder from Assets/StreamingAssets.

With this build it should skip the HIP build and use Vulkan instead 🤞
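For reference, the change is only the version string in Runtime/LLMUnitySetup.cs, roughly like this (an illustrative excerpt, not the full file; the exact line may differ between releases):

// in Runtime/LLMUnitySetup.cs, inside the LLMUnitySetup class (illustrative excerpt)
// version of the LlamaLib native binaries that LLMUnity downloads and loads
public static string LlamaLibVersion = "v1.1.10-dev";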

amakropoulos avatar Sep 05 '24 14:09 amakropoulos

Apologies: I was using the wrong binaries yesterday, so even though the C# code was 2.2.1, the native code in StreamingAssets was probably still the old version. I deleted the "StreamingAssets" directory and tried it again.

It didn't crash this time after I deleted things from StreamingAssets and reinstalled the package, but I'm pretty sure it's using the CPU: generation is very slow and CPU usage is high.

Server command: -m "C:/Users/.../AppData/Roaming/LLMUnity/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" -c 4096 -b 512 --log-disable -np 1 -ngl -1
UnityEngine.StackTraceUtility:ExtractStackTrace ()
UnityEngine.DebugLogHandler:LogFormat (UnityEngine.LogType,UnityEngine.Object,string,object[])
UnityEngine.Logger:Log (UnityEngine.LogType,object)
UnityEngine.Debug:Log (object)
LLMUnity.LLMUnitySetup:Log (string) (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs:137)
LLMUnity.LLM:StartLLMServer (string) (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLM.cs:373)
LLMUnity.LLM/<>c__DisplayClass45_0:<Awake>b__0 () (at ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLM.cs:119)
System.Threading.Tasks.Task:InnerInvoke ()
System.Threading.Tasks.Task:Execute ()
System.Threading.Tasks.Task:ExecutionContextCallback (object)
System.Threading.ExecutionContext:RunInternal (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool)
System.Threading.ExecutionContext:Run (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool)
System.Threading.Tasks.Task:ExecuteWithThreadLocal (System.Threading.Tasks.Task&)
System.Threading.Tasks.Task:ExecuteEntry (bool)
System.Threading.Tasks.Task:System.Threading.IThreadPoolWorkItem.ExecuteWorkItem ()
System.Threading.ThreadPoolWorkQueue:Dispatch ()
System.Threading._ThreadPoolWaitCallback:PerformWaitCallback ()

(Filename: ./Library/PackageCache/ai.undream.llm@d3d5d7fd31/Runtime/LLMUnitySetup.cs Line: 137)

warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support

...

llm_load_tensors: CPU buffer size = 4685.30 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.98 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1

Giving v1.1.10-dev a try now.

tempstudio avatar Sep 06 '24 00:09 tempstudio

The behavior is the same with v1.1.10-dev.

tempstudio avatar Sep 06 '24 00:09 tempstudio

You are using num GPU layers = -1, which will not use the GPU. Could you try e.g. with 10? There should be debug messages that start with "Tried architecture"; can you post those as well?
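For reference, the layer count is the "Num GPU Layers" setting on the LLM component; it can also be set from code, roughly like this (a minimal sketch; I'm assuming the public field is named numGPULayers, matching the inspector label, and that it is set before the LLM starts):

using LLMUnity;
using UnityEngine;

public class GpuLayerTest : MonoBehaviour
{
    public LLM llm;   // assign the LLM component from the scene in the inspector

    void Awake()
    {
        // try a moderate number of layers first instead of -1 or 9999
        llm.numGPULayers = 10;
    }
}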

amakropoulos avatar Sep 06 '24 04:09 amakropoulos

I thought -1 would mean all / max? With 9999 GPU layers it crashed with the same error even on v1.1.10-dev :/ I think it's been the same issue all along.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no llm_load_tensors: ggml ctx size = 0.27 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: ROCm0 buffer size = 4403.50 MiB llm_load_tensors: CPU buffer size = 281.81 MiB ........................................Asset Pipeline Refresh (id=2020b226d14d319468ddb810101aa4ca): Total: 0.008 seconds - Initiated by RefreshV2(NoUpdateAssetOptions) ............................................... llama_new_context_with_model: n_ctx = 4096 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB llama_new_context_with_model: ROCm_Host output buffer size = 0.98 MiB llama_new_context_with_model: ROCm0 compute buffer size = 258.50 MiB llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB llama_new_context_with_model: graph nodes = 903 llama_new_context_with_model: graph splits = 2 ggml_cuda_compute_forward: RMS_NORM failed CUDA error: invalid device function current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:16369 err D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:14155: CUDA error Asset Pipeline Refresh (id=7f5d46cd6ec704f4ba373546e19f8732): Total: 0.006 seconds - Initiated by RefreshV2(NoUpdateAssetOptions)

tempstudio avatar Sep 06 '24 23:09 tempstudio

Tried it with flash attention OFF and it's the same: ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no llm_load_tensors: ggml ctx size = 0.27 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: ROCm0 buffer size = 4403.50 MiB llm_load_tensors: CPU buffer size = 281.81 MiB ....................................................................................... llama_new_context_with_model: n_ctx = 4096 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: ROCm0 KV buffer size = 512.00 MiB llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB llama_new_context_with_model: ROCm_Host output buffer size = 0.98 MiB llama_new_context_with_model: ROCm0 compute buffer size = 296.00 MiB llama_new_context_with_model: ROCm_Host compute buffer size = 16.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 ggml_cuda_compute_forward: RMS_NORM failed CUDA error: invalid device function current device: 0, in function ggml_cuda_compute_forward at D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:16369 err D:/a/LlamaLib/LlamaLib/llama.cpp/ggml/src/ggml-cuda.cu:14155: CUDA error Unable to find style 'TemplatesPromo' in skin 'DarkSkin' Layout

tempstudio avatar Sep 06 '24 23:09 tempstudio

Thanks a lot! Could you do one more test with v1.1.10-dev2?

amakropoulos avatar Sep 08 '24 06:09 amakropoulos

A couple of problems I encountered with v1.1.10-dev2: First, the install didn't work; it just installed an empty folder, so I manually downloaded the entire zip and unzipped it into the StreamingAssets folder. After that, the same error happened. I then deleted the two "windows-cuda" folders from the directory and it crashed again. Finally, I deleted the "windows-hip" folder from the directory; it doesn't crash anymore, but it doesn't use the GPU. It seems it's not even going to try Vulkan.

tempstudio avatar Sep 08 '24 17:09 tempstudio

Thanks a lot. I have fixed the issue with the empty folder in v2.2.2. It seems I can't do much at the moment for the specific GPU unfortunately. I'll keep an eye on the llama.cpp updates and let you know once I find a solution.

amakropoulos avatar Sep 08 '24 19:09 amakropoulos

I'm going through some issues and I have an idea. I may have to specify your GPU architecture in the HIP build.

amakropoulos avatar Sep 08 '24 19:09 amakropoulos

Could you try the v1.1.11 build? I have specifically set the AMD architectures, including the one of your GPU (gfx1030).

amakropoulos avatar Sep 09 '24 21:09 amakropoulos

The good news is that it doesn't crash anymore. The bad news is that the performance is much worse than CPU only. Running the chat pegs GPU usage at 100% and it stutters, and it takes extremely long to generate anything. I recall running this with llamafile and it was at least 20x faster than this (this is with only 1 layer on the GPU; using all layers makes the OS unresponsive):

INFO [ print_timings] prompt eval time = 192189.92 ms / 399 tokens ( 481.68 ms per token, 2.08 tokens per second) | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 t_prompt_processing=192189.92 n_prompt_tokens_processed=399 t_token=481.67899749373436 n_tokens_second=2.0760714193543555
INFO [ print_timings] generation eval time = 24258.31 ms / 41 runs ( 591.67 ms per token, 1.69 tokens per second) | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 t_token_generation=24258.305 n_decoded=41 t_token=591.6659756097561 n_tokens_second=1.6901428191293661
INFO [ print_timings] total time = 216448.23 ms | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 t_prompt_processing=192189.92 t_token_generation=24258.305 t_total=216448.225
INFO [ update_slots] slot released | tid="5292" timestamp=1725926634 id_slot=0 id_task=1 n_ctx=2048 n_past=439 n_system_tokens=0 n_cache_tokens=439 truncated=false
INFO [ update_slots] all slots are idle | tid="5292" timestamp=1725926634
INFO [ update_slots] all slots are idle | tid="5292" timestamp=1725926634

I have updated to the latest drivers and also just restarted my system.

tempstudio avatar Sep 10 '24 00:09 tempstudio

Yes! That works! What happens if you use more layers but not extreme ones e.g. 10, 25, 50?

amakropoulos avatar Sep 10 '24 04:09 amakropoulos

Performance is equally bad with 10/30 layers.

10 layers: prompt processing 2 tk/s, generation 1 tk/s
30 layers: prompt processing 2 tk/s, generation 0.5 tk/s

tempstudio avatar Sep 11 '24 01:09 tempstudio

Is there any possibility of the performance issue being fixed in LlamaLib? If not, is it possible to provide a 2.x build that uses llamafile as a backend?

tempstudio avatar Sep 13 '24 00:09 tempstudio

I really doubt it is a problem of LlamaLib because I use and extend code directly from llama.cpp and llamafile.

This is an overview of the different libraries:

  • llama.cpp: the main implementation that all the libraries use. For GPUs it uses CUDA (Nvidia) and CUDA+HIP (AMD). This is the fastest option, but including CUDA increases the build size to 1 GB per build; to support most Nvidia GPUs I include both CUDA 11 and 12 builds, which would mean 2 GB.
  • llamafile: packages and serves llama.cpp as a single file for all OSes. For GPUs it uses CUDA (Nvidia) and CUDA+HIP (AMD) if the system already has CUDA installed (rare, unless you are into AI). Otherwise it uses its own tinyBLAS implementation, whose speed is lower than or equal to CUDA (from version 0.7 onwards). The benefit is that it adds less than 100 MB to the build.
  • LlamaLib: extends llama.cpp with the functionality needed to use it as a Unity / C# library and builds binaries for the different architectures. I use the llama.cpp implementation, but specifically for GPUs I patch it to use tinyBLAS to keep the build size small.

The source of the speed issue is most probably in the tinyBLAS implementation of llamafile. If you have CUDA installed, or use a llamafile version earlier than 0.7, llamafile will still use CUDA, which will give you the speed boost.

amakropoulos avatar Sep 13 '24 05:09 amakropoulos

There are reasons why I don't use llamafile anymore, although I love the project:

  • it has antivirus issues (false positives), because it builds llama.cpp on the fly directly on the system that uses it. I actually had to whitelist it myself for McAfee antivirus.
  • it can only be included as a server, not as a DLL, and that can only be used in IL2CPP builds. Also, someone could create a similar server locally and take over your game; I have spent a lot of time on workarounds to try and prevent that.
  • it can't be used for mobile deployment (Android / iOS).

For these reasons I can't bring it back to the project. I'd prefer to find the source of the problem and solve it there. It is tricky for me to work on AMD because I don't have an AMD GPU and there is none available on the cloud that is supported.

You could try the following to understand more about the issue using the latest llamafile.

Check the timings for both cases:

llamafile without CUDA

  • Uninstall CUDA
  • Delete the .llamafile folder from your system. It will be on your user directory (C:/Users/<USER>)
  • From cmd:
    • cd inside the directory that contains llamafile
    • run llamafile-0.8.13.exe -m <path_to_model> -ngl 10 -p "to be or" --nocompile --tinyblas

llamafile with CUDA

  • Install CUDA
  • Delete the .llamafile folder from your system. It will be on your user directory (C:/Users/<USER>)
  • From cmd:
    • cd inside the directory that contains llamafile
    • run llamafile-0.8.13.exe -m <path_to_model> -ngl 10 -p "to be or"

Then we could find out which implementation is the culprit.

amakropoulos avatar Sep 13 '24 05:09 amakropoulos

I will give those a try. Can you build LlamaLib into a command-line standalone so that I can test that too, just in case there's something wonky going on with GPU resource sharing between the AI and Unity?

tempstudio avatar Sep 13 '24 14:09 tempstudio

Here is the performance with tinyBLAS. I don't believe the CUDA run is needed as I'm using an AMD system and it doesn't support CUDA. I will be very happy if I can get this type of performance inside Unity.

.\llamafile-0.8.13.exe -m .\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 99 -p "to be or" --nocompile --tinyblas -c 2048

llama_print_timings: load time = 2364.86 ms
llama_print_timings: sample time = 55.42 ms / 773 runs ( 0.07 ms per token, 13948.79 tokens per second)
llama_print_timings: prompt eval time = 36.01 ms / 4 tokens ( 9.00 ms per token, 111.08 tokens per second)
llama_print_timings: eval time = 22152.88 ms / 772 runs ( 28.70 ms per token, 34.85 tokens per second)
llama_print_timings: total time = 22420.02 ms / 776 tokens
Log end

More logs that might be helpful:

import_cuda_impl: initializing gpu module... get_rocm_bin_path: note: amdclang++.exe not found on $PATH get_rocm_bin_path: note: /D/Drivers/ROCM/5.7//bin/amdclang++.exe does not exist get_rocm_bin_path: note: clang++.exe not found on $PATH link_cuda_dso: note: dynamically linking /C/Users/Tony/.llamafile/v/0.8.13/ggml-rocm.dll ggml_cuda_link: welcome to ROCm SDK with tinyBLAS link_cuda_dso: GPU support loaded llm_load_print_meta: model size = 4.58 GiB (4.89 BPW) llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' llm_load_print_meta: max token length = 256 ... ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no llm_load_tensors: ggml ctx size = 0.32 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloaded 32/33 layers to GPU llm_load_tensors: ROCm0 buffer size = 3992.51 MiB llm_load_tensors: CPU buffer size = 4685.30 MiB ....................................................................................... llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: ROCm0 KV buffer size = 256.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: ROCm_Host output buffer size = 0.49 MiB llama_new_context_with_model: ROCm0 compute buffer size = 669.48 MiB llama_new_context_with_model: ROCm_Host compute buffer size = 12.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 4

Another piece of info: during execution, Task Manager shows the GPU usage at 1% instead of the 99% I see when using LlamaLib. This might be inaccurate.

tempstudio avatar Sep 13 '24 23:09 tempstudio

FYI, I got llama.cpp's Vulkan build to work (I needed to set GGML_VK_VISIBLE_DEVICES=0) and the timings look like this:

llama_perf_sampler_print: sampling time = 63.40 ms / 780 runs ( 0.08 ms per token, 12303.03 tokens per second)
llama_perf_context_print: load time = 2719.64 ms
llama_perf_context_print: prompt eval time = 184.44 ms / 4 tokens ( 46.11 ms per token, 21.69 tokens per second)
llama_perf_context_print: eval time = 12409.56 ms / 775 runs ( 16.01 ms per token, 62.45 tokens per second)
llama_perf_context_print: total time = 12738.39 ms / 779 tokens

So it's (potentially) faster to run Vulkan than HIP with tinyBLAS. Maybe that's an easier thing to get working than HIP?

tempstudio avatar Sep 17 '24 01:09 tempstudio

Thanks for all the testing! I have already included Vulkan as a fallback, but it is only called if HIP doesn't work. You can switch to it if you disable these 2 lines: https://github.com/undreamai/LLMUnity/blob/main/Runtime/LLMLib.cs#L368 https://github.com/undreamai/LLMUnity/blob/main/Runtime/LLMLib.cs#L374 Could you check if that works better?

amakropoulos avatar Sep 17 '24 05:09 amakropoulos

Could you also try the following to see if the build works at the same speed as tinyBLAS?

  • Setup
  • from command line:
    • cd inside the directory, and inside the windows-hip directory
    • run undreamai_server.exe -m <path_to_Llama-3.1> -ngl 99 -c 2048 --port 13333 --template "llama3 chat"
  • You can then use it from Unity as a remote server (a code sketch follows this list)
    • Open the SimpleInteraction sample
    • Delete the LLM GameObject
    • Enable the LLMCharacter GameObject Remote flag
    • Run the scene and start a chat
  • Check from command line the timings
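For the remote step, the same configuration can also be done from code, roughly like this (a sketch; I'm assuming the LLMCharacter fields are exposed as remote, host and port, matching the inspector names):

using LLMUnity;
using UnityEngine;

public class RemoteLLMSetup : MonoBehaviour
{
    public LLMCharacter llmCharacter;   // the LLMCharacter from the SimpleInteraction sample

    void Awake()
    {
        // talk to the externally started undreamai_server.exe instead of a local LLM GameObject
        llmCharacter.remote = true;
        llmCharacter.host = "localhost";
        llmCharacter.port = 13333;   // must match the --port passed to the server
    }
}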

amakropoulos avatar Sep 17 '24 05:09 amakropoulos

Could we maybe have a call to resolve this? It would be really helpful! You can find me on the Discord server.

amakropoulos avatar Sep 17 '24 05:09 amakropoulos

(1) Vulkan doesn't work because of this problem: it detects the same graphics card twice and then fails to load: https://github.com/ggerganov/llama.cpp/issues/9516 I tried to use the C# API to set environment variables, but that behaves very strangely: it only seems to take effect after a full restart of the editor. So it doesn't work, and it will keep refusing to work even though

UnityEngine.Debug.Log(Environment.GetEnvironmentVariable("GGML_VK_VISIBLE_DEVICES"));

prints 0, until the editor and the Unity Hub are restarted. This isn't going to fly for a production build.
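For completeness, this is roughly how I set it from code (Environment.SetEnvironmentVariable is the standard .NET call; the problem is that the native library only seems to pick it up after a full restart):

using System;
using UnityEngine;

public class VulkanDeviceWorkaround : MonoBehaviour
{
    void Awake()
    {
        // restrict the Vulkan backend to the first device before the LLM starts
        Environment.SetEnvironmentVariable("GGML_VK_VISIBLE_DEVICES", "0");
        // prints 0 right away, but the setting only takes effect after restarting the editor and Unity Hub
        Debug.Log(Environment.GetEnvironmentVariable("GGML_VK_VISIBLE_DEVICES"));
    }
}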

(2) The performance of the HIP server is as bad as it is in the editor:

INFO [           print_timings] prompt eval time     =   90492.37 ms /   195 tokens (  464.06 ms per token,     2.15 tokens per second) | tid="2696" timestamp=1726616971 id_slot=0 id_task=0 t_prompt_processing=90492.37 n_prompt_tokens_processed=195 t_token=464.06343589743585 n_tokens_second=2.154877809035171
INFO [           print_timings] generation eval time =   72538.86 ms /    45 runs   ( 1611.97 ms per token,     0.62 tokens per second) | tid="2696" timestamp=1726616971 id_slot=0 id_task=0 t_token_generation=72538.859 n_decoded=45 t_token=1611.9746444444443 n_tokens_second=0.6203571522954339

(3) The Vulkan server works with the right environment variable. The performance of the Vulkan server matches llama.cpp.

tempstudio avatar Sep 18 '24 00:09 tempstudio

@tempstudio I visited this issue again. I checked the open llamafile issues and found one about using HIP > 5.5, so I recompiled the llama.cpp library with HIP 5.5.

Could you try the latest LLMUnity version and change this line: https://github.com/undreamai/LLMUnity/blob/main/Runtime/LLMUnitySetup.cs#L105 to public static string LlamaLibVersion = "v1.1.12-dev"; to see if it works with your AMD card?

amakropoulos avatar Nov 11 '24 12:11 amakropoulos