
HIP SDK with AMD iGPU rocBLAS error

Open atsetogl opened this issue 1 year ago • 7 comments

I am using a gfx1103 and trying to run llama.cpp on Windows.

Steps done:

  • installed HIP SDK
  • installed perl & ninja
  • created the environment variable: set HSA_OVERRIDE_GFX_VERSION=11.0.0 (since gfx1103 is normally not supported by the HIP SDK)
  • built llama.cpp:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
set PATH=%HIP_PATH%\bin;%PATH%
mkdir build
cd build
cmake -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ..
cmake --build .

Note: I used -DAMDGPU_TARGETS=gfx1100 together with HSA_OVERRIDE_GFX_VERSION=11.0.0, since gfx1103 is not supported by the HIP SDK.

  • tried running main with this command: main.exe -m <model>.gguf -p <prompt> -n 400 -ngl 99 -e. However, I get the following error:
Log start
main: build = 2612 (1b496a74)
main: built with  for x86_64-pc-windows-msvc
main: seed  = 1712338084
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ../../../mymodels/llama-13b.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = meta-llama-13b
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q8_0:  282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 12.88 GiB (8.50 BPW)
llm_load_print_meta: general.name     = meta-llama-13b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'

rocBLAS error: Cannot read C:\Program Files\AMD\ROCm\5.7\bin\/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1103

rocBLAS error: Could not initialize Tensile host:
regex_error(error_backref): The expression contained an invalid back reference.

In other words, even though I specify -DAMDGPU_TARGETS=gfx1100 and HSA_OVERRIDE_GFX_VERSION=11.0.0, the program still tries to use gfx1103, which of course fails (a quick way to check which targets the installed rocBLAS actually ships is sketched below).

Could you please help me with that?

Thanks in advance!
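
For reference, the rocBLAS bundled with the HIP SDK only ships Tensile kernel files for its officially supported targets. A minimal batch sketch for checking which gfx targets are actually present, assuming the default install path from the error above:

REM List the Tensile libraries bundled with the installed rocBLAS.
REM %HIP_PATH% typically resolves to C:\Program Files\AMD\ROCm\5.7\ for this SDK version.
dir "%HIP_PATH%\bin\rocblas\library\TensileLibrary*"
REM If no file for your GPU arch (gfx1103 here) is listed, rocBLAS fails at runtime
REM with exactly the "Cannot read ... TensileLibrary.dat" error shown above.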

atsetogl avatar Apr 05 '24 17:04 atsetogl

Windows doesn't support HSA_OVERRIDE_GFX_VERSION and probably doesn't have its own equivalent. You would need to compile a Tensile library for gfx1103 for rocBLAS 5.7, or use Linux.
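
For anyone attempting that rebuild, a rough sketch under the assumption that rocBLAS 5.7's install.sh still accepts an architecture flag (on Windows the repository's rmake.py plays the same role, with its own options):

# Sketch only: build rocBLAS 5.7 with Tensile kernels for gfx1103 on Linux.
# Assumes a working ROCm 5.7 toolchain; -a selects the GPU architecture and
# -i installs the resulting package. Expect a very long build.
git clone -b release/rocm-rel-5.7 https://github.com/ROCm/rocBLAS.git
cd rocBLAS
./install.sh -a gfx1103 -i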

Engininja2 avatar Apr 05 '24 18:04 Engininja2

Windows doesn't support HSA_OVERRIDE_GFX_VERSION and probably doesn't have its own equivalent. You would need to compile a Tensile library for gfx1103 for rocBLAS 5.7, or use Linux.

https://github.com/ROCm/ROCm/discussions/2631#discussioncomment-7745585 mentions that on Linux it can work by just copying TensileLibrary_lazy_gfx1102.dat to TensileLibrary_lazy_gfx1103.dat (sudo cp /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1103.dat): "That's probably not the optimal solution, but it works flawlessly, so far."

But it seems like it doesn't work on Windows?
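
A hypothetical Windows analogue of that copy, untested and assuming the HIP SDK 5.7 layout from the error above actually contains the gfx1102 files, would be something like:

REM Hypothetical, untested Windows analogue of the Linux workaround quoted above.
REM Assumes the HIP SDK 5.7 install path from the earlier error message and that
REM gfx1102 Tensile files are present there; other per-arch files may also need
REM copying. Run from an elevated prompt.
cd "C:\Program Files\AMD\ROCm\5.7\bin\rocblas\library"
copy TensileLibrary_lazy_gfx1102.dat TensileLibrary_lazy_gfx1103.dat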

lihaofd avatar Apr 10 '24 11:04 lihaofd

Thank you very much for your reply! I managed to get it running; however, the output shows that something goes wrong :/

Here is the command I executed:

main.exe -m ../../../mymodels/llama-13b.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -ngl 99 -e

and this is the output:

Log start
main: build = 2637 (400d5d72)
main: built with  for x86_64-pc-windows-msvc
main: seed  = 1712761673
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ../../../mymodels/llama-13b.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = meta-llama-13b
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q8_0:  282 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 12.88 GiB (8.50 BPW)
llm_load_print_meta: general.name     = meta-llama-13b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon(TM) Graphics, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      ROCm0 buffer size = 13023.85 MiB
llm_load_tensors:        CPU buffer size =   166.02 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =    85.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    11.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 400, n_keep = 1


 Building a website can be done in 10 simple steps:
Step 1:BU#############################################################################################################ّ...ben”maleseg fuelccsegselitiatippRSKarjervcpet于 IBRSlet serial:#ней seg sa controlledetersourgTo1grbarambambippgelebarbe Mannbaratbugcl Hunter Mannatcivudaviaaviaegujenjencleps Kontletterhaltenkaautéamb graywOperationUPDATE Harr SabplateCPambnylev generalwasplatewarnholdignplate�expertAlertplategeneratedspectgenerated4ex Axagrpert' AlertpertpertpertgenerateMAoviїwasapatangeswasuouthbenpleazugewas...miautéahn Diamiimiwasedenvgeneratedracypecgenerated Ama Hernheswasgeneratedgeneratedbólangevron3ellemaühisiouthisebeni1… Hern4rubenHERHERalus1...g......7...)7VAR Hern Herncuruntangelёpsanzpostvaridl...) AfrbarrsPod blackariegotbarplateillonitzgategeneratedgeneratedgeneratedpagespodades Success MovресavidissesPDkesenc-Sivpecpec bars hochnad deltatsis tagsitesudдьSaeraskifanfanes Urseskena|nach enforesherr|bersever (' scalkk Lubnak Gor3 embarkolPD >>cal migr scal| scalplateGeneral lad|kalkolcur0 lac lac separatelypecDAT lac lac lacCldledenklpecaadiuntkalpeнейeszPD amazonPD Petythjar Ceteskstan
llama_print_timings:        load time =   13189.40 ms
llama_print_timings:      sample time =      31.87 ms /   400 runs   (    0.08 ms per token, 12551.78 tokens per second)
llama_print_timings: prompt eval time =    1029.69 ms /    19 tokens (   54.19 ms per token,    18.45 tokens per second)
llama_print_timings:        eval time =   86713.29 ms /   399 runs   (  217.33 ms per token,     4.60 tokens per second)
llama_print_timings:       total time =   88241.82 ms /   418 tokens
Log end

Have you ever seen that before?

Thank you in advance!

atsetogl avatar Apr 10 '24 15:04 atsetogl

Unlike RDNA2, where everything is more or less gfx1030, the RDNA3 ISAs have significant differences. The linked comment also notes that '(more than "-ngl 32" resulted in gibberish)'. You could try offloading one layer less than the maximum and setting --no-kv-offload, or try a 7B llama model with the same settings. One possibility is that the kernels for whatever matrix-multiplication shapes are needed for Mistral/LLaMA 7B happen to work on gfx1103 despite being compiled for gfx1102, but other kernels rocBLAS uses, e.g. for the KV cache, won't because of an architectural difference. Or, rather than using Tensile kernels for some shapes, rocBLAS might use the compiled-in ones, which won't be there for gfx1103 unless you recompile rocBLAS.
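
Concretely, applying that suggestion to the earlier 13B run would look something like this (sketch only; 40 is one layer less than the 41 offloaded above, and the value that still produces correct output may be lower in practice):

main.exe -m ../../../mymodels/llama-13b.Q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -ngl 40 --no-kv-offload -e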

Engininja2 avatar Apr 10 '24 18:04 Engininja2

Thank you very much for this detailed explanation! Indeed, after some testing I have to set --no-kv-offload and lower -ngl. In fact I get correct results with -ngl up to 24; the moment I set it to 25 the result is not right. I am still not sure why, though xD

I have yet to test with the LLaMA 7B.

atsetogl avatar Apr 10 '24 20:04 atsetogl

I also tested llama-2-7b-chat.Q4_0.gguf, and it only reached 0.11 tok/s on the 780M Radeon Graphics (gfx1103).... Is there any way to get better performance?

C:\code\llama.cpp\build\bin>.\main -m c:\code\llama-2-7b-chat.Q4_0.gguf -p "introduce shanghai" -n 128 --no-kv-offload -ngl 24 -e -t 4

Log start
main: build = 2647 (8228b66d)
main: built with  for x86_64-pc-windows-msvc
main: seed  = 1712791923
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from c:\code\llama-2-7b-chat.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon(TM) 780M, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloaded 24/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  2606.26 MiB
llm_load_tensors:        CPU buffer size =  3647.87 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  ROCm_Host KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.12 MiB
ggml_gallocr_reserve_n: reallocating ROCm0 buffer from size 0.00 MiB to 173.04 MiB
ggml_gallocr_reserve_n: reallocating ROCm_Host buffer from size 0.00 MiB to 57.01 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   173.04 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    57.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 132
ggml_gallocr_needs_realloc: graph has different number of nodes
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving

system_info: n_threads = 4 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 128, n_keep = 1

 introduce shanghaiggml_gallocr_needs_realloc: node inp_embd is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
to the world. nobody was surprised. China's largest city, Shanghai, has always been a place of contrasts. From the towering skyscrapers and bustling streets of the city center to the tranquil waterways and traditional architecture of the old town, there is always something new to discover. Shanghai is a city of contrasts, a place where the ancient and the modern coexist. The city's rich history, culture, and natural beauty make it a unique and fascinating destination for travelers. Shanghai is a city of contrasts, a
llama_print_timings:        load time =   16934.81 ms
llama_print_timings:      sample time =      19.85 ms /   128 runs   (    0.16 ms per token,  6447.39 tokens per second)
llama_print_timings: prompt eval time =   40527.34 ms /     5 tokens (  8105.47 ms per token,     0.12 tokens per second)
llama_print_timings:        eval time = 1106401.19 ms /   127 runs   (  8711.82 ms per token,     0.11 tokens per second)
llama_print_timings:       total time = 1147104.15 ms /   132 tokens
Log end

lihaofd avatar Apr 11 '24 00:04 lihaofd

Is there any way, or any plan, to get better performance? Thanks!
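
One build-time knob that may be worth trying on an APU, assuming it already exists in your llama.cpp revision, is the LLAMA_HIP_UMA CMake option, which lets the HIP backend allocate model buffers in unified memory rather than the iGPU's dedicated carve-out; whether it helps or hurts throughput on the 780M would need measuring:

cmake -G Ninja -DAMDGPU_TARGETS=gfx1100 -DLLAMA_HIPBLAS=ON -DLLAMA_HIP_UMA=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ..
cmake --build .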

lihaofd avatar Apr 23 '24 01:04 lihaofd

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jun 07 '24 01:06 github-actions[bot]