llama.cpp
[User] nonsense responses with q2_k llama in Termux when using GPU
- [x] I am running the latest code. 794db3e
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
./main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin --color -c 2048 --keep -1 -t 3 -b 7 -i -ins -ngl 1
runs, but produces nonsense responses. To clarify, without -ngl it works as expected.
LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin --color -c 2048 --keep -1 -t 3 -b 7 -i -ins -ngl 1
main: build = 0 (unknown)
main: seed = 1686999485
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM)'
ggml_opencl: device FP16 support: true
llama.cpp: loading model from /data/data/com.termux/files/home/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required = 4383.18 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 1 repeating layers to GPU
llama_model_load_internal: offloaded 1/35 layers to GPU
llama_model_load_internal: total VRAM used: 81 MB
...................................................................................................
llama_init_from_file: kv self size = 1024.00 MB
system_info: n_threads = 3 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 7, n_predict = -1, n_keep = 2
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
> hi Samantha
Amts inheritanceтарасла inwonolasSEEhalbźyn osccius dolensaieri corsoistetximdebugliminputdebuglia Savjönlin corso background
inheritanceieriieri Sav Encyclopediaiste Хро invånare octIABPush Lit oscattanleb Albattan Nation Cur Podpoisattan SE smdebugérezpois dressyen Savunneldebugassets Alb
Albattanźnit хиattandy Mann Overflowirection podlectee curveLENGarusuenpgfein Хроertenistepois oscź�
>
I tested open-llama-7B-open-instruct.ggmlv3.q2_K and had the same result.
Environment and Context
Here's clinfo (native OpenCL):
LD_LIBRARY_PATH=/vendor/lib64 clinfo
Number of platforms 1
Platform Name QUALCOMM Snapdragon(TM)
Platform Vendor QUALCOMM
Platform Version OpenCL 2.0 QUALCOMM build: commit #3dad7f8ed7 changeid #I593c16c433 Date: 10/01/21 Fri Local Branch: Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.UM.9.1.R1.11.00.00.604.073
Platform Profile FULL_PROFILE
Platform Extensions
Platform Name QUALCOMM Snapdragon(TM)
Number of devices 1
Device Name QUALCOMM Adreno(TM)
Device Vendor QUALCOMM
Device Vendor ID 0x5143
Device Version OpenCL 2.0 Adreno(TM) 640
Driver Version OpenCL 2.0 QUALCOMM build: commit #3dad7f8ed7 changeid #I593c16c433 Date: 10/01/21 Fri Local Branch: Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.UM.9.1.R1.11.00.00.604.073 Compiler E031.37.12.01
Device OpenCL C Version OpenCL C 2.0 Adreno(TM) 640
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 2
Max clock frequency 1MHz
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 1024
Preferred work group size multiple (kernel) 128
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 0
half 1 / 1 (cl_khr_fp16)
float 1 / 1
double 0 / 0 (n/a)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity Yes
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity Yes
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 64, Little-Endian
Global memory size 3911952384 (3.643GiB)
Error Correction support No
Max memory allocation 977988096 (932.7MiB)
Unified memory for Host and Device Yes
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing Yes
Fine-grained system sharing No
Atomics Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Page size (QCOM) 4096 bytes
External memory padding (QCOM) 0 bytes
Preferred alignment for atomics
SVM 128 bytes
Global 0 bytes
Local 0 bytes
Max size for global variable 65536 (64KiB)
Preferred total size of global vars 1048576 (1024KiB)
Global Memory cache type Read/Write
Global Memory cache size 131072 (128KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 64 bytes
Pitch alignment for 2D image buffers 64 pixels
Max 2D image size 16384x16384 pixels
Max 3D image size 16384x16384x2048 pixels
Max number of read image args 128
Max number of write image args 64
Max number of read/write image args 64
Max number of pipe args 16
Max active pipe reservations 7680
Max pipe packet size 1024
Local memory type Local
Local memory size 32768 (32KiB)
Max number of constant args 8
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 1024
Queue properties (on host)
Out-of-order execution Yes
Profiling Yes
Queue properties (on device)
Out-of-order execution Yes
Profiling Yes
Preferred size 655376 (640KiB)
Max size 655376 (640KiB)
Max queues on device 1
Max events on device 1024
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
printf() buffer size 1048576 (1024KiB)
Built-in kernels (n/a)
Device Extensions cl_khr_3d_image_writes cl_img_egl_image cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_egl_event cl_khr_egl_image cl_khr_fp16 cl_khr_gl_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_image2d_from_buffer cl_khr_mipmap_image cl_khr_srgb_image_writes cl_khr_subgroups cl_qcom_create_buffer_from_image cl_qcom_ext_host_ptr cl_qcom_ion_host_ptr cl_qcom_perf_hint cl_qcom_other_image cl_qcom_subgroup_shuffle cl_qcom_vector_image_ops cl_qcom_extract_image_plane cl_qcom_android_native_buffer_host_ptr cl_qcom_protected_context cl_qcom_priority_hint cl_qcom_compressed_yuv_image_read cl_qcom_compressed_image cl_qcom_ext_host_ptr_iocoherent cl_qcom_accelerated_image_ops cl_qcom_ml_ops
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [P0]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name QUALCOMM Snapdragon(TM)
Device Name QUALCOMM Adreno(TM)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name QUALCOMM Snapdragon(TM)
Device Name QUALCOMM Adreno(TM)
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) Invalid device type for platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name QUALCOMM Snapdragon(TM)
Device Name QUALCOMM Adreno(TM)
lscpu:
Architecture:           aarch64
CPU op-mode(s):         32-bit, 64-bit
Byte Order:             Little Endian
CPU(s):                 8
On-line CPU(s) list:    0-7
Vendor ID:              Qualcomm
Model name:             Kryo-4XX-Silver
  Model:                14
  Thread(s) per core:   1
  Core(s) per socket:   4
  Socket(s):            1
  Stepping:             0xd
  CPU(s) scaling MHz:   62%
  CPU max MHz:          1785.6000
  CPU min MHz:          300.0000
  BogoMIPS:             38.40
  Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Model name:             Kryo-4XX-Gold
  Model:                14
  Thread(s) per core:   1
  Core(s) per socket:   2
  Socket(s):            2
  Stepping:             0xd
  CPU(s) scaling MHz:   71%
  CPU max MHz:          2841.6001
  CPU min MHz:          710.4000
  BogoMIPS:             38.40
  Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Vulnerabilities:
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Vulnerable
  Spec store bypass:    Vulnerable
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Mitigation; Branch predictor hardening
  Srbds:                Not affected
  Tsx async abort:      Not affected
uname -a
Linux localhost 4.14.190-23725627-abG975WVLS8IWD1 #2 SMP PREEMPT Mon Apr 10 18:16:39 KST 2023 aarch64 Android
- SDK version, e.g. for Linux:
Python 3.11.4
GNU Make 4.4.1
cmake version 3.26.4
clang version 16.0.6
Target: aarch64-unknown-linux-android24
Thread model: posix
InstalledDir: /data/data/com.termux/files/usr/bin
Steps to Reproduce
- Build llama.cpp with CLBlast enabled
- load q2_k model with -ngl # parameter
- Query the model
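A concrete sketch of these steps, assuming a CMake build with the CLBlast option (the flag name, library path, and model path here are examples from my setup and may differ elsewhere):
cmake -B build -DLLAMA_CLBLAST=ON
cmake --build build --config Release
LD_LIBRARY_PATH=/vendor/lib64 ./build/bin/main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin -i -ins -ngl 1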
Thank you!
I had such a problem.
1 - check if you have the newest build; 2 - if yes, then check that your model is not corrupted.
I had those 2 problems.
I had such a problem.
1 - check if you have the newest build; 2 - if yes, then check that your model is not corrupted.
I had those 2 problems.
Yes. [86c7571] and I tested with 2 different q2_k models.
Have you tried a q4 model as a test?
Could it be that -ngl 1 works only on Apple Silicon systems? You seem to be on a Linux machine.
Have you tried a q4 model as a test?
Thanks for your response. I tested it now with open-llama-7B-open-instruct.ggmlv3.q4_0 and it's functional, working as expected.
The issue is with q2_k models specifically.
Could it be that -ngl 1 works only on Apple Silicon systems? You seem to be on a Linux machine.
The -ngl parameter works with OpenCL through CLBlast even on my device (Android with Termux).
Seems k-quant models are not fully supported on ARM (Linux?) devices...
Seems k-quant models are not fully supported on ARM (Linux?) devices...
I'm downloading a 3_k_s model now, but I can't test until later tonight, so I'll let you know how it goes.
It's an Android device with Termux. Edit: the q3_K_S model is functional (no gibberish).
LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q3_K_S.bin --color -c 2048 --keep -1 -t 3 -b 7 -i -ins -ngl 1
main: build = 0 (unknown)
main: seed = 1687028632
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM)'
ggml_opencl: device FP16 support: true
llama.cpp: loading model from /data/data/com.termux/files/home/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q3_K_S.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 11 (mostly Q3_K - Small)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required = 4471.30 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 1 repeating layers to GPU
llama_model_load_internal: offloaded 1/35 layers to GPU
llama_model_load_internal: total VRAM used: 83 MB
...................................................................................................
llama_init_from_file: kv self size = 1024.00 MB
system_info: n_threads = 3 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 7, n_predict = -1, n_keep = 2
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
> hi, hows it?
Hello! I'm doing well and eager to learn more about your day. What can I help you with today?
Are you sure your q2_k model is not broken? :P If not, then something is wrong with the ARM build and q2_k. Anyway, q2_k is useless and shouldn't be used... too much lobotomy.
Are you sure your q2_k model is not broken? :P If not, then something is wrong with the ARM build and q2_k.
The q2_k models work with -ngl 0 (disabled), so yes I'm sure the .bin for Samantha & Open Llama are not corrupt.
Anyway, q2_k is useless and shouldn't be used... too much lobotomy.
I cannot reproduce it on a PC using OpenCL.
Here is what I get, looks perfectly reasonable:
llama_init_from_file: kv self size = 1024,00 MB
system_info: n_threads = 3 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 2048, n_batch = 7, n_predict = -1, n_keep = 2
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
> hi Samantha
Hello! I'm happy to be here for you, ready to support and guide you through any situation.
> Tell me what you know about Hacker News
Hacker News is a popular news website that covers tech-related topics such as startup news, hacking, and coding projects. It features articles written by both professionals and amateurs in the field, providing an opportunity for open conversations between people who share a common interest in technology and related subjects.
Btw, using -ngl 1 will load a single layer on the GPU. If the model fits completely in VRAM, it is better to use -ngl 100.
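For example (hypothetical model path; a value larger than the actual layer count is simply clamped, so this offloads every layer the model has):
./main -m models/7B/ggml-model-q4_0.bin -p "hi" -ngl 100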
I can reproduce it on a PC using OpenCL.
Here is what I get, looks perfectly reasonable:
Hi,
I don't see a reproduction in your message. Are you saying you're able to produce the nonsense with a q2_k model on PC?
Btw, using -ngl 1 will load a single layer on the GPU. If the model fits completely in VRAM, it is better to use -ngl 100.
Increasing -ngl # slows inference: #1718
Sorry, typo. I meant "cannot", not "can".
Sorry, typo. I meant "cannot", not "can".
Thanks for clarifying. I'm thinking it may be an ARM device specific issue, like mirek190 mentioned.
Even with CLBlast the error is gone if I don't offload layers.
Yes, q2_k functions normally through CLBlast without offloading.
Small update, same results:
Built ba4e85a with CLBlast, using open-llama-13b-q2_K:
~/c/build> cd bin
u0_a1282@localhost ~/c/b/bin> LD_LIBRARY_PATH=/vendor/lib64 ./main -m ~/llama.cpp/models/open-llama-13b-q2_K.bin --color -c 2048 --keep -1 -t 3 -b 7 -i -ins -ngl 1
main: build = 0 (unknown)
main: seed = 1687193815
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM)'
ggml_opencl: device FP16 support: true
llama.cpp: loading model from /data/data/com.termux/files/home/llama.cpp/models/open-llama-13b-q2_K.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required = 7097.25 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 1 repeating layers to GPU
llama_model_load_internal: offloaded 1/43 layers to GPU
llama_model_load_internal: total VRAM used: 127 MB
....................................................................................................
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 3 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 7, n_predict = -1, n_keep = 2
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
> Hi. What's a fun thing to do at the beach?
Loopijarowsefore Ring conduct under privilege loop Loop rare foreannonanu rare Foreunders encitiroeitles scale Cro currencyfore
llama_print_timings: load time = 16974.21 ms
llama_print_timings: sample time = 59.59 ms / 25 runs ( 2.38 ms per token, 419.51 tokens per second)
llama_print_timings: prompt eval time = 74631.28 ms / 35 tokens ( 2132.32 ms per token, 0.47 tokens per second)
llama_print_timings: eval time = 191464.64 ms / 24 runs ( 7977.69 ms per token, 0.13 tokens per second)
llama_print_timings: total time = 298827.60 ms
I didn't expect a change, but wanted to provide additional information. Per the results, even 13b q2_k produces nonsense.
Thank you.
Same issue. Using Termux on SM8250 (Snapdragon 870) with 8 GB memory, built on the latest commit on the master branch, getting gibberish output with offloading (-ngl 1 to 35) with the llama-7b.ggmlv3.q2_K.bin model.
@JackJollimore you fit a 13B model on an 8 GB phone? (7 GB used?) Is there any custom ROM usable to free the extra 2 GB the Android system is using, such that these phones could be repurposed to run like cheap low-power SBC servers?
Thanks for your response. My device has 8 GB of RAM, but there's also 8 GB of virtual RAM in the settings. Edit: to clarify, it's stock Android, no root.
When loading more than 8 GB into RAM it's quite slow, but yes, it functions.
The 13B q2_K max RAM is 8.01 GB vs. 13B q4_0 at 9.82 GB, which is significant when it comes to inference speed for a model that size on a device like mine.
I noticed the same actually, I was using the Orange Pi 5B which ships with some custom Android and vendor OpenCL.
Solved by recent pull. llama-7b.ggmlv3.q2_K.bin
training a neural network is done in following steps:
- Preparing the training data sets (Inputs - outputs)
- Training the Neural Network
- Testing the accuracy of Neural Network Neural Network Algorithm: The basic approach towards learning by Artificial Intelligence, the most successful one up to now is called neural...
Solved by recent pull.
I pulled today. Here's my result with -ins:
> ./main -m ~/open-llama-7b-q2_K.bin -i -ins -ngl 1
...
> please list 5 movies.
licensed|unlicensed|
------------|----|
1|1|
2|0|
3|0|
4|1|
5|0|
6|0|
7|0|
8|1|
9|1|
10|0|
11|1|
12|0|
13
with --prompt:
> ./main -m ~/open-llama-7b-q2_K.bin -i -ngl 1 -p "Please list 5 movies."
...
> Please list 5 movies.aden is a gambler!
Aden is a gambler!
Aden is a gambler! is a list by zebra_69 on Listal.
No users voted For the Love of Mike
zebra_69
41 items...
The list contains 1 items. No items are shared with this list.
© 2005-2013 listal.com All rights reserved. Contact Us Privacy policy About Us
This page has been served 0 d since Wed Mar 7 2
Edit: Samantha model really highlights the error:
./main -m ~/samantha-1.1-llama-7b.ggmlv3.q2_K.bin -i -ins -ngl 1
...
> Hi Samantha
Хромой теплёгой ветерок ставит мечты.
> Good day Samantha, please list some movies.
package com.opengamma.util.function;
import java.io.Serializable;
/**
* Utility class containing common mathematical operations as static methods for convenience.
* <p>
* The goal is to provide a simple, easy-to-use and efficient interface for mathematical operations,
* target
>
Edit 2: I've noticed that a prompt template significantly improves the quality of the response from 2_k models. Here's ./server Samantha:
./server -m ~/samantha-1.1-llama-7b.ggmlv3.q2_K.bin -ngl 1
...
User: Hello Samantha. Please list some movies.
Samantha: Хродингу! Hi there! I'd be happy to share a few films with you. Here are a few popular choices that have stood the test of time:
1. "The Godfather" (1972)
2. "Pulp Fiction" (1994)
3. "The Shawshank Redemption" (1994)
The model starts with garble consistently, but it's definitely improved since posting.
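For comparison, a rough ./main equivalent of that server session, passing a chat-style template directly (the template wording here is just an illustration, not necessarily what ./server sends):
./main -m ~/samantha-1.1-llama-7b.ggmlv3.q2_K.bin -ngl 1 -i -r "User:" -p "A chat between a user and Samantha, a helpful AI companion. User: Hello Samantha. Please list some movies. Samantha:"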
#2133 shows perplexity for GPU on Android is bugged.
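For anyone wanting to check this themselves, the measurement in question is the perplexity tool run with layers offloaded, roughly like this (assuming a local wikitext test file; paths are examples):
LD_LIBRARY_PATH=/vendor/lib64 ./perplexity -m ~/llama.cpp/models/samantha-1.1-llama-7b.ggmlv3.q2_K.bin -f wiki.test.raw -ngl 1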
More on this: latest build, Snapdragon 8 Gen 2, Termux.
- Qn_K models are garbled with -ngl > 0, but work fine with -ngl = 0
- Qn_0 legacy format models work fine with -ngl > 0
Separately, GPU off-loading, when it works, decreases performance. Probably a memory bandwidth issue.
Hello, were you able to fix this issue?
More on this: recent koboldcpp build, Snapdragon 8 Gen 1, Termux.
Any quant is garbled with GGUF models, k-quant or not, offloaded layers or not. GGML models work okay.
Tried with Mistral-7B GGUF and Marx-3B GGML.
The problem occurs with CuBLAS or OpenBLAS, no difference.
More on this: recent koboldcpp build, Snapdragon 8 Gen 1, Termux.
Any quant is garbled with GGUF models, k-quant or not, offloaded layers or not. GGML models work okay.
Tried with Mistral-7B GGUF and Marx-3B GGML.
The problem occurs with CuBLAS or OpenBLAS, no difference.
Do you see performance degradation in terms of speed on the 8 Gen 1 GPU compared to running the model on the CPU?
More on this: recent koboldcpp build, Snapdragon 8 Gen 1, Termux. Any quant is garbled with GGUF models, k-quant or not, offloaded layers or not. GGML models work okay. Tried with Mistral-7B GGUF and Marx-3B GGML. The problem occurs with CuBLAS or OpenBLAS, no difference.
Do you see performance degradation in terms of speed on the 8 Gen 1 GPU compared to running the model on the CPU?
In my tests the prompt processing is way faster, but the token generation is indeed slower. I'm just using it to process the prompt.
This issue was closed because it has been inactive for 14 days since being marked as stale.