llama.cpp
Raspberry Pi 4 4GB
Hi!
Just a report: I've successfully run the LLaMA 7B model on my 4GB RAM Raspberry Pi 4. It's super slow, at about 10 sec/token, but it looks like we can run powerful cognitive pipelines on cheap hardware. It's awesome. Thank you!
Hardware : BCM2835
Revision : c03111
Serial : 10000000d62b612e
Model : Raspberry Pi 4 Model B Rev 1.1
%Cpu0 : 71.8 us, 14.6 sy, 0.0 ni, 0.0 id, 2.9 wa, 0.0 hi, 10.7 si, 0.0 st
%Cpu1 : 77.4 us, 12.3 sy, 0.0 ni, 0.0 id, 10.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 81.0 us, 8.6 sy, 0.0 ni, 0.0 id, 10.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 77.1 us, 12.4 sy, 0.0 ni, 1.0 id, 9.5 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 3792.3 total, 76.2 free, 3622.9 used, 93.2 buff/cache
MiB Swap: 65536.0 total, 60286.5 free, 5249.5 used. 42.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2705518 ubuntu 20 0 5231516 3.3g 1904 R 339.6 88.3 84:16.70 main
102 root 20 0 0 0 0 S 14.2 0.0 29:54.42 kswapd0
main: seed = 1678644466
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291
main: prompt: 'The first man on the moon was '
main: number of tokens in prompt = 9
1 -> ''
1576 -> 'The'
937 -> ' first'
767 -> ' man'
373 -> ' on'
278 -> ' the'
18786 -> ' moon'
471 -> ' was'
29871 -> ' '
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
The first man on the moon was 20 years old and looked a lot like me. In fact, when I read about Neil Armstrong during school lessons my fa
It looks like it's possible to pack it into an AWS Lambda on ARM Graviton + S3 weights offloading.
Is it swapping? Instead of opening new issues perhaps these numbers should be collected in issue #34 ("benchmarks?").
@neuhaus The kswapd0 process is pretty active.
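A quick way to confirm, independent of llama.cpp: watch the swap activity while generation is running.

```
# si/so columns are pages swapped in/out per second; sustained non-zero values
# while ./main is generating mean the model is not fitting in RAM
vmstat 1
# overall memory and swap usage at a glance
free -h
```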
I'm trying to run it on a Chromebook but I've hit a segfault :(
One thing I may have done wrong is that I reused the same .bin I created on my Mac, where it worked.
@dalnk Do you have swap in the system?
What did you change to be able to run it on the Pi? I have a PC four times more powerful and it crashes every time I try. @miolini
1.2 tokens/s on a Samsung S22 Ultra running 4 threads.
The S22 obviously has a more powerful processor. But I do not think it is 12 times more powerful. It's likely you could get much faster speeds on the Pi.
I'd be willing to bet that the bottleneck is not the processor.
@MarkSchmidty Thank you for sharing your results. I believe my system swapped a lot due to the limited RAM (4GB RAM, ~4GB model size).
Ah, yes. A 3-bit implementation of 7B would fit fully in 4GB of RAM and lead to much greater speeds. This is the same issue as in https://github.com/ggerganov/llama.cpp/issues/97.
3-bit support is a proposed enhancement in GPTQ Quantization (3-bit and 4-bit) #9. GPTQ 3-bit has been shown to have negligible output quality loss versus uncompressed 16-bit, and may even provide better output quality than the current naive 4-bit implementation in llama.cpp, while requiring 25% less RAM.
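Rough back-of-envelope using the numbers from the log above: the 4-bit model file is 4017 MB and the KV cache another 512 MB, which together already exceed the 3792 MB of RAM that top reports, so swapping is unavoidable. Scaling the weights to 3 bits would cut them to roughly three quarters of that, around 3.0 GB, which plus the 512 MB cache and some working memory should just about squeeze into 4 GB.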
@MarkSchmidty Fingers crossed!
I'm currently unable to build for aarch64 on an RPi 4 due to missing SIMD dot product intrinsics (vdotq_s32). Replacing them with abort() makes compilation complete but results in a crash at runtime. Changing -mcpu to include dotprod results in a runtime crash from an illegal instruction.
```
ronsor@ronsor-rpi4:~/llama.cpp $ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
CPU max MHz: 2000.0000
CPU min MHz: 600.0000
BogoMIPS: 108.00
L1d cache: 128 KiB
L1i cache: 192 KiB
L2 cache: 1 MiB
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Vulnerable
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm crc32 cpuid
```
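For anyone else checking their board: the CPU feature behind vdotq_s32 is the ARMv8.2 dot product extension, which shows up as asimddp in the flags; it's absent above, so the instruction can never execute on this Cortex-A72 no matter what -mcpu says. A quick check, assuming a standard aarch64 kernel:

```
# prints 'asimddp' if the CPU supports the SDOT/UDOT instructions behind vdotq_s32;
# no output on a Pi 4 (Cortex-A72), so the build must use the non-dotprod fallback
grep -o asimddp /proc/cpuinfo | sort -u
```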
@Ronsor Could you please share the build log?
@Ronsor Something is wrong with your environment:
I UNAME_S: Linux
I UNAME_P: unknown
I UNAME_M: aarch64
My build log on the RPi starts with:
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: aarch64
I UNAME_M: aarch64
Raspberry Pi 3 Model B Rev 1.2, 1GB RAM + swap (5.7GB on microSD), also works but very slowly
Hardware : BCM2835
Revision : a22082
Serial : 0000000086ed002f
Model : Raspberry Pi 3 Model B Rev 1.2
Which distro are you using? I'm just on vanilla Raspberry Pi OS. It seems the vdotq change is the issue.
Re-adding the old dot product code fixed my issue.
Now that I fixed that (I'll submit a PR soon), running on an 8GB Pi results in not-terrible performance:
main: seed = 1678806223
llama_model_load: loading model from 'models/llama-7B/ggml-model.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'models/llama-7B/ggml-model.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
1 -> ''
8893 -> 'Build'
292 -> 'ing'
263 -> ' a'
4700 -> ' website'
508 -> ' can'
367 -> ' be'
2309 -> ' done'
297 -> ' in'
29871 -> ' '
29896 -> '1'
29900 -> '0'
2560 -> ' simple'
6576 -> ' steps'
29901 -> ':'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
Building a website can be done in 10 simple steps:
Get your domain name. There are plenty of great places to register domains at
main: mem per token = 14368644 bytes
main: load time = 16282.27 ms
main: sample time = 35.33 ms
main: predict time = 31862.14 ms / 1062.07 ms per token
main: total time = 51951.39 ms
~1 token/sec
Hey @Ronsor, I'm having the same issue. Could you say exactly what you did to fix it?
@davidrutland Basically undo commit https://github.com/ggerganov/llama.cpp/commit/84d9015c4a91ab586ba65d5bd31a8482baf46ba1 and it should build fine
Not able to build on my RPi 4 4GB running Ubuntu 22.10:
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: aarch64
I UNAME_M: aarch64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mcpu=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -mcpu=native
I LDFLAGS:
I CC: cc (Ubuntu 12.2.0-3ubuntu1) 12.2.0
I CXX: g++ (Ubuntu 12.2.0-3ubuntu1) 12.2.0
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mcpu=native -c ggml.c -o ggml.o
In file included from ggml.c:137:
/usr/lib/gcc/aarch64-linux-gnu/12/include/arm_neon.h: In function ‘ggml_vec_dot_q4_0’:
/usr/lib/gcc/aarch64-linux-gnu/12/include/arm_neon.h:29527:1: error: inlining failed in call to ‘always_inline’ ‘vdotq_s32’: target specific option mismatch
29527 | vdotq_s32 (int32x4_t __r, int8x16_t __a, int8x16_t __b)
| ^~~~~~~~~
ggml.c:1368:15: note: called from here
1368 | p_1 = vdotq_s32(p_1, v0_1hs, v1_1hs);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/aarch64-linux-gnu/12/include/arm_neon.h:29527:1: error: inlining failed in call to ‘always_inline’ ‘vdotq_s32’: target specific option mismatch
29527 | vdotq_s32 (int32x4_t __r, int8x16_t __a, int8x16_t __b)
| ^~~~~~~~~
ggml.c:1367:15: note: called from here
However, when I removed the changes added in #67 related to vdotq_s32, I was able to build successfully. Simply git revert the commit.
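In case it helps anyone else, this is roughly what that looks like (the hash is the commit @Ronsor linked above; the make invocation assumes the stock Makefile):

```
# inside your llama.cpp checkout: revert the commit that introduced the vdotq_s32 path
git revert 84d9015c4a91ab586ba65d5bd31a8482baf46ba1
# rebuild from scratch so ggml.o is recompiled without the dotprod intrinsics
make clean && make
```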
I tried to run the convert-pth-to-ggml.py script on my Pi and found it always OOMs. But running it on a Mac seems okay.
# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1
~~Then I experienced the same core dump issue as @dalnk described, after I copied the bin file generated on the Mac to the Pi.~~
Turns out I was using the fp16 model, which is why it core dumped. It was resolved after I ran the correct command.
I would suggest we note in the README that this step is platform-agnostic, and that users should consider running it on a desktop machine and copying the result over if they are running the model on lower-spec devices like the Raspberry Pi. WDYT?
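For reference, the flow I have in mind looks roughly like this; the f16 filename and the quantize arguments are from my own setup, so double-check them against the current README:

```
# on a desktop / Mac with plenty of RAM: convert the 7B weights to ggml FP16
python3 convert-pth-to-ggml.py models/7B/ 1

# quantize FP16 -> 4-bit q4_0 (this produces the ~4GB file the Pi actually needs)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

# copy only the quantized model to the Pi (hostname and path are placeholders)
scp ./models/7B/ggml-model-q4_0.bin pi@raspberrypi:~/llama.cpp/models/7B/
```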
@Mestrace I am also getting a segfault core dump, but when quantizing on my desktop with plenty of RAM available. What were the wrong and correct commands you ran in relation to the fp16 model?
@octoshrimpy I believe Mestrace is saying you should convert and quantize the model on a desktop computer with a lot of RAM first, then move the ~4GB 4-bit quantized model to your Pi.
@MarkSchmidty That is what I am attempting, haha. Is 16GB of free RAM not enough for quantizing 7B?
This is what I'm running into, unsure where to go from here.
Run top and watch the ./quantize memory utilisation. Also keep an eye on your disk space.
@octoshrimpy What I did:
- Run convert-pth-to-ggml.py to get the f16 model weights on the Mac
- Copy the f16 model weights to the Pi
- Quantize on the Pi
@Mestrace what command did you use for quantizing?
@gjmulder I have 350G of space available, and plenty of RAM. quantize immediately crashes with a segfault, so there is no RAM/disk utilization to view. Are there logs I can check, or INFO-level logging I can enable?
I did everything on the RPi 4. Just enable swap (8GB+) on your system.
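For anyone unsure how to do that, a minimal sketch on Raspberry Pi OS / Ubuntu (size and path are just examples):

```
# create an 8GB swap file and enable it
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# make it persistent across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```

Expect heavy wear on the SD card if the model swaps constantly; a USB SSD is a safer home for the swap file.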