llama.cpp
RISC-V (TH1520&D1) benchmark and hack for <1GB DDR device
Hi, I just tested on a RISC-V board: 4xC910 2.0GHz TH1520 LicheePi4A (https://sipeed.com/licheepi4a) with 16GB LPDDR4X. About 6 s/token without any instruction acceleration, and it should be <5 s/token when boosted to 2.5GHz.
llama_model_load: ggml ctx size = 668.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: prompt: 'They'
main: number of tokens in prompt = 2
1 -> ''
15597 -> 'They'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
They are now available for sale at the cost of Rs 20,5
main: mem per token = 14368644 bytes
main: load time = 91.25 ms
main: sample time = 39.22 ms
main: predict time = 105365.27 ms / 6197.96 ms per token
main: total time = 129801.62 ms
1xC906 1.0GHz D1 LicheeRV with 1GB DDR3: about 180 s/token without any instruction acceleration. It is very slow due to lack of memory.
main: mem per token = 14368644 bytes
main: load time = 1412.77 ms
main: sample time = 185.77 ms
main: predict time = 3171739.00 ms / 186572.88 ms per token
main: total time = 3609667.50 ms
Note the ggml ctx size is 668 MB, not 4668 MB. I hacked the code so that low-memory (>=512 MB) devices can run llama, and it does not use swap, since treating the SD card as swap would damage the card quickly. Should this feature be added upstream?
And here is a time-lapse video of the D1 running the llama 7B model. It is super slow even at 120x speedup, but it works!
https://user-images.githubusercontent.com/3403712/226168660-a0e9c775-edf7-4895-9b2b-b6addcf7868e.mp4
I am very interested in trying to run your code on a 1GB ARM device. Feel free to share it in your repo!
@Zepan Unfortunately, we cannot see the video! How did you modify this codebase to support lower-RAM devices? Currently, I can run on a 6GB phone, but this could make it fit a 4GB phone!
The video works in an external video player.
You can apply this patch https://github.com/ggerganov/llama.cpp/pull/294 to cut llama_model_load: memory_size = 512.00 MB in half (edit: now in master).
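For context, that 512 MB is the KV cache, and halving it presumably means storing keys and values as f16 instead of f32. A back-of-envelope check against the log above, as a minimal C sketch; the 7B defaults (n_layer = 32, n_ctx = 512, n_embd = 4096) are my assumptions, not from the thread:

```c
/* Back-of-envelope for "memory_size = 512.00 MB, n_mem = 16384" above.
 * Assumed 7B defaults: n_layer = 32, n_ctx = 512, n_embd = 4096. */
#include <stdio.h>

int main(void) {
    const long n_layer = 32, n_ctx = 512, n_embd = 4096;
    const long n_mem      = n_layer * n_ctx;   /* 16384, matches the log */
    const long n_elements = n_embd * n_mem;    /* per K and per V cache  */

    /* 2 caches (K and V) x element size in bytes */
    printf("f32 KV cache: %ld MB\n", 2 * n_elements * 4 / (1024 * 1024)); /* 512 */
    printf("f16 KV cache: %ld MB\n", 2 * n_elements * 2 / (1024 * 1024)); /* 256 */
    return 0;
}
```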
Here is my repo: https://github.com/Zepan/llama.cpp. It just uses mmap to reduce model memory, since the weights are read-only. Compared to the swap method, it won't hurt the SD/eMMC, but it will be really slow due to storage bandwidth.
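A minimal sketch of the idea, assuming POSIX mmap (this is illustrative, not the actual code from the repo above, and the file name is an example): mapping the weight file read-only lets the kernel fault pages in from storage on demand and simply drop clean pages under memory pressure, so the SD/eMMC only ever sees reads.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "ggml-model-q4_0.bin";   /* example model file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* PROT_READ + MAP_PRIVATE: tensors are backed by the file itself;
     * under memory pressure the kernel evicts clean pages and re-reads
     * them later, so no swap writes ever hit the card. */
    void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);  /* the mapping stays valid after close */

    printf("mapped %lld bytes at %p\n", (long long)st.st_size, weights);
    /* ... point tensor data at offsets inside the mapping ... */

    munmap(weights, st.st_size);
    return 0;
}
```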
@Zepan, some fixes are missing in quantize.cpp:
error: error(compilation): clang failed with stderr: /home/kassane/llama-sipeed/quantize.cpp:139:33: warning: cast from 'const char *' to 'char *' drops const qualifier [-Wcast-qual]
/home/kassane/llama-sipeed/quantize.cpp:140:33: warning: cast from 'const char *' to 'char *' drops const qualifier [-Wcast-qual]
/home/kassane/llama-sipeed/quantize.cpp:148:19: error: no member named 'score' in 'gpt_vocab'
/home/kassane/llama-sipeed/quantize.cpp:270:35: warning: comparison of integers of different signs: 'int' and 'std::vector<long>::size_type' (aka 'unsigned long') [-Wsign-compare]
/home/kassane/llama-sipeed/quantize.cpp:274:35: warning: comparison of integers of different signs: 'int' and 'std::vector<long>::size_type' (aka 'unsigned long') [-Wsign-compare]
/home/kassane/llama-sipeed/quantize.cpp:292:31: warning: comparison of integers of different signs: 'int' and 'std::vector<long>::size_type' (aka 'unsigned long') [-Wsign-compare]
/home/kassane/llama-sipeed/quantize.cpp:297:31: warning: comparison of integers of different signs: 'int' and 'std::vector<long>::size_type' (aka 'unsigned long') [-Wsign-compare]
I don't get this error in my repo, and I didn't change quantize.cpp. You can comment out quantize in the Makefile and try again.
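If it helps, commenting out the quantize target would look something like this. The lines below are a hypothetical excerpt under the assumption that the Makefile builds quantize from quantize.cpp; the exact rule in your checkout may differ:

```make
# Drop quantize from the default targets...
default: main        # was: default: main quantize

# ...and comment out its build rule:
# quantize: quantize.cpp ggml.o utils.o
#	$(CXX) $(CXXFLAGS) quantize.cpp ggml.o utils.o -o quantize $(LDFLAGS)
```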
Interesting! Any more hardware tests? Say, RK3588?
This issue was closed because it has been inactive for 14 days since being marked as stale.