llama.cpp
`build`: generate hex dump of server assets during build
Currently one needs to manually regenerate asset files in `examples/server`, which:
- is easy to forget
- ~~may cause a security risk~~ (reviewers are unlikely to unescape the code - don't wanna be the next target after xz) (edit: already mitigated by https://github.com/ggerganov/llama.cpp/pull/6409 as pointed out below)
`od` is used in the Makefile (unlike `xxd`, it should be available on ~all systems, incl. on Windows w/ w64devkit - thanks to busybox), while the `cmake` and `zig` builds cast their own hex for portability (may have made me n-curse).
This has been addressed during the xz weekend here:
- #6409
But I am happy that the code is finally removed, so looks good.
Please note that the idea may look simple on the surface, but it gets messy when taking into account that we're not compiling for a single target.
- Some linux distros (docker/CI) may not have `xxd` by default. You may need to add something like `apt install xxd` to every dockerfile and compile script in the project. This will also create a new dependency in the compilation phase. In fact, a while ago, I succeeded in making a script that does the same thing as `xxd` but using `od` instead, the reason being that `od` is included by default in most linux distros. See this article for more details.
- Things get messy when talking about Windows and MacOS. In fact, I have 0 knowledge of how to compile on these targets, so I gave up.
Edit: sorry, I deleted the `od` script that I mentioned above.
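As a sketch of the availability problem mentioned above: a build script can only rely on `xxd` after probing for it (this is a hypothetical snippet, not from the actual build):

```shell
# Hypothetical probe: minimal distro/docker images often ship without xxd,
# so a script that wants it should check first and fall back (e.g. to od).
if command -v xxd >/dev/null 2>&1; then
    echo "using xxd"
else
    echo "xxd missing, falling back to od"
fi
```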
@phymbert great to read you've already mitigated the issue, I'd missed that commit in the flow.
> Things get messy when talking about Windows and MacOS. In fact, I have 0 knowledge how to compile on these targets, so I gave up.
@ngxson actually I've just hit a snag by using `-n`, which my BSD xxd likes (on MacOS) and GNU xxd has no clue about (on CI) - fixed.
Re/ Windows, I realize I've assumed people build from within WSL (or cross-build from Linux), but would need confirmation.
In the absolute worst case we could write our own minimalistic xxd, I suppose.
> Re/ Windows, I realize I've assumed people build from within WSL (or cross-build from Linux), but would need confirmation.
We cannot assume that, unfortunately; I have the feeling most Windows developers are using Visual Studio.
> In the absolute worst case we could write our own minimalistic xxd, I suppose.
I would prefer to have a classical npm build then. Also what about:
- https://stackoverflow.com/questions/4158900/embedding-resources-in-executable-using-gcc
> I would prefer to have a classical npm build then.
@phymbert possibly inevitable in the long run, would unlock new horizons...
Also what about: https://stackoverflow.com/questions/11813271/embed-resources-eg-shader-code-images-into-executable-library-with-cmake
Hah, nifty! Might need some contortions to get something similar to work w/ MSVC, but alternatively... looks like we could do it in cmake: https://stackoverflow.com/questions/11813271/embed-resources-eg-shader-code-images-into-executable-library-with-cmake
Which means out of the 3 documented ways to build on Windows:
- (MS) `make` + w64devkit: the latter ships w/ busybox, which seems to have `xxd`
- `cmake`: can do the HEX in CMakeLists.txt
- `zig`: same (TBC, I'm not sure how much I like the language so far :-D)
(so xxd in Makefile for all platforms, and adhoc hexing in the other build scripts 🤞)
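The cmake approach referenced above (from the linked Stack Overflow answer) could look roughly like the following sketch - the function, file, and variable names are illustrative, not the actual CMakeLists.txt rules:

```cmake
# Sketch, not the real build rules: read a file as a hex string and
# turn it into a C array, avoiding any dependency on xxd.
function(generate_hex_header INPUT OUTPUT VARNAME)
    file(READ "${INPUT}" CONTENT HEX)
    # "68656c6c6f" -> "0x68,0x65,0x6c,0x6c,0x6f,"
    string(REGEX REPLACE "([0-9a-f][0-9a-f])" "0x\\1," CONTENT "${CONTENT}")
    file(WRITE "${OUTPUT}"
        "unsigned char ${VARNAME}[] = {${CONTENT}};\n"
        "unsigned int ${VARNAME}_len = sizeof(${VARNAME});\n")
endfunction()

generate_hex_header(public/index.html index.html.hpp index_html)
```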
Good job! The read-file-as-HEX trick in cmake seems to resolve the problem on both mac & windows.
For `make`, there are users still using `make` on linux, and `xxd` may not be available by default; at least it's not installed on my fedora installation or my ubuntu distrobox. Of course I can simply `apt install` it, but I'm a bit worried that this change may break compilation for someone else.
> In the absolute worst case we could write our own minimalistic xxd, I suppose.
This does not sound really awful - we can give it a try if we encounter issues with the proposed PR
One idea to make things even more user-friendly is to have a minimal `index.html` committed and loaded when the hex conversion has not been applied (for whatever reason). That minimal `index.html` can just show information about the necessary steps to generate the UI.
📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 460 iterations 🚀
Expand details for performance related PR only
- Concurrent users: 8, duration: 10m
- HTTP request : avg=10214.07ms p(95)=26164.51ms fails=, finish reason: stop=412 truncated=48
- Prompt processing (pp): avg=112.42tk/s p(95)=549.33tk/s
- Token generation (tg): avg=24.22tk/s p(95)=36.95tk/s
- ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=generate-assets commit=b9286a4d7b929b077f1d05395e2bc9ecadf5a99a
[four benchmark time-series charts (mermaid xychart), title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 460 iterations": llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing]
Nomic's Kompute fork builds xxd from source (it's a single .c source file) unconditionally: https://github.com/nomic-ai/kompute/blob/d1e3b0953cf66acc94b2e29693e221427b2c1f3f/CMakeLists.txt#L187
@cebtenzzre hah great, so we kinda always have xxd after all... (assuming one does `git clone --recursive`, which I didn't)
@ggerganov @ngxson I've switched the Makefile to using `od` & `wc` (should work on Windows w/ w64devkit through busybox 🤞). None of the 3 hex-dumping methods are exactly pretty, but they're relatively short (building `xxd` would probably take the same amount of build script lines, or more b/c of two-phased builds).
Feel free to merge when ready
I think this kind of breaks ROCm for cmake on windows. debugbuild.txt
@sorasoras glancing through it, I think it might be related to https://github.com/ggerganov/llama.cpp/pull/6797 @jart FYI