CASALIOY
Performance Suggestion / Benchmarks
Max Threads = Poor Performance on 8 thread processor and GGJT model after convert.py
TL;DR: Try setting n_threads to 6 instead of 8 if you have an 8-thread processor. I'm getting consistently faster results with 6 than when trying to use all 8 of my threads.
I've been testing a GGJT model to get the best performance out of a little laptop. I ran 2 tests for each n_threads setting, with nothing else open while the tests ran.
Results on an 8-thread CPU
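As a starting point for other machines, the suggestion above could be written as a small helper that leaves a couple of logical threads free. This is only a sketch: the name `suggested_n_threads` and the `reserve=2` default are assumptions based on this one 8-thread laptop, not a llama.cpp recommendation.

```python
import os


def suggested_n_threads(logical_cpus=None, reserve=2):
    """Hypothetical helper: leave `reserve` logical threads free for the rest of the system."""
    if logical_cpus is None:
        # os.cpu_count() reports logical CPUs and can return None
        logical_cpus = os.cpu_count() or 1
    # Never go below one thread
    return max(1, logical_cpus - reserve)


print(suggested_n_threads(8))
```

The returned value would then be passed as `n_threads` when constructing `Llama`. It is worth benchmarking the neighbouring values too, since the best setting may differ per CPU.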
n_threads=1
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 14464.13 ms
llama_print_timings: sample time = 20.63 ms / 40 runs ( 0.52 ms per run)
llama_print_timings: prompt eval time = 14463.85 ms / 19 tokens ( 761.26 ms per token)
llama_print_timings: eval time = 38962.48 ms / 39 runs ( 999.04 ms per run)
llama_print_timings: total time = 57510.54 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 14054.52 ms
llama_print_timings: sample time = 24.77 ms / 40 runs ( 0.62 ms per run)
llama_print_timings: prompt eval time = 14054.15 ms / 19 tokens ( 739.69 ms per token)
llama_print_timings: eval time = 50090.37 ms / 39 runs ( 1284.37 ms per run)
llama_print_timings: total time = 69022.43 ms
n_threads=2
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 9662.71 ms
llama_print_timings: sample time = 22.36 ms / 40 runs ( 0.56 ms per run)
llama_print_timings: prompt eval time = 9662.48 ms / 19 tokens ( 508.55 ms per token)
llama_print_timings: eval time = 25339.74 ms / 39 runs ( 649.74 ms per run)
llama_print_timings: total time = 39422.48 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 13699.18 ms
llama_print_timings: sample time = 27.64 ms / 40 runs ( 0.69 ms per run)
llama_print_timings: prompt eval time = 13698.78 ms / 19 tokens ( 720.99 ms per token)
llama_print_timings: eval time = 27051.24 ms / 39 runs ( 693.62 ms per run)
llama_print_timings: total time = 46124.61 ms
n_threads=4
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 9804.36 ms
llama_print_timings: sample time = 29.62 ms / 40 runs ( 0.74 ms per run)
llama_print_timings: prompt eval time = 9803.58 ms / 19 tokens ( 515.98 ms per token)
llama_print_timings: eval time = 22367.64 ms / 39 runs ( 573.53 ms per run)
llama_print_timings: total time = 38015.92 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 7894.51 ms
llama_print_timings: sample time = 23.41 ms / 40 runs ( 0.59 ms per run)
llama_print_timings: prompt eval time = 7894.35 ms / 19 tokens ( 415.49 ms per token)
llama_print_timings: eval time = 17166.80 ms / 39 runs ( 440.17 ms per run)
llama_print_timings: total time = 29655.03 ms
n_threads=6
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 8732.21 ms
llama_print_timings: sample time = 29.93 ms / 40 runs ( 0.75 ms per run)
llama_print_timings: prompt eval time = 8731.88 ms / 19 tokens ( 459.57 ms per token)
llama_print_timings: eval time = 26798.23 ms / 39 runs ( 687.13 ms per run)
llama_print_timings: total time = 41384.27 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 4623.47 ms
llama_print_timings: sample time = 21.79 ms / 40 runs ( 0.54 ms per run)
llama_print_timings: prompt eval time = 4623.19 ms / 19 tokens ( 243.33 ms per token)
llama_print_timings: eval time = 17870.62 ms / 39 runs ( 458.22 ms per run)
llama_print_timings: total time = 26962.23 ms
n_threads=7 (Seems better than 8, but not as good as 6)
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 13266.94 ms
llama_print_timings: sample time = 22.37 ms / 40 runs ( 0.56 ms per run)
llama_print_timings: prompt eval time = 13266.64 ms / 19 tokens ( 698.24 ms per token)
llama_print_timings: eval time = 31370.05 ms / 39 runs ( 804.36 ms per run)
llama_print_timings: total time = 49092.33 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 9676.00 ms
llama_print_timings: sample time = 30.28 ms / 40 runs ( 0.76 ms per run)
llama_print_timings: prompt eval time = 9675.46 ms / 19 tokens ( 509.23 ms per token)
llama_print_timings: eval time = 51035.98 ms / 39 runs ( 1308.61 ms per run)
llama_print_timings: total time = 66633.10 ms
n_threads=8 (Max threads)
Test 1
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 31573.62 ms
llama_print_timings: sample time = 23.12 ms / 40 runs ( 0.58 ms per run)
llama_print_timings: prompt eval time = 31573.35 ms / 19 tokens ( 1661.76 ms per token)
llama_print_timings: eval time = 80649.37 ms / 39 runs ( 2067.93 ms per run)
llama_print_timings: total time = 119573.09 ms
Test 2
1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
llama_print_timings: load time = 31926.09 ms
llama_print_timings: sample time = 22.00 ms / 40 runs ( 0.55 ms per run)
llama_print_timings: prompt eval time = 31925.73 ms / 19 tokens ( 1680.30 ms per token)
llama_print_timings: eval time = 67654.42 ms / 39 runs ( 1734.73 ms per run)
llama_print_timings: total time = 103776.36 ms
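To compare runs at a glance, the eval lines can be reduced to generated tokens per second. A minimal sketch, assuming the `llama_print_timings` format shown in the logs above; `eval_tokens_per_second` is a hypothetical helper name, and the sample lines are copied from the n_threads=6 and n_threads=8 results.

```python
import re

# Matches e.g. "llama_print_timings: eval time = 17870.62 ms / 39 runs ( ... )".
# Anchoring on the "llama_print_timings:" prefix keeps "prompt eval time" lines out.
EVAL_RE = re.compile(
    r"llama_print_timings:\s+eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*runs"
)


def eval_tokens_per_second(log_text):
    """Return generated tokens per second for each eval-time line in the log."""
    rates = []
    for total_ms, runs in EVAL_RE.findall(log_text):
        rates.append(int(runs) / (float(total_ms) / 1000.0))
    return rates


sample = """\
llama_print_timings: prompt eval time = 4623.19 ms / 19 tokens ( 243.33 ms per token)
llama_print_timings: eval time = 17870.62 ms / 39 runs ( 458.22 ms per run)
llama_print_timings: eval time = 67654.42 ms / 39 runs ( 1734.73 ms per run)
"""

for rate in eval_tokens_per_second(sample):
    print(f"{rate:.2f} tokens/s")
```

For the two sample eval lines (n_threads=6 Test 2 and n_threads=8 Test 2), this works out to roughly 2.2 versus 0.58 tokens per second.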
Script used for benchmarking (requires llama-cpp-python==0.1.49):
import json
import argparse

from llama_cpp import Llama

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", type=str, default="./newggjt.bin")
args = parser.parse_args()

# Change n_threads here for each benchmark run
llm = Llama(model_path=args.model, n_threads=6)

stream = llm(
    "Question: What are the names of the planets in the solar system? Answer: ",
    max_tokens=48,
    stop=["Q:", "\n"],
    stream=True,
)

# Print each chunk of text as it is generated
for output in stream:
    print(output["choices"][0]["text"], end="")
    # print(json.dumps(output, indent=2))  # uncomment to inspect the raw response
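To repeat the comparison without editing n_threads by hand, the runs could be driven by a loop. This is a sketch under assumptions: `sweep_threads`, `fastest`, and the caller-supplied `run_once(n_threads)` function are hypothetical names, and the timing is plain wall-clock time from the standard library, not llama.cpp's own `llama_print_timings` output.

```python
import time


def sweep_threads(run_once, thread_counts, repeats=2):
    """Time run_once(n) `repeats` times per thread count; return {n: [seconds, ...]}."""
    results = {}
    for n in thread_counts:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_once(n)
            timings.append(time.perf_counter() - start)
        results[n] = timings
    return results


def fastest(results):
    """Return the thread count with the lowest mean wall-clock time."""
    return min(results, key=lambda n: sum(results[n]) / len(results[n]))


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a model. Replace it with a
    # function that builds Llama(model_path=..., n_threads=n) and runs the prompt.
    def fake_run(n):
        time.sleep(0.01 * abs(n - 6) + 0.01)

    results = sweep_threads(fake_run, [1, 2, 4, 6, 7, 8])
    for n, timings in results.items():
        print(n, [f"{t:.3f}s" for t in timings])
    print("fastest n_threads:", fastest(results))
```

Because a real `run_once` would load the model on every call, the measured time includes load time, much like the total time figures in the logs above.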