exllama
Slower tokens/s than expected
Hello, I am running a PC with 2x 4090s on Windows, using exllama with 7B Llama-2.
I am only getting ~70-75 t/s during inference (using just one 4090), but based on the charts, I should be getting 140+ t/s.
What could be causing this?
I know on Windows, Hardware-Accelerated GPU Scheduling can make a big difference to performance, so you might try enabling that.
But even without that you should be seeing more t/s on a single 4090. There could be something keeping the GPU occupied or power limited, or maybe your CPU is very slow?
I recently added the --affinity argument, which you could try. It will pin the process to the listed cores, just in case Windows tries to schedule ExLlama on efficiency cores for some reason. E.g. run with --affinity 0,1,2,3,4,5,6,7 or whatever is appropriate for your CPU.
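Note that a custom script which never parses ExLlama's command-line arguments won't pick up --affinity on its own. As a minimal sketch, assuming the psutil package is installed and that cores 0-7 are the performance cores (both assumptions, adjust for your machine), you can pin the process yourself:
```
# Sketch: pin the current process to specific cores from inside the script.
# Assumes psutil is installed (pip install psutil); core indices 0-7 are an
# assumption about which cores are P-cores -- check your CPU's layout.
import psutil

p = psutil.Process()                  # the current Python process
p.cpu_affinity(list(range(8)))        # restrict scheduling to cores 0-7
print("Pinned to cores:", p.cpu_affinity())
```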
Hmm, my CPU shouldn't be slow (13700K), but it may not be using everything it needs to; it seems not to be using all cores.
Do I set --affinity as an arg to any of my inference scripts? It didn't seem to affect the CPU usage or speed much: Time taken for Response: 8.5787 seconds, tokens total: 696, tokens/second: 81.12
For reference, my inference code:
```
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import os, glob, time

# Directory containing model, tokenizer, generator
model_directory = "C:\\Teknium\\Models\\StableBeluga-7B-GPTQ\\"

# Locate files we need within that directory
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]

# Create config, model, tokenizer and generator
config = ExLlamaConfig(model_config_path)              # create config from config.json
config.model_path = model_path                         # supply path to model weights file
model = ExLlama(config)                                # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)           # create tokenizer from tokenizer model file
cache = ExLlamaCache(model)                            # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)  # create generator

# Configure generator
# generator.disallow_tokens([tokenizer.eos_token_id])
generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.95
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5

# Produce a simple generation
prompt = "### Instruction:\nWrite a story about a dog getting out of jail\n### Response:\n"
print(prompt, end = "")

start_time = time.time()
output = generator.generate_simple(prompt, max_new_tokens = 2048)
print(output[len(prompt):])
end_time = time.time()                     # end timing
elapsed_time = end_time - start_time       # time taken for the generation

num_tokens = len(tokenizer.encode(output[len(prompt):]).tolist()[0])
print(f"Time taken for Response: {elapsed_time:.4f} seconds")
print(f"tokens total: {num_tokens}")
print(f"tokens/second: {num_tokens / elapsed_time:.2f}")
```
Launching that script with --affinity 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 yields these graphs during inference:
--affinity would only matter if for some reason the OS scheduler isn't doing its job properly and assigning the process to performance cores, which it should do automatically. The fact that some cores are hitting 100% doesn't mean you're CPU-bound, though. PyTorch/CUDA will always do that, no matter what. It doesn't yield available CPU time while synchronizing to the GPU.
Do you have hardware-accelerated GPU scheduling enabled? And is there anything else using the same GPU, like an animated Windows wallpaper or something? Long shot, I know, but it's worth ruling it out just to be sure.
Will --affinity work regardless of whether the script implements anything to handle it directly?
I am now getting the expected speeds for multi-GPU 70B inference, about 13.5 t/s average, and I do get a boost on 7B, from 78 to 86 tok/s, after upgrading to Windows 11, but 7B is still almost 45% slower than it should be. I disabled hardware-accelerated GPU scheduling; I'll let you know once I restart and it takes effect. I will also try isolating to the 2nd GPU, which has no display attached, to see if the speed is faster. Would I do that by setting the device map to [0,24]?
Hardware-accelerated GPU scheduling should preferably be enabled, not disabled. But I don't know, Windows is odd sometimes.
To run on just the second GPU, yes, set the device map as you suggest.
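For reference, a hedged sketch of two ways to keep the model off the GPU that drives the display. It assumes exllama's set_auto_map takes a comma-separated list of per-GPU VRAM limits in GB (the same format as the -gs/--gpu_split flag) and that a 0 GB budget keeps cuda:0 empty; the paths reuse the script above:
```
# Sketch: load the 7B model onto the second 4090 only.
import os, glob
from model import ExLlama, ExLlamaConfig

model_directory = "C:\\Teknium\\Models\\StableBeluga-7B-GPTQ\\"
config = ExLlamaConfig(os.path.join(model_directory, "config.json"))
config.model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

# Option 1 (assumption: a 0 GB budget on cuda:0 pushes all layers to cuda:1):
config.set_auto_map("0,24")
model = ExLlama(config)

# Option 2: hide the display GPU from CUDA entirely. Must be set before
# torch initializes CUDA; the remaining 4090 then shows up as cuda:0.
#   os.environ["CUDA_VISIBLE_DEVICES"] = "1"
```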
I'm curious though. Have you tried just running the benchmark script, python test_benchmark_inference.py -d <your model dir> -p? It's possible that it's the sampler slowing you down and not the model itself.
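If you want a rough way to separate the sampler from the model outside the benchmark script, here is a hedged sketch that times greedy token-by-token forward passes only. It assumes exllama v1's model.forward(input_ids, cache, preprocess_only=...) call as used by the repo's own scripts, and reuses the model, cache, tokenizer and prompt objects from the script above, so treat it as an illustration rather than a drop-in:
```
# Sketch: time raw forward passes with greedy argmax (no sampler involved).
# Assumes model, cache, tokenizer and prompt already exist as in the script above.
import time
import torch

ids = tokenizer.encode(prompt)          # (1, seq_len) token tensor
cache.current_seq_len = 0               # start from an empty cache
with torch.no_grad():
    model.forward(ids[:, :-1], cache, preprocess_only = True)   # prefill the prompt

    torch.cuda.synchronize()
    start = time.time()
    token = ids[:, -1:]
    for _ in range(128):
        logits = model.forward(token, cache)                    # one decode step
        token = torch.argmax(logits[:, -1, :], dim = -1, keepdim = True).cpu()
    torch.cuda.synchronize()

print(f"{128 / (time.time() - start):.2f} tokens/second (model only, no sampler)")
```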
Also, what NVIDIA driver version are you on? Apparently everyone has been seeing a big drop in performance after version 535.something.
My driver is 31.0.15.3667 (NVIDIA 536.67).
I'll try the benchmark script.
That's definitely one of the newer drivers that people have been having issues with. You might want to try on 531.x.
Will update when I downgrade the drivers and run the benchmark script.
Update from the benchmark script (haven't rolled back the driver yet):
```
-- Tokenizer: C:\Teknium\Models\StableBeluga-7B-GPTQ\tokenizer.model
-- Model config: C:\Teknium\Models\StableBeluga-7B-GPTQ\config.json
-- Model: C:\Teknium\Models\StableBeluga-7B-GPTQ\gptq_model-4bit-128g.safetensors
-- Sequence length: 2048
-- Tuning:
-- --sdp_thd: 8
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- Options: ['perf']
** Time, Load model: 3.01 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): 128
-- Act-order (inferred): yes
** VRAM, Model: [cuda:0] 3,638.47 MB - [cuda:1] 0.00 MB
** VRAM, Cache: [cuda:0] 1,024.00 MB - [cuda:1] 0.00 MB
-- Warmup pass 1...
** Time, Warmup: 1.28 seconds
-- Warmup pass 2...
** Time, Warmup: 0.16 seconds
-- Inference, first pass.
** Time, Inference: 0.17 seconds
** Speed: 11437.50 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 77.62 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 97.20 tokens/second
** VRAM, Inference: [cuda:0] 143.92 MB - [cuda:1] 0.00 MB
** VRAM, Total: [cuda:0] 4,806.38 MB - [cuda:1] 0.00 MB
```
77-97 tok/s
The prompt speed is lower than it should be as well. Kind of suggests the GPU is running slower than it should for some reason.
Updated to a driver one version newer, 536.99; benchmark speed is slightly lower now. I'll revert through the last ~5-10 versions next:
-- Generating 128 tokens, 1920 token prompt... ** Speed: 76.58 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 91.48 tokens/second
Update: I had disabled hardware-accelerated GPU scheduling a while ago; I just turned it back on, and now:
-- Generating 128 tokens, 1920 token prompt... ** Speed: 100.54 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 119.11 tokens/second
Still on that latest driver; now I will revert through the downgrades until I find the driver version that maxes it out. Much closer!
Rolled back to the original driver now, with hardware acceleration on:
Driver: 536.67 w/ Hardware Acceleration
-- Generating 128 tokens, 1920 token prompt... ** Speed: 104.43 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 130.66 tokens/second
Interesting note here: with hardware accel back on, 70B multi-GPU inference takes a big hit, back down to ~11 tok/s from 16.
Driver Version: 536.40 w/ Hardware Acceleration
-- Generating 128 tokens, 1920 token prompt... ** Speed: 103.78 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 124.18 tokens/second
Update 2: I'll just add each driver version's benchmarks here for a comprehensive list in one post.
Driver Version: 536.23
-- Generating 128 tokens, 1920 token prompt... ** Speed: 102.12 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 115.82 tokens/second
Driver Version: 532.03
** Speed: 12213.77 tokens/second
-- Generating 128 tokens, 1920 token prompt... ** Speed: 103.64 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 129.70 tokens/second
Driver Version: 531.79
-- Generating 128 tokens, 1920 token prompt... ** Speed: 102.45 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 127.56 tokens/second