exllama
Slower tokens/s than expected
Hello, I am running a PC with 2x 4090s on Windows, using exllama with 7B Llama-2.
I am only getting ~70-75 t/s during inference (using just one 4090), but based on the charts, I should be getting 140+ t/s.
What could be causing this?
I know on Windows, Hardware-Accelerated GPU Scheduling can make a big difference to performance, so you might try enabling that.
But even without that you should be seeing more t/s on a single 4090. There could be something keeping the GPU occupied or power limited, or maybe your CPU is very slow?
I recently added the --affinity argument, which you could try. It will pin the process to the listed cores, just in case Windows tries to schedule ExLlama on efficiency cores for some reason. E.g. run with --affinity 0,1,2,3,4,5,6,7 or whatever is appropriate for your CPU.
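Note that a custom script which never parses ExLlama's command-line arguments won't pick up --affinity on its own. As a minimal sketch, assuming the psutil package is installed and that cores 0-7 are the performance cores (both assumptions, adjust for your machine), you can pin the process yourself:
```
# Sketch: pin the current process to specific cores from inside the script.
# Assumes psutil is installed (pip install psutil); core indices 0-7 are an
# assumption about which cores are P-cores -- check your CPU's layout.
import psutil

p = psutil.Process()                  # the current Python process
p.cpu_affinity(list(range(8)))        # restrict scheduling to cores 0-7
print("Pinned to cores:", p.cpu_affinity())
```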
Hmm, my CPU shouldn't be slow (13700K), but it may not be using everything it needs to; it seems not to be using all cores.
Do I set --affinity as an arg to any of my inference scripts? It didn't seem to affect the CPU usage or speed much: Time taken for Response: 8.5787 seconds, tokens total: 696, tokens/second: 81.12
For reference, my inference code:
```
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import os, glob, time

# Directory containing model, tokenizer, generator
model_directory = "C:\\Teknium\\Models\\StableBeluga-7B-GPTQ\\"

# Locate files we need within that directory
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]

# Create config, model, tokenizer and generator
config = ExLlamaConfig(model_config_path)              # create config from config.json
config.model_path = model_path                         # supply path to model weights file
model = ExLlama(config)                                # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)           # create tokenizer from tokenizer model file
cache = ExLlamaCache(model)                            # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)  # create generator

# Configure generator
# generator.disallow_tokens([tokenizer.eos_token_id])
generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.95
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5

# Produce a simple generation
prompt = "### Instruction:\nWrite a story about a dog getting out of jail\n### Response:\n"
print(prompt, end = "")

start_time = time.time()
output = generator.generate_simple(prompt, max_new_tokens = 2048)
print(output[len(prompt):])
end_time = time.time()                     # end timing
elapsed_time = end_time - start_time       # time taken for the generation

num_tokens = len(tokenizer.encode(output[len(prompt):]).tolist()[0])
print(f"Time taken for Response: {elapsed_time:.4f} seconds")
print(f"tokens total: {num_tokens}")
print(f"tokens/second: {num_tokens / elapsed_time:.2f}")
```
Launching that script with --affinity 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 yields these graphs during inference:
--affinity would only matter if for some reason the OS scheduler isn't doing its job properly and assigning the process to performance cores, which it should do automatically. The fact that some cores are hitting 100% doesn't mean you're CPU-bound, though. PyTorch/CUDA will always do that, no matter what. It doesn't yield available CPU time while synchronizing to the GPU.
Do you have hardware-accelerated GPU scheduling enabled? And is there anything else using the same GPU, like an animated Windows wallpaper or something? Long shot, I know, but it's worth ruling it out just to be sure.
Will --affinity work regardless of whether the script implements anything to handle it directly?
I am now getting the expected speeds for multi-GPU 70B inference, about 13.5 t/s average, and I do get a boost on 7B, from 78 to 86 tok/s, after upgrading to Windows 11, but 7B is still almost 45% slower than it should be. I disabled hardware-accelerated GPU scheduling; I'll let you know once I restart and it takes effect. I will also try isolating to the 2nd GPU, which has no display attached, to see if the speed is faster. Would I do that by setting the device map to [0,24]?
Hardware-accelerated GPU scheduling should preferably be enabled, not disabled. But I don't know, Windows is odd sometimes.
To run on just the second GPU, yes, set the device map as you suggest.
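For reference, a hedged sketch of two ways to keep the model off the GPU that drives the display. It assumes exllama's set_auto_map takes a comma-separated list of per-GPU VRAM limits in GB (the same format as the -gs/--gpu_split flag) and that a 0 GB budget keeps cuda:0 empty; the paths reuse the script above:
```
# Sketch: load the 7B model onto the second 4090 only.
import os, glob
from model import ExLlama, ExLlamaConfig

model_directory = "C:\\Teknium\\Models\\StableBeluga-7B-GPTQ\\"
config = ExLlamaConfig(os.path.join(model_directory, "config.json"))
config.model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

# Option 1 (assumption: a 0 GB budget on cuda:0 pushes all layers to cuda:1):
config.set_auto_map("0,24")
model = ExLlama(config)

# Option 2: hide the display GPU from CUDA entirely. Must be set before
# torch initializes CUDA; the remaining 4090 then shows up as cuda:0.
#   os.environ["CUDA_VISIBLE_DEVICES"] = "1"
```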
I'm curious though. Have you tried just running the benchmark script, python test_benchmark_inference.py -d <your model dir> -p? It's possible that it's the sampler slowing you down and not the model itself.
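If you want a rough way to separate the sampler from the model outside the benchmark script, here is a hedged sketch that times greedy token-by-token forward passes only. It assumes exllama v1's model.forward(input_ids, cache, preprocess_only=...) call as used by the repo's own scripts, and reuses the model, cache, tokenizer and prompt objects from the script above, so treat it as an illustration rather than a drop-in:
```
# Sketch: time raw forward passes with greedy argmax (no sampler involved).
# Assumes model, cache, tokenizer and prompt already exist as in the script above.
import time
import torch

ids = tokenizer.encode(prompt)          # (1, seq_len) token tensor
cache.current_seq_len = 0               # start from an empty cache
with torch.no_grad():
    model.forward(ids[:, :-1], cache, preprocess_only = True)   # prefill the prompt

    torch.cuda.synchronize()
    start = time.time()
    token = ids[:, -1:]
    for _ in range(128):
        logits = model.forward(token, cache)                    # one decode step
        token = torch.argmax(logits[:, -1, :], dim = -1, keepdim = True).cpu()
    torch.cuda.synchronize()

print(f"{128 / (time.time() - start):.2f} tokens/second (model only, no sampler)")
```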
Also, what NVIDIA driver version are you on? Apparently everyone has been seeing a big drop in performance after version 535.something.
My driver is 31.0.15.3667 (NVIDIA 536.67).
I'll try the benchmark script.
That's definitely one of the newer drivers that people have been having issues with. You might want to try on 531.x.
Will update when I downgrade the drivers and run the benchmark script.
Update from the benchmark script (haven't rolled back the driver yet):
```
-- Tokenizer: C:\Teknium\Models\StableBeluga-7B-GPTQ\tokenizer.model
-- Model config: C:\Teknium\Models\StableBeluga-7B-GPTQ\config.json
-- Model: C:\Teknium\Models\StableBeluga-7B-GPTQ\gptq_model-4bit-128g.safetensors
-- Sequence length: 2048
-- Tuning:
-- --sdp_thd: 8
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- Options: ['perf']
** Time, Load model: 3.01 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): 128
-- Act-order (inferred): yes
** VRAM, Model: [cuda:0] 3,638.47 MB - [cuda:1] 0.00 MB
** VRAM, Cache: [cuda:0] 1,024.00 MB - [cuda:1] 0.00 MB
-- Warmup pass 1...
** Time, Warmup: 1.28 seconds
-- Warmup pass 2...
** Time, Warmup: 0.16 seconds
-- Inference, first pass.
** Time, Inference: 0.17 seconds
** Speed: 11437.50 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 77.62 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 97.20 tokens/second
** VRAM, Inference: [cuda:0] 143.92 MB - [cuda:1] 0.00 MB
** VRAM, Total: [cuda:0] 4,806.38 MB - [cuda:1] 0.00 MB
```
77-97 tok/s
The prompt speed is lower than it should be as well. Kind of suggests the GPU is running slower than it should for some reason.
Updated to a driver one version newer, 536.99; benchmark speed is slightly lower now. I'll revert through the last ~5-10 versions next:
-- Generating 128 tokens, 1920 token prompt... ** Speed: 76.58 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 91.48 tokens/second
Update: I had disabled hardware-accelerated GPU scheduling a while ago; I just turned it back on, and now:
-- Generating 128 tokens, 1920 token prompt... ** Speed: 100.54 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 119.11 tokens/second
Still on that latest driver; now I will revert through the downgrades until I find the driver version that maxes it out. Much closer!
Rolled back to the original driver now, with hardware acceleration on:
Driver: 536.67 w/ Hardware Acceleration
-- Generating 128 tokens, 1920 token prompt... ** Speed: 104.43 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 130.66 tokens/second
Interesting note here: with hardware accel back on, 70B multi-GPU inference takes a big hit, back down to ~11 tok/s from 16.
Driver Version: 536.40 w/ Hardware Acceleration
-- Generating 128 tokens, 1920 token prompt... ** Speed: 103.78 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 124.18 tokens/second
Update 2: I'll just add each driver version's benchmarks here for a comprehensive list in one post.
Driver Version: 536.23
-- Generating 128 tokens, 1920 token prompt... ** Speed: 102.12 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 115.82 tokens/second
Driver Version: 532.03
** Speed: 12213.77 tokens/second
-- Generating 128 tokens, 1920 token prompt... ** Speed: 103.64 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 129.70 tokens/second
Driver Version: 531.79
-- Generating 128 tokens, 1920 token prompt... ** Speed: 102.45 tokens/second
-- Generating 128 tokens, 4 token prompt... ** Speed: 127.56 tokens/second