exllama
OOM/CUDA errors when running in batch mode?
```python
import argparse
import os
import glob
import time
import subprocess
from itertools import cycle

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# Directory containing model, tokenizer, generator
model_directory = "/root/pulsar/Charybdis-v1.0-GPTQ"

# Locate files we need within that directory
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]


def generate_responses(num_prompts, response_length=200, prompt_multiplier=10):

    # Generate prompts
    base_prompts = [
        "Once upon a time," * prompt_multiplier,
        "I don't like to" * prompt_multiplier,
        "A turbo encabulator is a" * prompt_multiplier,
        "In the words of Mark Twain," * prompt_multiplier,
    ]
    prompts = [p for _, p in zip(range(num_prompts), cycle(base_prompts))]

    # Create config, model, tokenizer and generator
    config = ExLlamaConfig(model_config_path)              # create config from config.json
    config.model_path = model_path                         # supply path to model weights file
    model = ExLlama(config)                                 # create ExLlama instance and load the weights
    tokenizer = ExLlamaTokenizer(tokenizer_path)            # create tokenizer from tokenizer model file
    cache = ExLlamaCache(model, batch_size=len(prompts))    # create cache for inference
    generator = ExLlamaGenerator(model, tokenizer, cache)   # create generator

    # Configure generator
    generator.disallow_tokens([tokenizer.eos_token_id])
    generator.settings.token_repetition_penalty_max = 1.2
    generator.settings.temperature = 0.95
    generator.settings.top_p = 0.65
    generator.settings.top_k = 100
    generator.settings.typical = 0.5

    # Generate, batched
    start_time_batch = time.time()
    output = generator.generate_simple(prompts, max_new_tokens=response_length)
    end_time_batch = time.time()
    time_taken = end_time_batch - start_time_batch

    for line in output:
        print("---")
        print(line)

    print(
        f"Time taken to generate {len(prompts)} responses in BATCH MODE: {time_taken} seconds. "
        f"Average time per prompt: {time_taken / len(prompts)} seconds"
    )

    command = "nvidia-smi"
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    print(result.stdout)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Generate responses for given number of prompts."
    )
    parser.add_argument(
        "-p", "--prompts", type=int, help="Number of prompts for the generator."
    )
    parser.add_argument(
        "-l", "--response_length", type=int, help="Length of the response in tokens."
    )
    parser.add_argument(
        "-m", "--prompt_multiplier", type=int, help="Multiplier for the prompts."
    )
    args = parser.parse_args()

    generate_responses(args.prompts, args.response_length, args.prompt_multiplier)
```
I'm using a slightly modified version of example_batch.py to test batching performance, and I run into errors on an A6000 (48 GB VRAM) with -p set to 10 and -m > 25. The weird thing is that with -m set to 25 it works fine and only uses about half the VRAM, i.e. around 25 GB. What could be going on?
Command: python example_batch.py -l 200 -p 10 -m 26
Output:
python example_batch.py -l 200 -p 10 -m 26
Already up to date.
Traceback (most recent call last):
File "/root/pulsar/example_batch.py", line 85, in <module>
generate_responses(args.prompts, args.response_length, args.prompt_multiplier)
File "/root/pulsar/example_batch.py", line 55, in generate_responses
output = generator.generate_simple(prompts, max_new_tokens=response_length)
File "/root/pulsar/generator.py", line 311, in generate_simple
self.gen_begin(ids)
File "/root/pulsar/generator.py", line 177, in gen_begin
self.model.forward(self.sequence[:, a:b], self.cache, preprocess_only = True, lora = self.lora)
File "/root/pulsar/model.py", line 860, in forward
hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
File "/root/pulsar/model.py", line 466, in forward
hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
File "/root/pulsar/model.py", line 391, in forward
new_keys.copy_(key_states)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Command: python example_batch.py -l 200 -p 10 -m 25
Output:
Time taken to generate 10 responses in BATCH MODE: 21.112586736679077 seconds. Average time per prompt: 2.1112586736679075 seconds
Fri Jun 30 19:38:00 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:02:00.0 Off | Off |
| 35% 67C P2 296W / 300W| 24351MiB / 49140MiB | 95% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Okay, figured it out -- with batching, a lot more is loaded into memory at once, so seq_length matters (it needs to be big enough to fit the batch). Increasing it with cpe scaling seems to have done the trick, letting me run things like python example_batch.py -l 200 -p 10 -m 51 and get:
Time taken to generate 10 responses in BATCH MODE: 22.564942359924316 seconds. Average time per prompt: 2.256494235992432 seconds
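Concretely, the change amounts to something like this, using the names from the script above (just a sketch -- the scale factor is illustrative, and it assumes the stock max_seq_len / compress_pos_emb fields on ExLlamaConfig):

```python
# Sketch: enlarge the context so the batched prompts fit, and scale the position
# embeddings ("cpe") by the same factor. scale = 4 is illustrative only.
scale = 4
config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.max_seq_len = 2048 * scale          # per-batch-element context/cache length
config.compress_pos_emb = float(scale)     # "cpe" scaling to match the longer context

model = ExLlama(config)
cache = ExLlamaCache(model, batch_size=len(prompts))
generator = ExLlamaGenerator(model, tokenizer, cache)
```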
Based on the above, I can roughly put 51 * 10 * 19 (9,690) words into context before OOMing. But if I increase p to 20 and decrease m to 20, I still can't run the code without OOMing, even though the number of tokens in memory should be roughly the same. Any idea what's going on? I.e. running python example_batch.py -l 200 -p 20 -m 20 causes me to OOM.
Seems like a higher batch count means much more memory? Setting p to anything over 10 causes OOMs on a 48 GB card; even python example_batch.py -l 200 -p 11 -m 1 fails.
The size of the cache is: 2 * max_seq_len * num_hidden_layers * hidden_size * sizeof(float16). For a 2048-token context that works out to:
- 7b: 1,024 MB
- 13b: 1,600 MB
- 33b: 3,120 MB
- 65b: 5,120 MB
If you increase the sequence length, the cache size increases proportionately, so e.g. 7b at 8k context needs a 4 GB cache. Likewise, increasing the batch size also multiplies the size of the cache.
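The same arithmetic in code, for reference (the layer counts and hidden sizes are the standard LLaMA values):

```python
# KV cache size per the formula above: 2 tensors (K and V), each
# seq_len x num_hidden_layers x hidden_size, stored in fp16 (2 bytes)
def kv_cache_mb(max_seq_len, num_hidden_layers, hidden_size, batch_size=1):
    return 2 * max_seq_len * num_hidden_layers * hidden_size * 2 * batch_size / (1024 ** 2)

llama_dims = {"7b": (32, 4096), "13b": (40, 5120), "33b": (60, 6656), "65b": (80, 8192)}

for name, (layers, hidden) in llama_dims.items():
    print(f"{name}: {kv_cache_mb(2048, layers, hidden):,.0f} MB @ 2048 tokens, "
          f"{kv_cache_mb(8192, layers, hidden):,.0f} MB @ 8192 tokens")
```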
I haven't extensively tested how well it deals with very large batches. There is a known issue with one or more of the CUDA kernels acting up if the total sequence length exceeds a certain threshold during prompt processing, something to do with the block grid, idk. It's possible that it also triggers when the sequence length times the batch size exceeds that threshold. You can try setting config.max_input_len = 512 or some such, which puts some restraints on prompt processing. It isn't really a good solution, though, since it sacrifices a bit of performance.
I'll look into it soon.
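For reference, that workaround looks something like this (set it on the config before the model is created; 512 is just an example value):

```python
config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.max_input_len = 512   # process prompts in chunks of at most 512 tokens

model = ExLlama(config)      # create the model after changing the config
```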
> Okay, figured it out -- with batching, a lot more is loaded into memory at once, so seq_length matters (it needs to be big enough to fit the batch)
Also, to run large batches I have to scale seq_len as well, making batch_size and seq_len dependent on each other -- this shouldn't be the case, right @turboderp? Shouldn't the cache size initialization be separate from the model's seq_len in batch mode?
```python
def calculate_max_size_per_batch(batch_size, sequence_length=2048):
    # Set some values up
    num_hidden_layers = 40
    hidden_size = 5120
    sizeof_float16 = 2   # bytes
    vram_available = 15  # 15 GB for 3090/4090

    # Figure out VRAM used by the cache and what remains
    cache_vram_used = (2 * batch_size * num_hidden_layers * hidden_size * sizeof_float16 * sequence_length) / (1024 * 1024 * 1024)
    cache_vram_remaining = vram_available - cache_vram_used

    max_prompt_size = sequence_length / batch_size  # max prompt size in tokens

    if cache_vram_remaining < 1:  # leaving enough for overhead
        return "OOM"
    return (
        f"{cache_vram_remaining} GB",
        f"{int(max_prompt_size * (3 / 5))}-{int(max_prompt_size * (3 / 4.3))} words",
        f"roughly {max_prompt_size} tokens",
    )


batch_size = 2
total_token_size = 18000  # derived based off vRAM
seq_len = total_token_size / batch_size

print(f"Please set seq len to {seq_len}")
print(calculate_max_size_per_batch(batch_size, seq_len))
```
This is the code I'm using to figure out batching params that make the most of the available VRAM. Without scaling seq_len with batch_size, it always OOMs.
@turboderp Small bump on the last post -- how is seq_len supposed to work in batch mode? Do I need to increase cpe with seq_len even if the per-batch seq_len is lower?
You should use the cpe value that's appropriate for the model, in any case. The 8k SuperHOT models are tuned for a factor of 4.0, regardless of how much of the useful 8192-token space you end up using.
As for its relation to batch mode, batch size and sequence length are independent dimensions of the input. A batch size of 3 just means doing everything 3 times in parallel, which also requires 3 times as much VRAM for the cache and for the attention weights. There shouldn't be any other side effects like affecting the position embeddings.
The size of the attention weight matrix is batch_size * (past_len + seq_len) * seq_len * num_heads * sizeof(float16) where seq_len here is just the tokens sent through the forward pass in one chunk. You probably want to account for that as well, but it's limited by default to the memory that would be used for a 2048-token inference. You can configure it with the max_input_len parameter in the config, or max_attention_size. This is a relatively recent addition, though, so I'm not sure if it applies to your example. But definitely if you try to run inference on even a single batch of 9000 tokens that can eat up some gigabytes also.
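To put rough numbers on that (13b has 40 attention heads; this is just the formula above evaluated, not exllama code):

```python
# attention weights: batch_size * (past_len + seq_len) * seq_len * num_heads * sizeof(float16)
def attn_weights_gb(batch_size, past_len, seq_len, num_heads=40):
    return batch_size * (past_len + seq_len) * seq_len * num_heads * 2 / (1024 ** 3)

print(attn_weights_gb(1, 0, 2048))    # ~0.3 GB -- one 2048-token chunk
print(attn_weights_gb(1, 0, 9000))    # ~6.0 GB -- a 9000-token prompt in a single pass
print(attn_weights_gb(10, 0, 2048))   # ~3.1 GB -- a 2048-token chunk at batch size 10
```

So a large batch of long prompts can spike well past the cache's footprint during prompt processing, even when the cache itself fits.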