exllama
OOM/CUDA errors when running in batch mode?
```python
import argparse
import os
import glob
import time
import subprocess
from itertools import cycle

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# Directory containing model, tokenizer, generator
model_directory = "/root/pulsar/Charybdis-v1.0-GPTQ"

# Locate files we need within that directory
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]


def generate_responses(num_prompts, response_length=200, prompt_multiplier=10):

    # Generate prompts
    base_prompts = [
        "Once upon a time," * prompt_multiplier,
        "I don't like to" * prompt_multiplier,
        "A turbo encabulator is a" * prompt_multiplier,
        "In the words of Mark Twain," * prompt_multiplier,
    ]
    prompts = [p for _, p in zip(range(num_prompts), cycle(base_prompts))]

    # Create config, model, tokenizer and generator
    config = ExLlamaConfig(model_config_path)              # create config from config.json
    config.model_path = model_path                         # supply path to model weights file
    model = ExLlama(config)                                 # create ExLlama instance and load the weights
    tokenizer = ExLlamaTokenizer(tokenizer_path)            # create tokenizer from tokenizer model file
    cache = ExLlamaCache(model, batch_size=len(prompts))    # create cache for inference
    generator = ExLlamaGenerator(model, tokenizer, cache)   # create generator

    # Configure generator
    generator.disallow_tokens([tokenizer.eos_token_id])
    generator.settings.token_repetition_penalty_max = 1.2
    generator.settings.temperature = 0.95
    generator.settings.top_p = 0.65
    generator.settings.top_k = 100
    generator.settings.typical = 0.5

    # Generate, batched
    start_time_batch = time.time()
    output = generator.generate_simple(prompts, max_new_tokens=response_length)
    end_time_batch = time.time()
    time_taken = end_time_batch - start_time_batch

    for line in output:
        print("---")
        print(line)

    print(
        f"Time taken to generate {len(prompts)} responses in BATCH MODE: {time_taken} seconds. "
        f"Average time per prompt: {time_taken / len(prompts)} seconds"
    )

    command = "nvidia-smi"
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    print(result.stdout)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Generate responses for given number of prompts."
    )
    parser.add_argument(
        "-p", "--prompts", type=int, help="Number of prompts for the generator."
    )
    parser.add_argument(
        "-l", "--response_length", type=int, help="Length of the response in tokens."
    )
    parser.add_argument(
        "-m", "--prompt_multiplier", type=int, help="Multiplier for the prompts."
    )
    args = parser.parse_args()

    generate_responses(args.prompts, args.response_length, args.prompt_multiplier)
```
I'm using a slightly modified version of example_batch.py to test batching performance, and I run into errors on an A6000 (48 GB VRAM) with -p set to 10 and -m > 25. The weird thing is that with -m set to 25 it works fine and only uses about half the VRAM, i.e. around 25 GB. What could be going on?
Command: python example_batch.py -l 200 -p 10 -m 26
Output:
python example_batch.py -l 200 -p 10 -m 26
Already up to date.
Traceback (most recent call last):
File "/root/pulsar/example_batch.py", line 85, in <module>
generate_responses(args.prompts, args.response_length, args.prompt_multiplier)
File "/root/pulsar/example_batch.py", line 55, in generate_responses
output = generator.generate_simple(prompts, max_new_tokens=response_length)
File "/root/pulsar/generator.py", line 311, in generate_simple
self.gen_begin(ids)
File "/root/pulsar/generator.py", line 177, in gen_begin
self.model.forward(self.sequence[:, a:b], self.cache, preprocess_only = True, lora = self.lora)
File "/root/pulsar/model.py", line 860, in forward
hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
File "/root/pulsar/model.py", line 466, in forward
hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
File "/root/pulsar/model.py", line 391, in forward
new_keys.copy_(key_states)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Command: python example_batch.py -l 200 -p 10 -m 25
Output:
Time taken to generate 10 responses in BATCH MODE: 21.112586736679077 seconds. Average time per prompt: 2.1112586736679075 seconds
Fri Jun 30 19:38:00 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:02:00.0 Off | Off |
| 35% 67C P2 296W / 300W| 24351MiB / 49140MiB | 95% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Okay, figured it out -- with batching, a lot more is loaded into memory at once, so seq_length matters (it needs to be big enough to fit the batch). Increasing it with cpe scaling seems to have done the trick, letting me run things like python example_batch.py -l 200 -p 10 -m 51 and get:
Time taken to generate 10 responses in BATCH MODE: 22.564942359924316 seconds. Average time per prompt: 2.256494235992432 seconds
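Concretely, the change amounts to something like this, using the names from the script above (just a sketch -- the scale factor is illustrative, and it assumes the stock max_seq_len / compress_pos_emb fields on ExLlamaConfig):

```python
# Sketch: enlarge the context so the batched prompts fit, and scale the position
# embeddings ("cpe") by the same factor. scale = 4 is illustrative only.
scale = 4
config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.max_seq_len = 2048 * scale          # per-batch-element context/cache length
config.compress_pos_emb = float(scale)     # "cpe" scaling to match the longer context

model = ExLlama(config)
cache = ExLlamaCache(model, batch_size=len(prompts))
generator = ExLlamaGenerator(model, tokenizer, cache)
```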
Based on the above, I can roughly put 51 * 10 * 19 (9,690) words into context before OOMing. But if I increase p to 20 and decrease m to 20, I still can't run the code without OOMing, even though the number of tokens in memory should be roughly the same. Any idea what's going on? I.e. running python example_batch.py -l 200 -p 20 -m 20 causes me to OOM.
Seems like a higher batch count means much more memory? Setting p to anything over 10 causes OOMs on a 48 GB card; even python example_batch.py -l 200 -p 11 -m 1 fails.
The size of the cache is: 2 * max_seq_len * num_hidden_layers * hidden_size * sizeof(float16). For a 2048-token context that works out to:
- 7b: 1,024 MB
- 13b: 1,600 MB
- 33b: 3,120 MB
- 65b: 5,120 MB
If you increase the sequence length, the cache size increases proportionately, so e.g. 7b at 8k context needs a 4 GB cache. Likewise, increasing the batch size also multiplies the size of the cache.
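The same arithmetic in code, for reference (the layer counts and hidden sizes are the standard LLaMA values):

```python
# KV cache size per the formula above: 2 tensors (K and V), each
# seq_len x num_hidden_layers x hidden_size, stored in fp16 (2 bytes)
def kv_cache_mb(max_seq_len, num_hidden_layers, hidden_size, batch_size=1):
    return 2 * max_seq_len * num_hidden_layers * hidden_size * 2 * batch_size / (1024 ** 2)

llama_dims = {"7b": (32, 4096), "13b": (40, 5120), "33b": (60, 6656), "65b": (80, 8192)}

for name, (layers, hidden) in llama_dims.items():
    print(f"{name}: {kv_cache_mb(2048, layers, hidden):,.0f} MB @ 2048 tokens, "
          f"{kv_cache_mb(8192, layers, hidden):,.0f} MB @ 8192 tokens")
```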
I haven't extensively tested how well it deals with very large batches. There is a known issue with one or more of the CUDA kernels acting up if the total sequence length exceeds a certain threshold during prompt processing, something to do with the block grid, idk. It's possible that it also triggers when the sequence length times the batch size exceeds that threshold. You can try setting config.max_input_len = 512 or some such, which puts some restraints on prompt processing. It isn't really a good solution, though, since it sacrifices a bit of performance.
I'll look into it soon.
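For reference, that workaround looks something like this (set it on the config before the model is created; 512 is just an example value):

```python
config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.max_input_len = 512   # process prompts in chunks of at most 512 tokens

model = ExLlama(config)      # create the model after changing the config
```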
> Okay, figured it out -- with batching, a lot more is loaded into memory at once, so seq_length matters (it needs to be big enough to fit the batch)
Also, to run large batches I have to scale seq_len as well, making batch_size and seq_len dependent on each other -- this shouldn't be the case, right @turboderp? Shouldn't the cache size initialization be separate from the model's seq_len in batch mode?
```python
def calculate_max_size_per_batch(batch_size, sequence_length=2048):
    # Set some values up
    num_hidden_layers = 40
    hidden_size = 5120
    sizeof_float16 = 2   # bytes
    vram_available = 15  # 15 GB for 3090/4090

    # Figure out VRAM used by the cache and what remains
    cache_vram_used = (2 * batch_size * num_hidden_layers * hidden_size * sizeof_float16 * sequence_length) / (1024 * 1024 * 1024)
    cache_vram_remaining = vram_available - cache_vram_used

    max_prompt_size = sequence_length / batch_size  # max prompt size in tokens

    if cache_vram_remaining < 1:  # leaving enough for overhead
        return "OOM"
    return (
        f"{cache_vram_remaining} GB",
        f"{int(max_prompt_size * (3 / 5))}-{int(max_prompt_size * (3 / 4.3))} words",
        f"roughly {max_prompt_size} tokens",
    )


batch_size = 2
total_token_size = 18000  # derived based off vRAM
seq_len = total_token_size / batch_size

print(f"Please set seq len to {seq_len}")
print(calculate_max_size_per_batch(batch_size, seq_len))
```
This is the code I'm using to figure out batching params that make the most of the available VRAM. Without scaling seq_len with batch_size, it always OOMs.
@turboderp Small bump on the last post -- how is seq_len supposed to work in batch mode? Do I need to increase cpe with seq_len even if the per-batch seq_len is lower?
You should use the cpe value that's appropriate for the model, in any case. The 8k SuperHOT models are tuned for a factor of 4.0, regardless of how much of the useful 8192-token space you end up using.
As for its relation to batch mode, batch size and sequence length are independent dimensions of the input. A batch size of 3 just means doing everything 3 times in parallel, which also requires 3 times as much VRAM for the cache and for the attention weights. There shouldn't be any other side effects like affecting the position embeddings.
The size of the attention weight matrix is batch_size * (past_len + seq_len) * seq_len * num_heads * sizeof(float16) where seq_len here is just the tokens sent through the forward pass in one chunk. You probably want to account for that as well, but it's limited by default to the memory that would be used for a 2048-token inference. You can configure it with the max_input_len parameter in the config, or max_attention_size. This is a relatively recent addition, though, so I'm not sure if it applies to your example. But definitely if you try to run inference on even a single batch of 9000 tokens that can eat up some gigabytes also.
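To put rough numbers on that (13b has 40 attention heads; this is just the formula above evaluated, not exllama code):

```python
# attention weights: batch_size * (past_len + seq_len) * seq_len * num_heads * sizeof(float16)
def attn_weights_gb(batch_size, past_len, seq_len, num_heads=40):
    return batch_size * (past_len + seq_len) * seq_len * num_heads * 2 / (1024 ** 3)

print(attn_weights_gb(1, 0, 2048))    # ~0.3 GB -- one 2048-token chunk
print(attn_weights_gb(1, 0, 9000))    # ~6.0 GB -- a 9000-token prompt in a single pass
print(attn_weights_gb(10, 0, 2048))   # ~3.1 GB -- a 2048-token chunk at batch size 10
```

So a large batch of long prompts can spike well past the cache's footprint during prompt processing, even when the cache itself fits.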