exllama
RuntimeError: CUDA error: an illegal memory access was encountered
RuntimeError Traceback (most recent call last)
Cell In[3], line 4
2 config.model_path = model_path
3 config.max_seq_len = 2048
----> 4 model = ExLlama(config)
5 cache = ExLlamaCache(model)
6 tokenizer = ExLlamaTokenizer(tokenizer_model_path)
File /workspace/exllama/model.py:759, in ExLlama.__init__(self, config)
756 device = self.config.device_map.layers[i]
757 sin, cos = self.sincos[device]
--> 759 layer = ExLlamaDecoderLayer(self.config, tensors, f"model.layers.{i}", i, sin, cos)
761 modules.append(layer)
763 self.layers = modules
File /workspace/exllama/model.py:345, in ExLlamaDecoderLayer.__init__(self, config, tensors, key, index, sin, cos)
342 self.config = config
343 self.index = index
--> 345 self.self_attn = ExLlamaAttention(self.config, tensors, key + ".self_attn", sin, cos, self.index)
346 self.mlp = ExLlamaMLP(self.config, tensors, key + ".mlp")
348 self.input_layernorm = ExLlamaRMSNorm(self.config, tensors, key + ".input_layernorm.weight")
File /workspace/exllama/model.py:260, in ExLlamaAttention.__init__(self, config, tensors, key, sin, cos, index)
258 self.k_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_attention_heads * self.config.head_dim, False, tensors, key + ".k_proj")
259 self.v_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_attention_heads * self.config.head_dim, False, tensors, key + ".v_proj")
--> 260 self.o_proj = Ex4bitLinear(config, self.config.num_attention_heads * self.config.head_dim, self.config.hidden_size, False, tensors, key + ".o_proj")
File /workspace/exllama/model.py:137, in Ex4bitLinear.__init__(self, config, in_features, out_features, has_bias, tensors, key)
135 self.qzeros = tensors[key + ".qzeros"]
136 self.scales = tensors[key + ".scales"]
--> 137 self.g_idx = tensors[key + ".g_idx"].cpu() if key + ".g_idx" in tensors else None
138 self.bias = tensors[key + ".bias"] if has_bias else None
140 self.device = self.qweight.device
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Running on RunPod, 2x 3090, with torch 2.1.0.dev20230607+cu118.
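As the traceback suggests, setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the error surfaces at the real call site. A minimal sketch of enabling it from Python (it has to be set before CUDA is initialized, so before importing torch):

import os

# Must be set before the first CUDA call (i.e. before importing torch) so that
# kernel launches run synchronously and errors point at the real call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
print(torch.cuda.is_available())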
Since this happens during loading, I suspect you're running out of memory. You'll sometimes just get CUDA illegal memory exceptions when that happens.
But what is the model you're loading and what does the command line look like? You have to explicitly allocate space on each GPU, mind you. For a headless 2x3090 system and 65B, you'll probably want to load with -gs 18,24 (18 GB of weights on the first GPU, no limit on the second), and if that doesn't work try decreasing it a bit to -gs 17.8,24 etc.
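For reference, a rough programmatic equivalent of loading with -gs 18,24, using the set_auto_map call that appears in the config snippet further down this thread (a sketch only; the paths are placeholders and it assumes model.py from the repo is importable):

from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig("/path/to/config.json")              # placeholder path
config.model_path = "/path/to/llama-65b-4bit.safetensors"   # placeholder path
config.max_seq_len = 2048
config.set_auto_map("18,24")   # ~18 GB of weights on GPU 0, remainder on GPU 1

model = ExLlama(config)
cache = ExLlamaCache(model)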
Sorry, I forgot to check the model_init file. I adapted the config and now it is working.
config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.max_seq_len = 2048
config.set_auto_map('16,24')
config.gpu_peer_fix = True
model = ExLlama(config)
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(tokenizer_model_path)
generator = ExLlamaGenerator(model, tokenizer, cache)
Which is weird, since I'm loading a 30B 4-bit GPTQ model that is supposed to fit on a single 3090, so I'm not sure why I have to set the GPU map. However, after the split I only see one GPU getting utilized. Is this expected (basically, if one GPU can do the job, we don't touch the second one)?
No, it shouldn't need the second GPU if the model fits on the first. I guess it might be a bug. You could try with export CUDA_VISIBLE_DEVICES=0 and without any device mapping. If that works there's something wrong with the way it's trying to load the model.
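A quick sketch of that single-GPU test from Python rather than the shell (the variable has to be set before anything touches CUDA):

import os

# Hide the second 3090; with only one visible device there should be
# no need for set_auto_map / -gs at all.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())   # expect 1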
I'm also getting RuntimeError: CUDA error: an illegal memory access was encountered. I'm also using 2x 3090, with the latest oobabooga webui with ExLlama and the latest SillyTavern. It happens randomly during generation, with multiple 65B models.
What's your GPU split in this case? And if you run nvidia-smi while the model is working, what's the output?
(Within the oobabooga webui) I have set the split to 16,23.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:0A:00.0 Off |                  N/A |
|  0%   42C    P2             290W / 350W |  20720MiB / 24576MiB |     52%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:0B:00.0 Off |                  N/A |
|  0%   37C    P2             305W / 350W |  20025MiB / 24576MiB |     51%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2162      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A   1483282      C   python                                    20712MiB |
|    1   N/A  N/A      2162      G   /usr/lib/xorg/Xorg                           49MiB |
|    1   N/A  N/A   1400103    C+G   /usr/lib/virtualbox/VirtualBoxVM            551MiB |
|    1   N/A  N/A   1483282      C   python                                    19412MiB |
+---------------------------------------------------------------------------------------+
Ignore the CUDA version; it's being run from a different env. Can you read that, or should I try a screenshot?
I can read it. The important thing is whether one card was right on the cusp of running out of memory, since that can sometimes give CUDA errors like that. But you seem to have it perfectly balanced, so that's not it.
If you get these errors consistently, maybe you could try ExLlama's web UI for a bit to see if it's more stable? It would help pin down whether the bug is in ExLlama or in how Ooba is using it, or somewhere in between.
Doing further testing within the oobabooga webui, I have gotten (I think) the same error to occur again, right after this one:
key_states = cache.key_states[self.index].narrow(2, 0, past_len + q_len)
RuntimeError: start (0) + length (2049) exceeds dimension size (2048).
If I understand that correctly, it went over the set context limit by 1.
I'm going to try to reduce the context length settings and see what happens. (By the way, if this is caused by the oobabooga webui and not ExLlama, my apologies.)
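For what it's worth, the narrow() failure is easy to reproduce in isolation; a minimal sketch with a dummy cache tensor (the shape is made up, but dim 2 is the sequence dimension to match the call above):

import torch

max_seq_len = 2048
# Dummy stand-in for a pre-allocated key cache.
key_states = torch.zeros(1, 8, max_seq_len, 64)

past_len, q_len = 2048, 1
# Raises: start (0) + length (2049) exceeds dimension size (2048)
key_states.narrow(2, 0, past_len + q_len)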
Hmm, going over the context length, even by one token, is definitely an error. It won't really do anything in regular GPTQ-for-LLaMa because the cache will just grow a bit bigger and Llama still "kind of" works up to maybe 2100 tokens. But ExLlama uses a fixed, pre-allocated cache, so inference on 2049 tokens will throw an exception and then give undefined behavior if the exception is ignored.
You could try lowering the sequence length a bit, or replacing line 44 in modules/exllama.py with:
cache = ExLlamaCache(model, max_seq_len = 2056)
to give ExLlama a little more room to work.
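An alternative (not from this thread, just a sketch) is to trim the prompt so that prompt length plus requested new tokens can never exceed the fixed cache:

def trim_prompt(input_ids, max_seq_len, max_new_tokens):
    # Keep only the most recent prompt tokens so that prompt length plus the
    # requested new tokens always fits inside the fixed, pre-allocated cache.
    budget = max_seq_len - max_new_tokens
    if input_ids.shape[-1] > budget:
        input_ids = input_ids[:, -budget:]
    return input_ids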
cache = ExLlamaCache(model, max_seq_len = 2056) makes the model output pure gibberish. Even just setting it to 2049 makes it hallucinate badly.
With further testing, I believe the context going over the set limit IS causing this issue.
Within SillyTavern, using oobabooga's API and any model loaded with ExLlama, the full allowed response length is used up within the context no matter how many tokens are actually used. So if the limit is 400 tokens and the reply is "Hi", 400 tokens are still used. This only happens with models loaded using ExLlama. No clue which software is responsible for that bug, but I believe it causes the context to go over the set limit and thus causes the errors.
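One way to check that claim is to count how many tokens a reply actually consumed and compare it with the response-length budget the frontend reserves; a sketch, assuming ExLlama's tokenizer module is importable and with a placeholder path:

from tokenizer import ExLlamaTokenizer

tokenizer = ExLlamaTokenizer("/path/to/tokenizer.model")   # placeholder path

reply = "Hi"
ids = tokenizer.encode(reply)   # returns a (1, n) token tensor
print(ids.shape[-1], "tokens actually consumed by the reply")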
Hmm, you shouldn't be getting corrupted output, but maybe it's because the model config disagrees with the cache config. In which case that shouldn't really be an option for the ExLlamaCache constructor... I'll fix that eventually.
In the meantime, you could try setting max_seq_len in the model config instead, so before line 42:
config.max_seq_len = 2056
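Put together with the config snippet from earlier in the thread, that would look roughly like this (same variable names as that snippet):

config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.max_seq_len = 2056            # set here, before the model is built...
config.set_auto_map('16,24')
config.gpu_peer_fix = True

model = ExLlama(config)
cache = ExLlamaCache(model)          # ...so the cache uses the same length
tokenizer = ExLlamaTokenizer(tokenizer_model_path)
generator = ExLlamaGenerator(model, tokenizer, cache)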