
Weird issue with context length

Open zac-wang-nv opened this issue 2 years ago • 6 comments

First of all, thanks a lot for this great project!

I ran into a weird issue when generating with Llama 2 at 4096 context using generator.generate_simple:

  File "/codebase/research/exllama/model.py", line 556, in forward
    cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (3313) + length (808) exceeds 

As I understand the code, it already limits the number of new tokens to stay under the context limit. Are there any settings I might need to change?

zac-wang-nv avatar Aug 03 '23 19:08 zac-wang-nv

What is the sequence length set to in the model config? Maybe something weird is happening if you haven't changed it from the default (2048), and it tries to generate a negative number of tokens.
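(For illustration, a rough sketch of that arithmetic with hypothetical variable names, not the exact generator code:)

# Sketch of the suspected failure mode (hypothetical names, not exllama's actual code)
max_seq_len = 2048      # default, if config.max_seq_len was never changed
prompt_len  = 3313      # tokens already in the prompt (from the traceback above)
remaining   = max_seq_len - prompt_len
print(remaining)        # -1265: a "negative" number of tokens left to generate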

turboderp avatar Aug 04 '23 10:08 turboderp

Thanks for the reply.

{
    "architectures": [
        "LlamaForCausalLM"
    ],
    "bos_token_id": 1,
    "eos_token_id": 2,
    "hidden_act": "silu",
    "hidden_size": 8192,
    "initializer_range": 0.02,
    "intermediate_size": 28672,
    "max_position_embeddings": 4096,
    "max_length": 4096,
    "model_type": "llama",
    "num_attention_heads": 64,
    "num_hidden_layers": 80,
    "num_key_value_heads": 8,
    "pad_token_id": 0,
    "pretraining_tp": 1,
    "rms_norm_eps": 1e-05,
    "rope_scaling": null,
    "tie_word_embeddings": false,
    "torch_dtype": "float16",
    "transformers_version": "4.32.0.dev0",
    "use_cache": true,
    "vocab_size": 32000
}

Here is the model config file. I got the model from Llama-2-70B-chat-gptq.

zac-wang-nv avatar Aug 04 '23 14:08 zac-wang-nv

Is there more of this error message?

  File "/codebase/research/exllama/model.py", line 556, in forward
    cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (3313) + length (808) exceeds 

It looks like it's been cut off.

Also, the line number is odd. Has something else been modified in model.py? ExLlamaAttention.forward ends on line 502.

turboderp avatar Aug 05 '23 02:08 turboderp

I got a similar error. It seems to come from feeding too many tokens into the model; I was putting around 5k words in.

  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ndurkee\.conda\envs\exllama\Lib\site-packages\flask\app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\example_flask.py", line 48, in inferContextP
    outputs = generator.generate_simple(prompt, max_new_tokens=16000)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\generator.py", line 316, in generate_simple
    self.gen_begin(ids, mask = mask)
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\generator.py", line 186, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora, input_mask = mask)
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 967, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 1053, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 536, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\Flashblade_ML_Data_Test\meeting_notes\exllama\model.py", line 440, in forward
    new_keys = cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (2048) + length (1265) exceeds dimension size (2048).

I'm using the_bloke/vicuna-13B-v1.5-16K-GPTQ, which is supposed to be a 16k-context model, so it should be able to handle it. At any rate, these are the relevant portions of config.json:

    "max_sequence_length": 16384,
    "max_position_embeddings": 4096,

What worked for me was changing the parameters on lines 82-87 in model.py:

        self.max_seq_len = 16384  # Reduce to save memory. Can also be increased, ideally while also using compress_pos_emb and a compatible model/LoRA
        self.max_input_len = 4096  # Maximum length of input IDs in a single forward pass. Sequences longer than this will be processed in multiple steps
        self.max_attention_size = 2048**2  # Sequences will be processed in chunks to keep the size of the attention weights matrix <= this
        self.compress_pos_emb = 4.0  # Increase to compress positional embeddings applied to sequence

Previously, these were 2048, 2048, 4096, and 1.0 respectively. This worked and seems to give reasonable results, but I'm not sure if it's the correct way to go about it.

w013nad avatar Aug 11 '23 12:08 w013nad

@w013nad Where do you define those changes? In the source code or generator model settings?

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file

model = ExLlama(config)                                 # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)            # create tokenizer from tokenizer model file

cache = ExLlamaCache(model)                             # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)   # create generator

# Configure generator

generator.disallow_tokens([tokenizer.eos_token_id])

generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.05
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5
generator.settings.max_seq_len = 16000
# Produce a simple generation

output = generator.generate_simple(prompt_template, max_new_tokens = 500)

I am using the same model but getting the following error:

RuntimeError: start (2048) + length (1265) exceeds dimension size (2048).

Rajmehta123 avatar Sep 14 '23 22:09 Rajmehta123

@w013nad You wouldn't need to hard-code new values into the config class. You can just override the values after creating the config.

Also, it looks like that config file is incorrect. "max_sequence_length" and "max_position_embeddings" should mean the same thing, or at least I don't know how to interpret those values if they're different.

The max_input_len argument specifically sets the longest sequence allowed in a single forward pass. Longer sequences are chunked into portions of this length to reduce VRAM usage during inference and to make the VRAM requirement predictable, which is more or less required when splitting the model across multiple devices. But max_attention_size imposes an additional restriction on the chunk length. In short, setting max_input_len > sqrt(max_attention_size) just wastes a bit of VRAM.
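For example (a rough sketch of the chunking arithmetic, not the exact code path):

import math

max_input_len      = 4096
max_attention_size = 2048 ** 2

# In the worst case the attention-size limit caps a chunk at sqrt(max_attention_size),
# so a larger max_input_len only reserves extra buffer memory without ever being used
worst_case_chunk = min(max_input_len, int(math.sqrt(max_attention_size)))
print(worst_case_chunk)   # 2048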

@Rajmehta123 The max_seq_len parameter is in the ExLlamaConfig object, not the generator settings.
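Something along these lines should work (a sketch based on the snippet above; the values assume a 16k model trained with 4x linear RoPE scaling, like vicuna-13B-v1.5-16K):

config = ExLlamaConfig(model_config_path)   # create config from config.json
config.model_path = model_path              # supply path to model weights file

# Override the context settings on the config *before* building the model and cache,
# instead of editing model.py
config.max_seq_len = 16384
config.compress_pos_emb = 4.0

model = ExLlama(config)                     # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)                 # cache is sized from config.max_seq_len
generator = ExLlamaGenerator(model, tokenizer, cache)

output = generator.generate_simple(prompt_template, max_new_tokens = 500)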

turboderp avatar Sep 15 '23 00:09 turboderp