exllama
Very poor output quality
I have noticed that while it massively increases inference speed, it also massively decreases output quality: instruct models become very obstinate and give completely irrelevant responses, words become misspelled, it repeats lines over and over, and it sometimes spams Chinese characters.
I haven't seen this at all. What model are you using? And what settings?
Tried it on Chronos 13B, WizardLM 13B, and Pygmalion 7B, with temperatures between 0.5 and 1 and a context length of 2048. Lower temperatures do seem to wrangle it into behaving a little better, but I have to lower the temperature so much that the output is too "dry" to be useful. However, using the same settings and models on normal GPTQ yields satisfactory results (albeit at unsatisfactory speed).
And just to be clear, is this in ExLlama's web UI or in Ooba?
Occam's fork of KoboldAI that allows using exllama.
Using GPTQ, said fork behaves normally.
Not OP, but for context, the Kobold fork is here if you want to check it, turbo.
https://github.com/0cc4m/KoboldAI/tree/4bit-plugin (KoboldAI implementation to support GPTQ and exllama)
https://github.com/0cc4m/exllama (exllama fork on transformers branch, which builds exllama to work on Kobold)
They added Kobold samplers to that exllama fork.
So it seems these samplers are added
(I'm not sure about rep pen slope though)
Okay. I really have enough work cut out for me with this, but I guess I should try installing Kobold at some point to see how they're using it. I would assume they're just taking the logits and passing them to the same samplers they use for other models, and that should just work. But there are some peculiarities to keep in mind, specifically regarding the cache, and that "context tokens" slider looks a little suspect. But idk.
You think maybe the code wasn't hooked up to the context correctly and it's actually running with an incredibly low context size?
I'm not sure what that slider does, but if it truncates the cache that would definitely lead to degenerate output since the position embeddings for cached entries would be wrong. But, looking at the Transformers wrapper they added I think it's just an issue with how the cache is being passed around. It has to stay in sync with the sequence for every forward pass.
E.g. if the model generates an EOS token, and their generator doesn't add that to the running sequence, it has to be removed from the cache. Or something similar along those lines. The cache being out of sync is the kind of thing which might leave it working poorly without crashing. But I'd have to install it and run it in a debugger to make sure. Which I will. After doing some other stuff first.
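To put that failure mode in code, here is a rough sketch (not what Kobold actually does, and treating cache.current_seq_len as an assumption, it's the counter the generator manipulates internally):

logits = model.forward(next_token, cache)        # the forward pass adds next_token's keys/values to the cache
if next_token.item() == tokenizer.eos_token_id:
    # If the frontend now drops the token instead of appending it to the running sequence,
    # the cache holds one entry more than the sequence and has to be rolled back too:
    cache.current_seq_len -= 1                   # assumed rollback; otherwise later positions are off by one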
Alright then, thank you for taking a look at it
@turboderp Apologies for this, this should have gone to me directly.
I do use the KoboldAI samplers, here's the code if you're interested. It seems to work the first time or times you generate, but breaks afterwards. I'm not yet sure why. I do call generator.gen_begin(gen_in), which resets the cache as far as I know.
Yes, using the KoboldAI samplers is the obvious choice for integrating into Kobold, so that's great. There's nothing special about the logits, after all. In fact, you should just be able to bypass ExLlamaGenerator altogether and call the forward pass directly.
I'm going to install the 4-bit branch and have a play with it later today. But I don't see anything immediately wrong with how you're using it. gen_begin() should indeed reset the cache (gen_begin_reuse() should work as well, and is much faster in some cases), and you're appending every token produced by the forward pass, so the cache should stay in sync with the sequence.
I'll have a look though. It shouldn't be too hard to spot if the cache and the sequence go out of sync somehow.
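For reference, the pattern I mean by bypassing the generator looks roughly like this. A sketch only: model and tokenizer as set up elsewhere in this thread, forward() assumed to return logits and append the input tokens' keys/values to the cache, and kobold_sample standing in for whatever samplers Kobold already uses (assumed to return a (1, 1) LongTensor):

import torch

max_new_tokens = 200
sequence = tokenizer.encode(prompt)                      # (1, prompt_len)
cache = ExLlamaCache(model)

logits = model.forward(sequence, cache)                  # prefill: cache now covers the whole prompt
for _ in range(max_new_tokens):
    token = kobold_sample(logits[:, -1, :])              # external sampler works directly on the logits
    if token.item() == tokenizer.eos_token_id:
        break                                            # not appended to the sequence, not fed to the model
    sequence = torch.cat((sequence, token), dim = 1)     # sequence and cache grow together,
    logits = model.forward(token, cache)                 # one token per step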
It's not yet that user-friendly to install: you need to clone the branch, run install_requirements.sh, and then install the exllama package into the conda env by running ./commandline.sh followed by pip install git+https://github.com/0cc4m/exllama. Then you can run it with ./play.sh
I can confirm my issue is no longer present after Occam's latest commit to his KoboldAI fork, thank you very much for your help.
But... I didn't fix anything yet.
Me neither. I'm still struggling to get it to load a model. :)
@turboderp Let me know if you need help.
I'm seeing a similar degradation in output quality. It used to match AutoGPTQ output quite closely, but the latest releases seem to be producing different results. I can get back the previous quality by setting ExLlamaConfig.fused_attn = False. Hope this helps chase things down.
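In case it helps anyone else reproduce the comparison, this is the whole workaround as a minimal sketch (placeholder paths); the flag goes on the config before the model is built:

from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig("models/your-model/config.json")         # placeholder path
config.model_path = "models/your-model/model.safetensors"       # placeholder path
config.fused_attn = False                                       # fall back to the regular (non-fused) attention path
model = ExLlama(config)
cache = ExLlamaCache(model)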
Well, it's up and running. I was just using a model that didn't have any gptq_bits key in its config, and I got stuck on why it wasn't being recognized. Kind of a lot going on in aiserver.py. Maybe you should refactor to less than 10k lines? ;) But it's fine now.
I had to skip the call to tpool.execute() in generate(), just calling model.core_generate() directly in order to debug in PyCharm, but I don't see that having any side effects in this case.
I'm just not seeing anything amiss. It's correctly resetting the cache on each pass, then generating one token at a time and the cache grows as it should, staying exactly one token behind the sequence, and there really isn't much else happening.
The output also looks reasonable. Just trying with 7B Llama, but with the storywriter preset it is telling me a very cute little story that doesn't seem to be degenerating with multiple passes. It does the thing that small models like to do where it starts repeating itself, but you can throw it off by adding in "Until suddenly..." or some such, and all that behaves as I'd expect.
If I swap out gen_begin with gen_begin_reuse, it even seems to be correctly reusing the cache and only re-evaluating the prompt from the first changed token, which further shows that it's working. I'm not sure how useful that feature is in Kobold, since you're not truncating the sequence in larger steps, so it would only accelerate things until the context is filled up. And prompt eval is really fast already, so idk.
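If anyone wants to see the reuse behaviour for themselves, a quick way is to time two prompts that share a long prefix, roughly like this (a sketch, assuming the generator and tokenizer objects from the snippets in this thread):

import time

def timed_begin(text):
    ids = tokenizer.encode(text)
    t = time.time()
    generator.gen_begin_reuse(ids)               # only re-evaluates from the first token that differs from the cache
    return time.time() - t

prefix = "Once upon a time, " * 100              # long shared prefix
print(timed_begin(prefix + "a knight"))          # first call evaluates the whole prompt
print(timed_begin(prefix + "a dragon"))          # second call should only re-evaluate the changed tail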
But all in all... I can't find anything wrong at the moment.
The fused attention step is mathematically equivalent to regular attention, but there might be slight differences related to numerical precision. Maybe that matters if some of the sampling methods are extremely sensitive?
It would help if I could reproduce it. Exactly what model and settings are you using to make this happen?
Here's an adjusted snippet of the code, nothing too complicated. llama is a Python class which executes a prompt. I've had the same issue with multiple different models from TheBloke. It might just be a user issue with how I'm using the exllama code. I've set up my code to run with exllama, AutoGPTQ, GPTQ-for-LLaMa, or llama.cpp, so I've been comparing them and noticed this difference/issue.
llama.model_path = "models/Nous-Hermes-13B-GPTQ"
llama.tokenizer_model_path = llama.model_path + "/tokenizer.model"
llama.model_config_path = llama.model_path + "/config.json"
llama.model_safetensors_path = llama.model_path + "/" + [x for x in os.listdir(llama.model_path) if x.endswith('.safetensors')][0]

llama.config = ExLlamaConfig(llama.model_config_path)
llama.config.model_path = llama.model_safetensors_path
# llama.config.fused_attn = False
llama.config.max_seq_len = 2048

llama.model = ExLlama(llama.config)
llama.cache = ExLlamaCache(llama.model)
llama.tokenizer = ExLlamaTokenizer(llama.tokenizer_model_path)
llama.generator = ExLlamaGenerator(llama.model, llama.tokenizer, llama.cache)
llama.generator.settings.token_repetition_penalty_max = 1.2

with torch.no_grad():
    # torch.manual_seed(42)
    llama.generator.end_beam_search()
    ids = llama.generator.tokenizer.encode(prompt)
    # llama.generator.gen_begin(ids)
    llama.generator.gen_begin_reuse(ids)
    for i in range(request.max_tokens):
        token = llama.generator.gen_single_token()
        llama.generator.gen_prune_left
        if token.item() == llama.generator.tokenizer.eos_token_id: break
        for eos_token in stopping_criteria_list:
            if llama.generator.sequence_ends_with(eos_token):
                break
    generated_ids = llama.generator.sequence[0][len(ids[0]):]
    generated_text = llama.generator.tokenizer.decode(generated_ids)
I'll have to try and see if I can reproduce it. One thing that stands out is the call to gen_prune_left(), which I haven't looked at in ages. I think it's buggy when called during a beam search. Otherwise, calling it in the generation loop would continually reset the cache, so performance would suffer a lot. Maybe it's just a copy/paste error?
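Just to be concrete, without the stray gen_prune_left line, and with the stop-string check actually ending the outer loop, I'd expect the loop to look something like this (a sketch of what I assume was intended):

for i in range(request.max_tokens):
    token = llama.generator.gen_single_token()
    if token.item() == llama.generator.tokenizer.eos_token_id:
        break
    # any() keeps the break from being swallowed by an inner for-loop
    if any(llama.generator.sequence_ends_with(eos) for eos in stopping_criteria_list):
        break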
Hermes is a model I haven't tested, though. I have found some finetunes to be strangely sensitive to rounding errors. I'll have to check that one out I guess.
I wrote a quick little script to try and spot any difference in the output between fused and regular attention:
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import torch
import os, glob
torch.set_grad_enabled(False)
torch.cuda._lazy_init()
model_directory = "/mnt/str/models/_test_models/TheBloke_GPT4All-13B-snoozy-GPTQ/"
# model_directory = "/mnt/str/models/llama-13b-4bit-128g/"
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]
# Create config, model, tokenizer, generator
config = ExLlamaConfig(model_config_path)
config.model_path = model_path
model = ExLlama(config)
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(tokenizer_path)
generator = ExLlamaGenerator(model, tokenizer, cache)
generator.disallow_tokens([tokenizer.eos_token_id])
generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.6
generator.settings.top_p = 0.5
# Build a growing prompt
print ("")
print ("------------------- Regular attention --------------------")
print ("")
config.fused_attn = False
prompt = "Once upon a time,"
gen_tokens = 128
torch.manual_seed(69420)
for i in range(5): prompt = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(prompt)
print ("")
print ("------------------- Fused attention --------------------")
print ("")
config.fused_attn = True
prompt = "Once upon a time,"
gen_tokens = 128
torch.manual_seed(69420)
for i in range(5): prompt = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(prompt)
This seems to consistently produce roughly the same output.
Now, I say roughly, but it's important to note that even with a fixed seed the implementation is always ever so slightly non-deterministic, which comes down to floating-point addition being non-associative and CUDA providing no guarantees about the order in which threads are launched. The difference is always small, but it's made a little larger by the use of FP16, where some other implementations use FP32, at least for intermediate results.
It's larger still in the fused attention because at the very end I've optimized away the addition of the residual connection by just doing the last matmul straight on top of the residual state. Mathematically that's the same thing, but it does change the order of additions quite a bit for potentially a more different rounding behavior in the end.
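Just to illustrate what the non-associativity means at FP16 precision, here's a toy example (nothing to do with exllama's kernels specifically):

import torch

a = torch.tensor(2048.0, dtype = torch.float16)
b = torch.tensor(0.75, dtype = torch.float16)
c = torch.tensor(0.75, dtype = torch.float16)

print((a + b) + c)   # 2048.0 -- each 0.75 rounds away, since FP16 values near 2048 are spaced 2.0 apart
print(a + (b + c))   # 2050.0 -- 0.75 + 0.75 = 1.5 is enough to round the sum up to the next representable value

Summing the same numbers in a different order can land on a different value, and with thousands of accumulations per matmul those tiny shifts occasionally flip a near-tie between two candidate tokens.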
Still, the differences are small in any case, and even though the generation happens in multiple steps, I'm just not seeing much divergence. And both are staying coherent, although that Hermes model really likes to write song lyrics for some reason. But it seems equally likely to do that with or without fused attention.
Thanks, I can run the sample code you provided and it works cleanly. So it must be an issue in the code I'm using / how exllama is being called. The code is trying to be general across the various GPTQ implementations, so it might have some cruft causing issues. Will do more testing and see if I can find out why.
Did some further digging. It seems to be related to creating the generator and tokenizer objects inside the "llama" class. When they're created at the top level it works, but when the exllama objects are created inside a class I get the fused-attention difference. Could it be some scoping issue? For the sample below it just generates different creative texts, but when used for instruction following it produces very bad results with fused attention on.
Output:
Injected compiler path: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64
------------------- With Fused Attention == True version --------------------
Once upon a time, the only way to get your hands on new music was by waiting for it to come out or finding an underground tape trading scene. Nowadays you can stream and download songs instantly from anywhere in world with just few clicks of mouse button! The internet has also made sharing information about bands much easier than before – through social media sites like Facebook & Twitter as well blogs that cater specifically towards independent musicians (like this one). This makes discoverability so important because now anyone who wants access to their favorite band’s latest single without having any connection within industry gatekeepers such us record labels A&R people
Injected compiler path: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64
------------------- With Fused Attention == False version --------------------
Once upon a time, the only way to get your hands on new music was by waiting for it to be released and then going out to buy […] Filed Under: Entertainment Tagged With: Apple Music, Beats 1 Radio Station
Code:

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import torch
import os, glob

torch.set_grad_enabled(False)
torch.cuda._lazy_init()

###### Set this to change from / to fused attention
_use_fused_attention = False

print ("")
print ("------------------- With Fused Attention == " + str(_use_fused_attention) + " version --------------------")
print ("")

model_directory = "../../llama/Nous-Hermes-13B-GPTQ"

class llama_ex:

    model_directory: str | None = None
    tokenizer_model_path: str | None = None
    model_config_path: str | None = None
    model_safetensor_path: str | None = None
    n_ctx: int = 2048
    config: ExLlamaConfig | None = None
    model: ExLlama | None = None
    cache: ExLlamaCache | None = None
    tokenizer: ExLlamaTokenizer | None = None
    generator: ExLlamaGenerator | None = None

    def __init__(self, *args, **kwargs):
        for this_param in list(set(dir(self)) & set(kwargs.keys())):
            setattr(self, this_param, kwargs[this_param])
        self.model_tokenizer_path = os.path.join(self.model_directory, "tokenizer.model")
        self.model_config_path = os.path.join(self.model_directory, "config.json")
        self.model_safetensors_path = os.path.join(self.model_directory, [x for x in os.listdir(self.model_directory) if x.endswith('.safetensors')][0])
        self.config = ExLlamaConfig(self.model_config_path)
        self.config.model_path = self.model_safetensors_path
        self.config.fused_attn = _use_fused_attention
        self.model = ExLlama(self.config)
        self.cache = ExLlamaCache(self.model)
        self.tokenizer = ExLlamaTokenizer(self.model_tokenizer_path)
        self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)

llama = llama_ex(model_directory = model_directory)

prompt = "Once upon a time,"
llama.generator.settings.token_repetition_penalty_max = 1.5
llama.generator.settings.temperature = 0.5
llama.generator.settings.top_p = 0.1
llama.generator.settings.top_k = 40
gen_tokens = 128
torch.manual_seed(69420)
generated_text = llama.generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(generated_text)
So does this mean the fix has been rolled into the code, and if so, what files do I replace?
There isn't a fix, no, because I haven't been able to reproduce the problem yet. I'm working on a thorough perplexity test to run with all the different possible code paths, which should highlight if there are any significant differences in how the model evaluates depending on tuning parameters.
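In case anyone wants a rough comparison before that's done, the basic shape of such a test is just average negative log-likelihood over a fixed text for each code path. A sketch only, reusing the ExLlamaCache and tokenizer setup from the scripts above, and assuming the forward pass accepts last_id_only = False to return logits for every position:

import torch
import torch.nn.functional as F

def perplexity(model, ids):
    # ids: (1, n) token ids of some fixed evaluation text
    cache = ExLlamaCache(model)                                   # fresh cache for each evaluation
    logits = model.forward(ids, cache, last_id_only = False)      # logits for every position (assumed keyword)
    logprobs = F.log_softmax(logits.float(), dim = -1)
    targets = ids[0, 1:].unsqueeze(1).to(logprobs.device)         # each position predicts the next token
    nll = -logprobs[0, :-1, :].gather(1, targets).mean()
    return torch.exp(nll).item()

Run it once with config.fused_attn = True and once with False on the same text; a genuinely broken code path should show up as a clearly higher perplexity rather than just different samples.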
I know there are some numerical differences, at least, and it's possible that this divergence is just the result of the model ending up at a "tipping point" and then going down one path or another based on some small shift in the probabilities. But that's not the same as poor output quality, though. There isn't a "correct" choice for any one token. So unless something is actually breaking and resulting in a broken probability distribution, what you really want is to avoid those tipping points in the first place.
I'll know more once these tests are set up. In the meantime you could try the new typical sampling feature, which does seem to produce more consistent results overall.
I will try that when I get a chance to, thank you
For what it's worth, I've noticed output quality issues as well in Kobold, which I assumed was related to the sampling swap. However, I noticed similar issues with ooba's very recent exllama support, which doesn't touch exllama's native sampling.
One revealing thing: I was using Wizard-Vicuna-30B, which uses </s> as part of its prompt format. I noticed that I got "</s>" (as in the literal string, not the EOS token) creeping into the output, which never happened with normal transformers. This suggests that exllama is not interpreting </s> as a special token. If it doesn't check special_tokens_map.json, that would explain some things.
In addition, I had issues with very early/jarring EOS, and contraction fumbling (emitting words like can'm, don've, etc.), which is normally only an issue with GPTQ models that don't use desc_act. Neither happened with regular GPTQ. The early stopping may be a symptom of incorrect interpretation of </s> in the prompt, but I'm not sure if that's plausible for contraction fumbling.
Kobold doesn't use ExLlama's sampling, only logits from the model. Ooba does use the native sampling, though, as well as ExLlama's tokenizer which is just a straight SentencePiece instance reading the model file directly. I'll have to dig into the Transformers tokenizer to see if it does something special.
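A quick way to see the difference is to feed the same string to a raw SentencePiece processor and to the Transformers tokenizer (a sketch with placeholder paths; the exact pieces depend on the model's vocabulary):

import sentencepiece as spm
from transformers import AutoTokenizer

text = "Hello</s>How are you?"

sp = spm.SentencePieceProcessor(model_file = "models/your-model/tokenizer.model")   # placeholder path
print(sp.encode(text, out_type = str))       # "</s>" comes out as ordinary text pieces

hf = AutoTokenizer.from_pretrained("models/your-model")                             # placeholder path
print(hf.tokenize(text))                     # here "</s>" should be recognized as the EOS special token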
The special tokens map shouldn't lead to you seeing "</s>" in the output, especially when the file that defines that string isn't being read. I can look into some ways to take special_tokens_map.json into account, but it's going to be a little tricky when you have models on HF where that file looks like this:
{
    "bos_token": "</s>",
    "eos_token": "</s>",
    "pad_token": "[PAD]",
    "unk_token": "</s>"
}
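If it does turn out to be worth handling, the naive version would be to read the EOS string from special_tokens_map.json and splice the EOS id in wherever the literal string appears in the prompt. A sketch only, ignoring the duplicate-string complication above, and noting that encoding the pieces separately isn't byte-for-byte identical to encoding the whole string:

import json, torch

with open("models/your-model/special_tokens_map.json") as f:     # placeholder path
    eos_str = json.load(f).get("eos_token", "</s>")               # in some models this value is a dict, not a string

def encode_with_eos(tokenizer, text):
    pieces = text.split(eos_str)
    ids = []
    for i, piece in enumerate(pieces):
        if piece != "": ids += tokenizer.encode(piece).flatten().tolist()
        if i < len(pieces) - 1: ids.append(tokenizer.eos_token_id)
    return torch.tensor([ids], dtype = torch.long)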
The contractions are interesting, at least. Seems too oddly specific to not be a tokenizer issue, but I'm not sure what to make of it. I'll try to see if I can reproduce it. Have you seen it in Kobold too or just in Ooba?
Seen it in both, but it's happening constantly in Ooba, every other reply. It's very weird. It manifests in a few ways: just forgetting to finish (doesn'), finishing with a weird token (can'the), or cutting off the whole generation at a contraction (doesn'<EOS>). Again, the only other time I saw this was with 128g CUDA models without act-order, but it was rarer, and I assume it was just quantization error. Fascinating that issues can manifest like this.
For the emitting </s> issue, I suspect this is happening because it's interpreted as a normal string (in the prompt from the chat history), causing the model to assume it should end generations with it, whereas GPTQ parses it as EOS. This might be causing some issues, but I tried removing the EOS tokens from the prompt entirely, and the contraction glitches are still there. Weird.
I hear you on the weird model configs, although for models that expect EOS in the prompt (like those trained on vicuna 1.1 formats) I should hope they didn't do that.
For what it's worth, the initial exllama branch in Kobold (which was very early, before most of your optimizations, or even support for non-groupsize models) didn't have any generation bugs at all that I could detect.