exllama
Very poor output quality
I have noticed that while it massively increases inference speed, it also massively decreases output quality: instruct models become very obstinate and give completely irrelevant responses, words become misspelled, it repeats lines over and over, and it sometimes spams Chinese characters.
I haven't seen this at all. What model are you using? And what settings?
Tried it on Chronos 13B, WizardLM 13B, and Pygmalion 7B, with temperatures between 0.5 and 1 and a context length of 2048. Lower temperatures do seem to wrangle it into behaving a little better, but I have to lower the temperature so much that the output is too "dry" to be useful. However, using the same settings and models on normal GPTQ yields satisfactory results (albeit at unsatisfactory speed).
And just to be clear, is this in ExLlama's web UI or in Ooba?
Occam's fork of KoboldAI that allows using exllama.
Using GPTQ, said fork behaves normally.
Not OP, but for context, the Kobold fork is here if you want to check it, turbo.
https://github.com/0cc4m/KoboldAI/tree/4bit-plugin (KoboldAI implementation to support GPTQ and exllama)
https://github.com/0cc4m/exllama (exllama fork on transformers branch, which builds exllama to work on Kobold)
They added Kobold samplers to that exllama fork.
So it seems these samplers are added
(I'm not sure about rep pen slope though)
Okay. I really have enough work cut out for me with this, but I guess I should try installing Kobold at some point to see how they're using it. I would assume they're just taking the logits and passing them to the same samplers they use for other models, and that should just work. But there are some peculiarities to keep in mind, specifically regarding the cache, and that "context tokens" slider looks a little suspect. But idk.
You think maybe the code wasn't hooked up to the context correctly and it's actually running with an incredibly low context size?
I'm not sure what that slider does, but if it truncates the cache that would definitely lead to degenerate output since the position embeddings for cached entries would be wrong. But, looking at the Transformers wrapper they added I think it's just an issue with how the cache is being passed around. It has to stay in sync with the sequence for every forward pass.
E.g. if the model generates an EOS token, and their generator doesn't add that to the running sequence, it has to be removed from the cache. Or something similar along those lines. The cache being out of sync is the kind of thing which might leave it working poorly without crashing. But I'd have to install it and run it in a debugger to make sure. Which I will. After doing some other stuff first.
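To put that failure mode in code, here is a rough sketch (not what Kobold actually does, and treating cache.current_seq_len as an assumption, it's the counter the generator manipulates internally):

logits = model.forward(next_token, cache)        # the forward pass adds next_token's keys/values to the cache
if next_token.item() == tokenizer.eos_token_id:
    # If the frontend now drops the token instead of appending it to the running sequence,
    # the cache holds one entry more than the sequence and has to be rolled back too:
    cache.current_seq_len -= 1                   # assumed rollback; otherwise later positions are off by one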
Alright then, thank you for taking a look at it
@turboderp Apologies for this, this should have gone to me directly.
I do use the KoboldAI samplers, here's the code if you're interested. It seems to work the first time or times you generate, but breaks afterwards. I'm not yet sure why. I do call generator.gen_begin(gen_in), which resets the cache as far as I know.
Yes, using the KoboldAI samplers is the obvious choice for integrating into Kobold, so that's great. There's nothing special about the logits, after all. In fact, you should just be able to bypass ExLlamaGenerator altogether and call the forward pass directly.
I'm going to install the 4-bit branch and have a play with it later today. But I don't see anything immediately wrong with how you're using it. gen_begin() should indeed reset the cache (gen_begin_reuse() should work as well, and is much faster in some cases), and you're appending every token produced by the forward pass, so the cache should stay in sync with the sequence.
I'll have a look though. It shouldn't be too hard to spot if the cache and the sequence go out of sync somehow.
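For reference, the pattern I mean by bypassing the generator looks roughly like this. A sketch only: model and tokenizer as set up elsewhere in this thread, forward() assumed to return logits and append the input tokens' keys/values to the cache, and kobold_sample standing in for whatever samplers Kobold already uses (assumed to return a (1, 1) LongTensor):

import torch

max_new_tokens = 200
sequence = tokenizer.encode(prompt)                      # (1, prompt_len)
cache = ExLlamaCache(model)

logits = model.forward(sequence, cache)                  # prefill: cache now covers the whole prompt
for _ in range(max_new_tokens):
    token = kobold_sample(logits[:, -1, :])              # external sampler works directly on the logits
    if token.item() == tokenizer.eos_token_id:
        break                                            # not appended to the sequence, not fed to the model
    sequence = torch.cat((sequence, token), dim = 1)     # sequence and cache grow together,
    logits = model.forward(token, cache)                 # one token per step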
It's not yet that user-friendly to install: you need to clone the branch, run install_requirements.sh, and then install the exllama package into the conda env by running ./commandline.sh followed by pip install git+https://github.com/0cc4m/exllama. Then you can run it with ./play.sh
I can confirm my issue is no longer present after Occam's latest commit to his KoboldAI fork, thank you very much for your help.
But... I didn't fix anything yet.
Me neither. I'm still struggling to get it to load a model. :)
@turboderp Let me know if you need help.
I'm seeing a similar degradation in output quality. It used to match AutoGPTQ output quite closely, but the latest releases seem to be producing different results. I can get back the previous quality by setting ExLlamaConfig.fused_attn = False. Hope this helps chase things down.
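In case it helps anyone else reproduce the comparison, this is the whole workaround as a minimal sketch (placeholder paths); the flag goes on the config before the model is built:

from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig("models/your-model/config.json")         # placeholder path
config.model_path = "models/your-model/model.safetensors"       # placeholder path
config.fused_attn = False                                       # fall back to the regular (non-fused) attention path
model = ExLlama(config)
cache = ExLlamaCache(model)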
Well, it's up and running. I was just using a model that didn't have any gptq_bits key in its config, and I got stuck on why it wasn't being recognized. Kind of a lot going on in aiserver.py. Maybe you should refactor to less than 10k lines? ;) But it's fine now.
I had to skip the call to tpool.execute() in generate(), just calling model.core_generate() directly in order to debug in PyCharm, but I don't see that having any side effects in this case.
I'm just not seeing anything amiss. It's correctly resetting the cache on each pass, then generating one token at a time and the cache grows as it should, staying exactly one token behind the sequence, and there really isn't much else happening.
The output also looks reasonable. Just trying with 7B Llama, but with the storywriter preset it is telling me a very cute little story that doesn't seem to be degenerating with multiple passes. It does the thing that small models like to do where it starts repeating itself, but you can throw it off by adding in "Until suddenly..." or some such, and all that behaves as I'd expect.
If I swap out gen_begin with gen_begin_reuse, it even seems to be correctly reusing the cache and only re-evaluating the prompt from the first changed token, which further shows that it's working. I'm not sure how useful that feature is in Kobold, since you're not truncating the sequence in larger steps, so it would only accelerate things until the context is filled up. And prompt eval is really fast already, so idk.
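If anyone wants to see the reuse behaviour for themselves, a quick way is to time two prompts that share a long prefix, roughly like this (a sketch, assuming the generator and tokenizer objects from the snippets in this thread):

import time

def timed_begin(text):
    ids = tokenizer.encode(text)
    t = time.time()
    generator.gen_begin_reuse(ids)               # only re-evaluates from the first token that differs from the cache
    return time.time() - t

prefix = "Once upon a time, " * 100              # long shared prefix
print(timed_begin(prefix + "a knight"))          # first call evaluates the whole prompt
print(timed_begin(prefix + "a dragon"))          # second call should only re-evaluate the changed tail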
But all in all... I can't find anything wrong at the moment.
The fused attention step is mathematically equivalent to regular attention, but there might be slight differences related to numerical precision. Maybe that matters if some of the sampling methods are extremely sensitive?
It would help if I could reproduce it. Exactly what model and settings are you using to make this happen?
Here's an adjusted snippet of the code, nothing too complicated. llama is a Python class which executes a prompt. I've had the same issue with multiple different models from TheBloke. It might just be a user issue with how I'm using the exllama code. I've set up my code to run with exllama, AutoGPTQ, GPTQ-for-LLaMa, or llama.cpp, so I've been comparing them and noticed this difference/issue.
llama.model_path = "models/Nous-Hermes-13B-GPTQ"
llama.tokenizer_model_path = llama.model_path + "/tokenizer.model"
llama.model_config_path = llama.model_path + "/config.json"
llama.model_safetensors_path = llama.model_path + "/" + [x for x in os.listdir(llama.model_path) if x.endswith('.safetensors')][0]

llama.config = ExLlamaConfig(llama.model_config_path)
llama.config.model_path = llama.model_safetensors_path
# llama.config.fused_attn = False
llama.config.max_seq_len = 2048

llama.model = ExLlama(llama.config)
llama.cache = ExLlamaCache(llama.model)
llama.tokenizer = ExLlamaTokenizer(llama.tokenizer_model_path)
llama.generator = ExLlamaGenerator(llama.model, llama.tokenizer, llama.cache)
llama.generator.settings.token_repetition_penalty_max = 1.2

with torch.no_grad():
    # torch.manual_seed(42)
    llama.generator.end_beam_search()
    ids = llama.generator.tokenizer.encode(prompt)
    # llama.generator.gen_begin(ids)
    llama.generator.gen_begin_reuse(ids)
    for i in range(request.max_tokens):
        token = llama.generator.gen_single_token()
        llama.generator.gen_prune_left
        if token.item() == llama.generator.tokenizer.eos_token_id: break
        for eos_token in stopping_criteria_list:
            if llama.generator.sequence_ends_with(eos_token):
                break
    generated_ids = llama.generator.sequence[0][len(ids[0]):]
    generated_text = llama.generator.tokenizer.decode(generated_ids)
I'll have to try and see if I can reproduce it. One thing that stands out is the call to gen_prune_left(), which I haven't looked at in ages. I think it's buggy when called during a beam search. Otherwise, calling it in the generation loop would continually reset the cache, so performance would suffer a lot. Maybe it's just a copy/paste error?
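Just to be concrete, without the stray gen_prune_left line, and with the stop-string check actually ending the outer loop, I'd expect the loop to look something like this (a sketch of what I assume was intended):

for i in range(request.max_tokens):
    token = llama.generator.gen_single_token()
    if token.item() == llama.generator.tokenizer.eos_token_id:
        break
    # any() keeps the break from being swallowed by an inner for-loop
    if any(llama.generator.sequence_ends_with(eos) for eos in stopping_criteria_list):
        break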
Hermes is a model I haven't tested, though. I have found some finetunes to be strangely sensitive to rounding errors. I'll have to check that one out I guess.
I wrote a quick little script to try and spot any difference in the output between fused and regular attention:
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import torch
import os, glob
torch.set_grad_enabled(False)
torch.cuda._lazy_init()
model_directory = "/mnt/str/models/_test_models/TheBloke_GPT4All-13B-snoozy-GPTQ/"
# model_directory = "/mnt/str/models/llama-13b-4bit-128g/"
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]
# Create config, model, tokenizer, generator
config = ExLlamaConfig(model_config_path)
config.model_path = model_path
model = ExLlama(config)
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(tokenizer_path)
generator = ExLlamaGenerator(model, tokenizer, cache)
generator.disallow_tokens([tokenizer.eos_token_id])
generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.6
generator.settings.top_p = 0.5
# Build a growing prompt
print ("")
print ("------------------- Regular attention --------------------")
print ("")
config.fused_attn = False
prompt = "Once upon a time,"
gen_tokens = 128
torch.manual_seed(69420)
for i in range(5): prompt = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(prompt)
print ("")
print ("------------------- Fused attention --------------------")
print ("")
config.fused_attn = True
prompt = "Once upon a time,"
gen_tokens = 128
torch.manual_seed(69420)
for i in range(5): prompt = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(prompt)
This seems to consistently produce roughly the same output.
Now, I say roughly, but it's important to note that even with a fixed seed the implementation is always ever so slightly non-deterministic, which comes down to floating-point addition being non-associative and CUDA providing no guarantees about the order in which threads are launched. The difference is always small, but it's made a little larger by the use of FP16, where some other implementations use FP32, at least for intermediate results.
It's larger still in the fused attention because at the very end I've optimized away the addition of the residual connection by just doing the last matmul straight on top of the residual state. Mathematically that's the same thing, but it does change the order of additions quite a bit for potentially a more different rounding behavior in the end.
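Just to illustrate what the non-associativity means at FP16 precision, here's a toy example (nothing to do with exllama's kernels specifically):

import torch

a = torch.tensor(2048.0, dtype = torch.float16)
b = torch.tensor(0.75, dtype = torch.float16)
c = torch.tensor(0.75, dtype = torch.float16)

print((a + b) + c)   # 2048.0 -- each 0.75 rounds away, since FP16 values near 2048 are spaced 2.0 apart
print(a + (b + c))   # 2050.0 -- 0.75 + 0.75 = 1.5 is enough to round the sum up to the next representable value

Summing the same numbers in a different order can land on a different value, and with thousands of accumulations per matmul those tiny shifts occasionally flip a near-tie between two candidate tokens.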
Still, the differences are small in any case, and even though the generation happens in multiple steps, I'm just not seeing much divergence. And both are staying coherent, although that Hermes model really likes to write song lyrics for some reason. But it seems equally likely to do that with or without fused attention.
Thanks, I can run the sample code you provided and it works cleanly. So it must be an issue in the code I'm using / how exllama is being called. The code is trying to be general across the various GPTQ implementations, so it might have some cruft causing issues. Will do more testing and see if I can find out why.
Did some further digging. It seems to be related to creating the generator and tokenizer objects inside the "llama" class. When they're created at the top level it works, but when the exllama objects are created inside a class I get the fused-attention difference. Could it be some scoping issue? For the sample below it just generates different creative texts, but when used for instruction following it produces very bad results with fused attention on.
Output:
Injected compiler path: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64
------------------- With Fused Attention == True version --------------------
Once upon a time, the only way to get your hands on new music was by waiting for it to come out or finding an underground tape trading scene. Nowadays you can stream and download songs instantly from anywhere in world with just few clicks of mouse button! The internet has also made sharing information about bands much easier than before – through social media sites like Facebook & Twitter as well blogs that cater specifically towards independent musicians (like this one). This makes discoverability so important because now anyone who wants access to their favorite band’s latest single without having any connection within industry gatekeepers such us record labels A&R people
Injected compiler path: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64
------------------- With Fused Attention == False version --------------------
Once upon a time, the only way to get your hands on new music was by waiting for it to be released and then going out to buy […] Filed Under: Entertainment Tagged With: Apple Music, Beats 1 Radio Station
Code:

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import torch
import os, glob

torch.set_grad_enabled(False)
torch.cuda._lazy_init()

###### Set this to change from / to fused attention
_use_fused_attention = False

print ("")
print ("------------------- With Fused Attention == " + str(_use_fused_attention) + " version --------------------")
print ("")

model_directory = "../../llama/Nous-Hermes-13B-GPTQ"

class llama_ex:

    model_directory: str | None = None
    tokenizer_model_path: str | None = None
    model_config_path: str | None = None
    model_safetensor_path: str | None = None
    n_ctx: int = 2048
    config: ExLlamaConfig | None = None
    model: ExLlama | None = None
    cache: ExLlamaCache | None = None
    tokenizer: ExLlamaTokenizer | None = None
    generator: ExLlamaGenerator | None = None

    def __init__(self, *args, **kwargs):
        for this_param in list(set(dir(self)) & set(kwargs.keys())):
            setattr(self, this_param, kwargs[this_param])
        self.model_tokenizer_path = os.path.join(self.model_directory, "tokenizer.model")
        self.model_config_path = os.path.join(self.model_directory, "config.json")
        self.model_safetensors_path = os.path.join(self.model_directory, [x for x in os.listdir(self.model_directory) if x.endswith('.safetensors')][0])
        self.config = ExLlamaConfig(self.model_config_path)
        self.config.model_path = self.model_safetensors_path
        self.config.fused_attn = _use_fused_attention
        self.model = ExLlama(self.config)
        self.cache = ExLlamaCache(self.model)
        self.tokenizer = ExLlamaTokenizer(self.model_tokenizer_path)
        self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)

llama = llama_ex(model_directory = model_directory)

prompt = "Once upon a time,"
llama.generator.settings.token_repetition_penalty_max = 1.5
llama.generator.settings.temperature = 0.5
llama.generator.settings.top_p = 0.1
llama.generator.settings.top_k = 40
gen_tokens = 128
torch.manual_seed(69420)
generated_text = llama.generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(generated_text)
So does this mean the fix has been rolled into the code, and if so, what files do I replace?
There isn't a fix, no, because I haven't been able to reproduce the problem yet. I'm working on a thorough perplexity test to run with all the different possible code paths, which should highlight if there are any significant differences in how the model evaluates depending on tuning parameters.
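In case anyone wants a rough comparison before that's done, the basic shape of such a test is just average negative log-likelihood over a fixed text for each code path. A sketch only, reusing the ExLlamaCache and tokenizer setup from the scripts above, and assuming the forward pass accepts last_id_only = False to return logits for every position:

import torch
import torch.nn.functional as F

def perplexity(model, ids):
    # ids: (1, n) token ids of some fixed evaluation text
    cache = ExLlamaCache(model)                                   # fresh cache for each evaluation
    logits = model.forward(ids, cache, last_id_only = False)      # logits for every position (assumed keyword)
    logprobs = F.log_softmax(logits.float(), dim = -1)
    targets = ids[0, 1:].unsqueeze(1).to(logprobs.device)         # each position predicts the next token
    nll = -logprobs[0, :-1, :].gather(1, targets).mean()
    return torch.exp(nll).item()

Run it once with config.fused_attn = True and once with False on the same text; a genuinely broken code path should show up as a clearly higher perplexity rather than just different samples.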
I know there are some numerical differences, at least, and it's possible that this divergence is just the result of the model ending up at a "tipping point" and then going down one path or another based on some small shift in the probabilities. But that's not the same as poor output quality, though. There isn't a "correct" choice for any one token. So unless something is actually breaking and resulting in a broken probability distribution, what you really want is to avoid those tipping points in the first place.
I'll know more once these tests are set up. In the meantime you could try the new typical sampling feature, which does seem to produce more consistent results overall.
I will try that when I get a chance to, thank you
For what it's worth, I've noticed output quality issues as well in Kobold, which I assumed was related to the sampling swap. However, I noticed similar issues with ooba's very recent exllama support, which doesn't touch exllama's native sampling.
One revealing thing: I was using Wizard-Vicuna-30B, which uses </s> as part of its prompt format. I noticed that I got "</s>" (as in the literal string, not the EOS token) creeping into the output, which never happened with normal transformers. This suggests that exllama is not interpreting </s> as a special token. If it doesn't check special_tokens_map.json, that would explain some things.
In addition, I had issues with very early/jarring EOS, and contraction fumbling (emitting words like can'm, don've, etc.), which is normally only an issue with GPTQ models that don't use desc_act. Neither happened with regular GPTQ. The early stopping may be a symptom of incorrect interpretation of </s> in the prompt, but I'm not sure if that's plausible for contraction fumbling.
Kobold doesn't use ExLlama's sampling, only logits from the model. Ooba does use the native sampling, though, as well as ExLlama's tokenizer which is just a straight SentencePiece instance reading the model file directly. I'll have to dig into the Transformers tokenizer to see if it does something special.
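A quick way to see the difference is to feed the same string to a raw SentencePiece processor and to the Transformers tokenizer (a sketch with placeholder paths; the exact pieces depend on the model's vocabulary):

import sentencepiece as spm
from transformers import AutoTokenizer

text = "Hello</s>How are you?"

sp = spm.SentencePieceProcessor(model_file = "models/your-model/tokenizer.model")   # placeholder path
print(sp.encode(text, out_type = str))       # "</s>" comes out as ordinary text pieces

hf = AutoTokenizer.from_pretrained("models/your-model")                             # placeholder path
print(hf.tokenize(text))                     # here "</s>" should be recognized as the EOS special token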
The special tokens map shouldn't lead to you seeing "</s>" in the output, especially when the file that defines that string isn't being read. I can look into some ways to take special_tokens_map.json into account, but it's going to be a little tricky when you have models on HF where that file looks like this:
{
    "bos_token": "</s>",
    "eos_token": "</s>",
    "pad_token": "[PAD]",
    "unk_token": "</s>"
}
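If it does turn out to be worth handling, the naive version would be to read the EOS string from special_tokens_map.json and splice the EOS id in wherever the literal string appears in the prompt. A sketch only, ignoring the duplicate-string complication above, and noting that encoding the pieces separately isn't byte-for-byte identical to encoding the whole string:

import json, torch

with open("models/your-model/special_tokens_map.json") as f:     # placeholder path
    eos_str = json.load(f).get("eos_token", "</s>")               # in some models this value is a dict, not a string

def encode_with_eos(tokenizer, text):
    pieces = text.split(eos_str)
    ids = []
    for i, piece in enumerate(pieces):
        if piece != "": ids += tokenizer.encode(piece).flatten().tolist()
        if i < len(pieces) - 1: ids.append(tokenizer.eos_token_id)
    return torch.tensor([ids], dtype = torch.long)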
The contractions are interesting, at least. Seems too oddly specific to not be a tokenizer issue, but I'm not sure what to make of it. I'll try to see if I can reproduce it. Have you seen it in Kobold too or just in Ooba?
Seen it in both, but it's happening constantly in Ooba, every other reply. It's very weird. It manifests in a few ways: just forgetting to finish (doesn'), finishing with a weird token (can'the), or cutting off the whole generation at a contraction (doesn'<EOS>). Again, the only other time I saw this was with 128g CUDA models without act-order, but it was rarer, and I assume it was just quantization error. Fascinating that issues can manifest like this.
For the emitting </s> issue, I suspect this is happening because it's interpreted as a normal string (in the prompt from the chat history), causing the model to assume it should end generations with it, whereas GPTQ parses it as EOS. This might be causing some issues, but I tried removing the EOS tokens from the prompt entirely, and the contraction glitches are still there. Weird.
I hear you on the weird model configs, although for models that expect EOS in the prompt (like those trained on vicuna 1.1 formats) I should hope they didn't do that.
For what it's worth, the initial exllama branch in Kobold (which was very early, before most of your optimizations, or even support for non-groupsize models) didn't have any generation bugs at all that I could detect.