stop_at doesn't stop generation early when using Exllamav2
Describe the issue as clearly as possible:
When I add a stop_at sequence to outlines.generate.text, it seems to keep generating up to max_tokens behind the scenes and then strip anything after the stop_at sequence, instead of stopping generation early.
Steps/code to reproduce the bug:
import outlines
from huggingface_hub import snapshot_download
import time

model_name = "TheBloke/openchat-3.5-0106-GPTQ"
revision = "gptq-4bit-32g-actorder_True"
model_directory = snapshot_download(repo_id=model_name, revision=revision)

llm = outlines.models.exl2(
    model_directory,
    model_kwargs={
        "device_map": "cuda",
        "attn_implementation": "flash_attention_2",
        "num_experts_per_token": 1,
    },
    device="cuda",
)

prompt = 'You only respond with the word "Yes"<|end_of_turn|>\nGPT4 Correct User: Are you ready?<|end_of_turn|>\nGPT4 Correct Assistant: '
sampler = outlines.samplers.multinomial(temperature=0.1)
generator = outlines.generate.text(llm, sampler)

for max_tokens in [2, 10, 100, 1000]:
    kwargs = {'stop_at': ['<|end_of_turn|>'], 'max_tokens': max_tokens}
    st = time.perf_counter()
    text = generator(prompt, **kwargs)
    print('time', time.perf_counter() - st)
    print('max_tokens', str(max_tokens))
    print('result', text)
    print()
Expected result:
I would expect the reported time to be roughly the same in all four cases, since generation should stop at the stop sequence.
Error message:
time 0.10959948902018368
max_tokens 2
result Yes
time 0.2646058950049337
max_tokens 10
result Yes
time 2.027848708006786
max_tokens 100
result Yes
time 18.766647854994517
max_tokens 1000
result Yes
Outlines/Python version information:
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
outlines==0.0.36
flash-attn==2.3.6
exllamav2==0.0.15
Context for the issue:
I am writing a chat-bot-style project where I don't know in advance how long responses will be, and this bug increases latency significantly.
The issue seems to be related to the SequenceGenerator.is_stop_sequence_found method.
The stop string is getting stripped from the end of the generated_sequences passed into this function, so it can never be detected and the method always returns False (see the simplified sketch below).
This might be because <|end_of_turn|> is a special token added to the model?
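To illustrate the failure mode, here is a simplified stand-in for that check (not the actual outlines implementation): a substring search over the decoded text can only succeed if the stop string survives decoding.

# Simplified stand-in for SequenceGenerator.is_stop_sequence_found, for
# illustration only -- not the actual outlines code.
from typing import List

def is_stop_sequence_found(decoded_sequences: List[str], stop_sequences: List[str]) -> List[bool]:
    # A sequence is considered finished if any stop string appears in its decoded text.
    return [
        any(stop in sequence for stop in stop_sequences)
        for sequence in decoded_sequences
    ]

print(is_stop_sequence_found(["Yes<|end_of_turn|>"], ["<|end_of_turn|>"]))  # [True]
print(is_stop_sequence_found(["Yes"], ["<|end_of_turn|>"]))                 # [False] -> never stops early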
Ok, there is a skip_special_tokens flag set in outlines.models.transformers.TransformerTokenizer.decode, which is probably what strips the special <|end_of_turn|> token:
def decode(self, token_ids: torch.LongTensor) -> List[str]:
    text = self.tokenizer.batch_decode(token_ids, skip_special_tokens=True)
return text
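A quick way to see what that flag does, using this model's tokenizer (sketch only; it assumes <|end_of_turn|> is registered as a special token in this tokenizer, which appears to be the case):

# Sketch: decoding with skip_special_tokens=True drops <|end_of_turn|>, so a
# stop-string check over the decoded text can never match it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/openchat-3.5-0106-GPTQ")
token_ids = tokenizer("Yes<|end_of_turn|>", add_special_tokens=False)["input_ids"]

kept = tokenizer.decode(token_ids, skip_special_tokens=False)     # special token kept
stripped = tokenizer.decode(token_ids, skip_special_tokens=True)  # special token removed

print("<|end_of_turn|>" in kept)      # True  -> the stop string could be matched
print("<|end_of_turn|>" in stripped)  # False -> the stop string is never found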
I've confirmed that if I change the prompt to:
'You always respond as follows <output> Yes, I am ready. </output> <|end_of_turn|> GPT4 Correct User: Are you ready? <|end_of_turn|> GPT4 Correct Assistant: <output>'
and set the stop string to </output> (a non-special string), it works as expected:
time 0.2600345800165087 max_tokens 2 result Yes,
time 0.1344133550010156 max_tokens 10 result Yes, I am ready. </output>
time 0.13286997901741415 max_tokens 100 result Yes, I am ready. </output>
time 0.13070855499245226 max_tokens 1000 result Yes, I am ready. </output>
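For reference, the only lines that changed in the repro script above are the prompt and the stop string; everything else is identical:

# Changed lines in the repro script: a non-special stop string is used instead
# of the special <|end_of_turn|> token.
prompt = (
    'You always respond as follows <output> Yes, I am ready. </output> '
    '<|end_of_turn|> GPT4 Correct User: Are you ready? <|end_of_turn|> '
    'GPT4 Correct Assistant: <output>'
)
kwargs = {'stop_at': ['</output>'], 'max_tokens': max_tokens}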
So this might be a won't-fix. It's a bit of a gotcha, though.
We did not have chat models in mind when building the library, and so this is not completely surprising. We could change this behavior, which is indeed a bit of a gotcha, but I am not sure yet. It should at least be documented.