stop_at doesn't stop generation early when using Exllamav2
Describe the issue as clearly as possible:
When I add a stop_at sequence to outlines.generate.text, it seems to keep generating up to max_tokens behind the scenes and then strip anything after the stop_at sequence, instead of stopping generation early.
Steps/code to reproduce the bug:
import outlines
from huggingface_hub import snapshot_download
import time

model_name = "TheBloke/openchat-3.5-0106-GPTQ"
revision = "gptq-4bit-32g-actorder_True"
model_directory = snapshot_download(repo_id=model_name, revision=revision)

llm = outlines.models.exl2(
    model_directory,
    model_kwargs={
        "device_map": "cuda",
        "attn_implementation": "flash_attention_2",
        "num_experts_per_token": 1,
    },
    device="cuda",
)

prompt = 'You only respond with the word "Yes"<|end_of_turn|>\nGPT4 Correct User: Are you ready?<|end_of_turn|>\nGPT4 Correct Assistant: '
sampler = outlines.samplers.multinomial(temperature=0.1)
generator = outlines.generate.text(llm, sampler)

for max_tokens in [2, 10, 100, 1000]:
    kwargs = {'stop_at': ['<|end_of_turn|>'], 'max_tokens': max_tokens}
    st = time.perf_counter()
    text = generator(prompt, **kwargs)
    print('time', time.perf_counter() - st)
    print('max_tokens', str(max_tokens))
    print('result', text)
    print()
Expected result:
I would expect the reported time to be roughly the same in all four cases, since generation should stop at the stop sequence.
Error message:
time 0.10959948902018368
max_tokens 2
result Yes
time 0.2646058950049337
max_tokens 10
result Yes
time 2.027848708006786
max_tokens 100
result Yes
time 18.766647854994517
max_tokens 1000
result Yes
Outlines/Python version information:
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
outlines==0.0.36
flash-attn==2.3.6
exllamav2==0.0.15
Context for the issue:
I am writing a chat-bot-style project where I don't know in advance how long responses will be, and this bug increases latency significantly.
The issue seems to be related to the SequenceGenerator.is_stop_sequence_found method.
The stop string is getting stripped from the end of the generated_sequences passed into this function, so it can never be detected and the method always returns False (see the simplified sketch below).
This might be because <|end_of_turn|> is a special token added to the model?
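To illustrate the failure mode, here is a simplified stand-in for that check (not the actual outlines implementation): a substring search over the decoded text can only succeed if the stop string survives decoding.

# Simplified stand-in for SequenceGenerator.is_stop_sequence_found, for
# illustration only -- not the actual outlines code.
from typing import List

def is_stop_sequence_found(decoded_sequences: List[str], stop_sequences: List[str]) -> List[bool]:
    # A sequence is considered finished if any stop string appears in its decoded text.
    return [
        any(stop in sequence for stop in stop_sequences)
        for sequence in decoded_sequences
    ]

print(is_stop_sequence_found(["Yes<|end_of_turn|>"], ["<|end_of_turn|>"]))  # [True]
print(is_stop_sequence_found(["Yes"], ["<|end_of_turn|>"]))                 # [False] -> never stops early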
Ok, there is a skip_special_tokens flag set in outlines.models.transformers.TransformerTokenizer.decode, which is probably what strips the special <|end_of_turn|> token:
def decode(self, token_ids: torch.LongTensor) -> List[str]:
    text = self.tokenizer.batch_decode(token_ids, skip_special_tokens=True)
return text
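A quick way to see what that flag does, using this model's tokenizer (sketch only; it assumes <|end_of_turn|> is registered as a special token in this tokenizer, which appears to be the case):

# Sketch: decoding with skip_special_tokens=True drops <|end_of_turn|>, so a
# stop-string check over the decoded text can never match it.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/openchat-3.5-0106-GPTQ")
token_ids = tokenizer("Yes<|end_of_turn|>", add_special_tokens=False)["input_ids"]

kept = tokenizer.decode(token_ids, skip_special_tokens=False)     # special token kept
stripped = tokenizer.decode(token_ids, skip_special_tokens=True)  # special token removed

print("<|end_of_turn|>" in kept)      # True  -> the stop string could be matched
print("<|end_of_turn|>" in stripped)  # False -> the stop string is never found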
I've confirmed that if I change the prompt to:
'You always respond as follows <output> Yes, I am ready. </output> <|end_of_turn|> GPT4 Correct User: Are you ready? <|end_of_turn|> GPT4 Correct Assistant: <output>'
and set the stop string to </output> (a non-special string), it works as expected:
time 0.2600345800165087 max_tokens 2 result Yes,
time 0.1344133550010156 max_tokens 10 result Yes, I am ready. </output>
time 0.13286997901741415 max_tokens 100 result Yes, I am ready. </output>
time 0.13070855499245226 max_tokens 1000 result Yes, I am ready. </output>
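For reference, the only lines that changed in the repro script above are the prompt and the stop string; everything else is identical:

# Changed lines in the repro script: a non-special stop string is used instead
# of the special <|end_of_turn|> token.
prompt = (
    'You always respond as follows <output> Yes, I am ready. </output> '
    '<|end_of_turn|> GPT4 Correct User: Are you ready? <|end_of_turn|> '
    'GPT4 Correct Assistant: <output>'
)
kwargs = {'stop_at': ['</output>'], 'max_tokens': max_tokens}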
So this might be a won't-fix. It's a bit of a gotcha, though.
We did not have chat models in mind when building the library, and so this is not completely surprising. We could change this behavior, which is indeed a bit of a gotcha, but I am not sure yet. It should at least be documented.