Some inferences take forever to complete
Issue description
The issue was raised by other people on Discord too.
To quote one of them:
I'm running the same query 10 times (with equivalent prompts and output sizes), but some inferences are taking abnormally longer than others.
[their screenshot]
Repro
I made a reproduction code snippet that can run in Google Colab (w/ free T4 GPU):
💻 Code snippet
pip install outlines==0.0.13 transformers datasets optimum auto-gptq accelerate
from outlines import models
from outlines.text.generate import json, continuation
from json import dumps
from time import perf_counter
import torch
prompt = """<|system|>
You are a friendly AI assistant.
You're specialized in mathematics and open source Github repositories.
Your answers must be concise and factual.</s>
<|user|>
Write a very long poem</s>
<|assistant|>
"""
output_format = {
    "type": "object",
    "properties": {
        "poem": {"type": "string"}
    }
}
model = models.transformers("TheBloke/zephyr-7B-beta-GPTQ", device="cuda")
rng = torch.Generator(device="cuda")
rng.manual_seed(789001)
errors = []
for i in range(20):
    start_time = perf_counter()
    try:
        sequence = json(model, dumps(output_format))(prompt, rng=rng)
        poem = sequence.get('poem')
        elapsed_time = round(perf_counter() - start_time)
        # avoid ZeroDivisionError when a run finishes in under half a second
        n_characters_per_second = len(poem) // max(elapsed_time, 1)
        print(f"{i}\t{elapsed_time}\t{n_characters_per_second}\t{poem[:30]}..")
    except Exception as e:
        errors.append(e)
        # recompute here: the value from the try block may be stale or unset
        elapsed_time = round(perf_counter() - start_time)
        print(f"{i}\t{elapsed_time}\tINFERENCE FAILED")
📃 Output (columns: index, elapsed seconds, characters per second, first 30 characters of the poem)
0 14 76 In the vastness of cosmic spac..
1 14 INFERENCE FAILED
2 769 0 In this universe, a vast expan..
3 389 0 In ancient lands, where skies ..
4 16 67 In the depths of the cosmos, w..
5 35 70 In the stillness of the mornin..
6 32 60 In a universe vast and unceasi..
7 13 77 75000 lines of blank verse, hi..
8 22 69 In a land of purest light, Who..
9 34 59 A cosmic dance of stars, a sym..
10 49 68 In the land of the digit, wher..
11 34 78 In a world vast and unknown, ..
12 43 68 There was a time when words we..
13 54 70 In a world where chaos reigns..
14 12 62 Let the words unfurl like the ..
15 330 0 Infinity beckons from the far ..
16 31 60 In the depths of the universe,..
17 137 0 In this vast expanse of time a..
18 32 81 in this universe vast and unfa..
💥 Exceptions raised
import traceback

for error in errors:
    try:
        raise error
    except Exception:
        traceback.print_exc()
Traceback (most recent call last):
File "<ipython-input-6-d8471672a411>", line 5, in <cell line: 3>
raise error
File "<ipython-input-5-1a425bb0404a>", line 8, in <cell line: 5>
sequence = json(model, dumps(output_format))(prompt, rng=rng)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/outlines/text/generate/sequence.py", line 240, in __call__
result = self.postprocess_completions(result)
File "/usr/local/lib/python3.10/dist-packages/outlines/text/generate/regex.py", line 226, in postprocess_completions
return [self.format_fn(result) for result in results]
File "/usr/local/lib/python3.10/dist-packages/outlines/text/generate/regex.py", line 226, in <listcomp>
return [self.format_fn(result) for result in results]
File "/usr/local/lib/python3.10/dist-packages/outlines/text/generate/regex.py", line 397, in <lambda>
format_fn = lambda x: pyjson.loads(x)
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 2 column 570 (char 571)
Results
- ✅ 14 inferences succeeded fast
- ⏰ 5 inferences succeeded but were extremely slow (indices: 2, 3, 15, 17, 19)
- 💥 1 inference failed fast (index: 1)
Outlines/Python version information:
Outlines 0.0.13
Python 3.10.12
Thank you so much for the detailed report! Will come back to you shortly.
These timing results contain significant non-inference setup steps (e.g. json(model, dumps(output_format))).
Yes indeed!
json(model, dumps(output_format)) takes a few seconds to complete and shouldn't be in the for-loop.
But this is not the step that gets "stuck".
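To separate the two phases, a minimal sketch (reusing the model, prompt, and rng from the snippet above) times the one-off generator construction and the generation itself independently:

from time import perf_counter

t0 = perf_counter()
generate_poem = json(model, dumps(output_format))  # one-off regex/FSM compilation
t1 = perf_counter()
sequence = generate_poem(prompt, rng=rng)          # constrained token generation
t2 = perf_counter()
print(f"compile: {t1 - t0:.1f}s\tgenerate: {t2 - t1:.1f}s")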
It would still be nice to have results with that call moved out of the loop, and to use cProfile to understand which step gets "stuck". To get to comparable experimental conditions I would also use the maxLength field constraint, as in the sketch below.
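For example, a sketch with cProfile and a maxLength of 2000 (the exact bound is an illustrative assumption, not from the thread):

import cProfile
from json import dumps

output_format = {
    "type": "object",
    "properties": {
        # maxLength bounds the poem so runs have comparable output sizes
        "poem": {"type": "string", "maxLength": 2000},
    },
}

generate_poem = json(model, dumps(output_format))  # built once, outside any loop

profiler = cProfile.Profile()
with profiler:
    sequence = generate_poem(prompt, rng=rng)
profiler.print_stats(sort="cumtime")  # cumulative times reveal the stuck step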
Please try
from pydantic import BaseModel

class OutputModel(BaseModel):
    poem: str
And pass OutputModel instead of output_format. This schema ensures the 'required': ['poem'] attribute is included, so you don't get any generations missing the poem key.
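For reference, the schema Pydantic generates does include that key (model_json_schema is the Pydantic v2 method; on v1 use OutputModel.schema()):

print(OutputModel.model_json_schema())
# {'properties': {'poem': {'title': 'Poem', 'type': 'string'}},
#  'required': ['poem'], 'title': 'OutputModel', 'type': 'object'}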
Additionally, you will need to set whitespace_pattern as explained here https://github.com/outlines-dev/outlines/issues/690#issuecomment-2102291934
json(model, dumps(output_format), whitespace_pattern=r"[ ]?")...
With these changes your script works for me and doesn't have any slow or failed inference.
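For completeness, a sketch of the revised loop with both changes applied, assuming a recent Outlines release (whitespace_pattern does not exist in 0.0.13) where outlines.generate.json accepts a Pydantic model and returns a validated instance:

from time import perf_counter
from pydantic import BaseModel
import torch
from outlines import models, generate

class OutputModel(BaseModel):
    poem: str

model = models.transformers("TheBloke/zephyr-7B-beta-GPTQ", device="cuda")
rng = torch.Generator(device="cuda")
rng.manual_seed(789001)

# Built once, outside the loop; [ ]? keeps inter-token whitespace bounded
generator = generate.json(model, OutputModel, whitespace_pattern=r"[ ]?")

for i in range(20):
    # `prompt` is the same chat-formatted string as in the repro above
    start = perf_counter()
    result = generator(prompt, rng=rng)  # returns an OutputModel instance
    print(f"{i}\t{round(perf_counter() - start)}\t{result.poem[:30]}..")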