lm-format-enforcer
Performance issue
Using lm-format-enforcer makes generating take 3x longer for me, am I doing something wrong? With: Took 67.46 seconds Without: Took 22.94 seconds
import time

from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn

# model, tokenizer and get_prompt_whole are defined elsewhere in the script

class TweetGenFormat(BaseModel):
    tweet: str
    tweet_quality_reasoning: str
    rating_1_to_9: float
    sounds_awkward: bool
    is_about_ai: bool
    is_about_climate_change: bool
    list_of_topics: list
    # when_to_post: datetime
    # tweet_url: HttpUrl
    # tweet_url: str
def generate_tweets():
    prompt = 'Generate a great tweet. Respond with a json object.'
    prompt_whole = get_prompt_whole(prompt)
    parser = JsonSchemaParser(TweetGenFormat.model_json_schema())
    prefix_function = build_transformers_prefix_allowed_tokens_fn(tokenizer, parser)
    batch_size = 16
    tokens = tokenizer(
        # prompt_template,
        [prompt_whole] * batch_size,
        return_tensors='pt'
    ).input_ids.cuda()

    # Constrained generation (with the lm-format-enforcer prefix function)
    start = time.time()
    generation_output = model.generate(
        tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        max_new_tokens=2**10,
        prefix_allowed_tokens_fn=prefix_function,
    )
    print(f'Took {time.time() - start:.2f} seconds')

    # Unconstrained generation (prefix function disabled)
    start = time.time()
    generation_output = model.generate(
        tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        max_new_tokens=2**10,
        # prefix_allowed_tokens_fn=prefix_function,
    )
    print(f'Took {time.time() - start:.2f} seconds')
$ nvidia-smi
Fri Dec 8 17:41:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08             Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:00:10.0 Off |                    0 |
| N/A   23C    P0              34W / 250W |  39322MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 00000000:00:11.0 Off |                    0 |
| N/A   24C    P0              39W / 250W |  37567MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
Have you tried Guidance, Guardrails, or Outlines? I'm just curious how those compare in terms of time.
There are a few possible reasons for this type of behavior. One thing to check: is the length of the result the same with and without the enforcer? Tokens per second is the correct apples-to-apples comparison metric. Can you print that as well?
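For example, something along these lines would give a rough number (a sketch that reuses the model, tokens and prefix_function variables from the snippet above; padded positions after an early EOS are counted as generated, so it is only an approximation):

start = time.time()
generation_output = model.generate(
    tokens,
    max_new_tokens=2**10,
    prefix_allowed_tokens_fn=prefix_function,
)
elapsed = time.time() - start

# Newly generated tokens, summed over the batch. Sequences that hit EOS early are
# padded to the longest one, so this slightly overestimates the true count.
prompt_len = tokens.shape[1]
new_tokens = (generation_output.shape[1] - prompt_len) * generation_output.shape[0]
print(f'{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s')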
@mhillebrand None of the other libraries I tried supported batch processing. Using the HF pipeline seems pretty slow for some reason; the time scales with batch size.
Pipeline - Batch size: 1, time: 6.89s
Generate (schema) - Batch size: 1, time: 6.28s
Generate (no schema) - Batch size: 1, time: 6.01s
--------------------------------------------------------------------------------
Pipeline - Batch size: 2, time: 13.03s
Generate (schema) - Batch size: 2, time: 7.00s
Generate (no schema) - Batch size: 2, time: 6.82s
--------------------------------------------------------------------------------
Pipeline - Batch size: 4, time: 25.84s
Generate (schema) - Batch size: 4, time: 8.33s
Generate (no schema) - Batch size: 4, time: 9.24s
--------------------------------------------------------------------------------
Pipeline - Batch size: 8, time: 51.42s
Generate (schema) - Batch size: 8, time: 12.22s
Generate (no schema) - Batch size: 8, time: 14.00s
--------------------------------------------------------------------------------
Generate (schema) - Batch size: 16, time: 12.87s
Generate (no schema) - Batch size: 16, time: 15.06s
--------------------------------------------------------------------------------
Generate (schema) - Batch size: 32, time: 15.15s
Generate (no schema) - Batch size: 32, time: 17.66s
--------------------------------------------------------------------------------
Generate (schema) - Batch size: 64, time: 18.58s
Generate (no schema) - Batch size: 64, time: 17.24s
--------------------------------------------------------------------------------
Generate (schema) - Batch size: 128, time: 25.47s
Generate (no schema) - Batch size: 128, time: 23.68s
--------------------------------------------------------------------------------
Generate (schema) - Batch size: 256, time: 35.83s
Generate (no schema) - Batch size: 256, time: 31.43s
Code:
for i in range(100):
    batch_size = 2 ** i
    prompts = [prompt] * batch_size

    # Pipeline (only up to batch size 8, since it gets slow)
    if batch_size <= 8:
        start = time.time()
        output_dict = hf_pipeline(prompts, prefix_allowed_tokens_fn=prefix_function,
                                  max_new_tokens=2**6,
                                  )
        # print("output_dict:", output_dict)
        print(f'Pipeline - Batch size: {batch_size}, time: {time.time() - start:.2f}s')

    # model.generate with the format enforcer prefix function
    start = time.time()
    tokens = tokenizer(
        # prompt_template,
        [prompt] * batch_size,
        return_tensors='pt'
    ).input_ids.cuda()
    generation_output = model.generate(
        tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        max_new_tokens=2**6,
        prefix_allowed_tokens_fn=prefix_function,
        # prefix_allowed_tokens_fn=custom_prefix_function,
    )
    # print("generation_output:", generation_output)
    output_shape = generation_output.shape
    # print("output_shape:", output_shape)
    decoded_output = tokenizer.decode(generation_output[0], skip_special_tokens=False)
    # print("decoded_output:", decoded_output)
    print(f'Generate (schema) - Batch size: {batch_size}, time: {time.time() - start:.2f}s')

    # model.generate without the prefix function
    start = time.time()
    tokens = tokenizer(
        # prompt_template,
        [prompt] * batch_size,
        return_tensors='pt'
    ).input_ids.cuda()
    generation_output = model.generate(
        tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        max_new_tokens=2**6,
        # prefix_allowed_tokens_fn=prefix_function,
        # prefix_allowed_tokens_fn=custom_prefix_function,
    )
    # print("generation_output:", generation_output)
    print(f'Generate (no schema) - Batch size: {batch_size}, time: {time.time() - start:.2f}s')
    print('-' * 80)
Hi, from the last performance table, it looks like schema and no-schema take roughly the same amount of time. So it doesn't look like the format enforcer is a major performance hurdle (again, I would print time-per-token, not just total time, as result length may differ). Do you interpret the results differently?
Generation in the above case gets cut off after max_new_tokens (2**6 = 64 tokens), so it should generate roughly the same number of tokens across the different generation modes.
You are right that there doesn't seem to be a difference between the model.generate calls, but batching doesn't seem to work when using it with the pipeline. For the last example with batch size 256, that would likely mean roughly 50x slower generation through the pipeline. If that is expected or irrelevant, I don't mind if we close the issue.
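One possible explanation (an assumption, not something verified here): transformers pipelines process inputs one at a time unless a batch_size is passed explicitly, which would account for the near-linear scaling. A sketch of what that could look like, reusing hf_pipeline, prompts and prefix_function from the benchmark code above:

# Decoder-only models typically need a pad token for batched generation.
hf_pipeline.tokenizer.pad_token_id = hf_pipeline.tokenizer.eos_token_id

output = hf_pipeline(
    prompts,
    batch_size=len(prompts),  # without this, the pipeline runs the prompts sequentially
    prefix_allowed_tokens_fn=prefix_function,
    max_new_tokens=2**6,
)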
@noamgat have you considered using Cython to enhance performance?
I've been experimenting with refactoring some of the utility parsers (like Sequence and Union) to create a parser for a DSL, and the performance improvement is significant, even with my basic C/C++ skills.
The main downside is managing the build script, but this has been relatively minor. Below is an example of how I structured it (implementation details omitted):
# distutils: language = c++
# (compiling as C++ is needed for libcpp.set)
from libcpp.set cimport set as cset


cdef class CharacterLevelParser:
    def __init__(self):
        pass

    def __cinit__(self):
        pass

    cpdef CharacterLevelParser _clone(self):
        """
        Clone the current parser.

        Returns:
            CharacterLevelParser: A clone of the current parser.
        """
        pass

    def add_character(self, new_character: str) -> CharacterLevelParser:
        new_char = ord(new_character[0])
        return self._add_character(new_char)

    cpdef CharacterLevelParser _add_character(self, char new_character):
        pass

    def get_allowed_characters(self) -> str:
        cdef list char_list = [chr(c) for c in self._get_allowed_characters()]
        return "".join(char_list)

    cpdef cset[char] _get_allowed_characters(self):
        pass

    cpdef bint can_end(self):
        """
        Check if the parser can end.

        Returns:
            bool: True if the parser can end, False otherwise.
        """
        pass

    cpdef object cache_key(self):
        """
        Get the cache key for the parser.

        Returns:
            object: The cache key.
        """
        pass
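The build script mentioned above can stay quite small. A minimal sketch (the .pyx filename here is hypothetical):

# setup.py - minimal build script for the Cython extension
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="lmformatenforcer_cython_parsers",
    ext_modules=cythonize(
        "character_level_parser.pyx",  # hypothetical filename for the class above
        language_level=3,
    ),
)

Building in place is then a single command: python setup.py build_ext --inplace.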
Have you explored similar optimizations, or do you have any thoughts on integrating Cython into the project?
@tom-doerr I am skeptical; I suspect this is more about HuggingFace's framework itself. With the same workload, their logits processor is roughly 10x slower than vLLM or ExLlamaV2. One observation: if I force CUDA to sync at the beginning and the end of vLLM's logits processor, it runs as slowly as the HuggingFace one, while doing the same in HuggingFace makes no observable speed difference. I am not sure what this reveals, though, because logits masking inherently requires at least one CUDA sync (since the mask is created on the CPU).
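For reference, the sync experiment described above can be reproduced on the HuggingFace side with a small wrapper (a sketch, not the exact code used in the comparison; the inner processor is whatever masking logic is being measured):

import torch
from transformers import LogitsProcessor

class SyncedLogitsProcessor(LogitsProcessor):
    """Wraps another logits processor and forces a CUDA sync before and after it,
    to see how much of the cost is synchronization rather than the masking itself."""

    def __init__(self, inner):
        self.inner = inner

    def __call__(self, input_ids, scores):
        torch.cuda.synchronize()  # wait for all pending GPU work
        scores = self.inner(input_ids, scores)
        torch.cuda.synchronize()  # wait for the masking work to finish
        return scores

An instance can be passed to model.generate via its logits_processor argument to apply the forced syncs during generation.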