Add "Fill-in-Middle" pipeline
What does this PR do?
This PR adds the Fill-in-Middle pipeline to 🤗 transformers.
The FIM objective was proposed in Efficient Training of Language Models to Fill in the Middle. The authors showed that autoregressive language models can learn to infill text after a straightforward transformation is applied to the dataset: a span of text is simply moved from the middle of a document to its end.
As discussed in #27059
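For context, here is a minimal illustration of that transformation (the sentinel token names below are placeholders, not any particular model's actual tokens):

# Minimal sketch of the FIM transformation in PSM (prefix-suffix-middle) order.
def fim_transform(document: str, start: int, end: int) -> str:
    prefix, middle, suffix = document[:start], document[start:end], document[end:]
    # The middle span is moved to the end, with sentinel tokens marking each part.
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"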
Who can review?
@sayakpaul @ArthurZucker
This PR is currently a WIP and the pipeline code is copied from the text_generation pipeline. I opened this PR to discuss implementation details.
@ArthurZucker, since we don't have a Dict with all the compatible models (like MODEL_FOR_CAUSAL_LM_MAPPING_NAMES in text_generation), would a list (example below) be a good starting point?
if is_torch_available():
    FIM_SUPPORTED_MODELS_TORCH = [
        "codellama/CodeLlama-7b-hf",
        "codellama/CodeLlama-13b-hf",
        "codellama/CodeLlama-34b-hf",
        "codellama/CodeLlama-7b-Python-hf",
        "codellama/CodeLlama-13b-Python-hf",
        "codellama/CodeLlama-34b-Python-hf",
        ...
    ]
Also, here's what I am doing in the implementation:
Instead of expecting the end user to pass three FIM tokens in the input, we can go ahead with a single token called <FILL_ME> (inspired by what was done in CodeLlama). The prefix and suffix tokens will be loaded from the tokenizer, while the user's input will only contain the <FILL_ME> token, which is used to split the text into prefix and suffix text (as CodeLlama's tokenizer does here). I will also add an option in the pipeline to return either just the filled text or the full text (including the prompt).
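A minimal sketch of the splitting step (the names here are illustrative, not the final implementation):

# Illustrative only: split the user prompt on the single <FILL_ME> placeholder.
FILL_TOKEN = "<FILL_ME>"

def extract_prefix_suffix(prompt_text, fill_token=FILL_TOKEN):
    if prompt_text.count(fill_token) != 1:
        raise ValueError(f"The prompt must contain exactly one {fill_token} token.")
    prefix, suffix = prompt_text.split(fill_token)
    return prefix, suffix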
Hi @ArthurZucker, could you please review this PR? FIMPipeline works and is compatible with the CodeLlama family of models.
Sure, could you make sure the CIs are green?
@ArthurZucker The documentation CI is failing consistently with the following error:
ERROR src/transformers/pipelines/fill_in_middle.py - ValueError: line 14 of the docstring for transformers.pipelines.fill_in_middle.FIMPipeline has inconsistent leading whitespace: 'def fib(x: int) -> int:'
However, in the FIM pipeline example in the docstring, the spaces are part of the generated code output.
>>> from transformers import pipeline
>>> PROMPT = '''
def fib(x: int) -> int:
    <FILL_ME>
    return fib(x-1) + fib(x-2)
'''
>>> generator = pipeline(model="codellama/CodeLlama-7b-hf")
>>> generator(PROMPT, do_sample=False)
[{'generated_text': "\ndef fib(x: int) -> int:\n\tif x == 0:\n\t\treturn 0\n\tif x == 1:\n\t\treturn 1\n\telse:\n\t\treturn fib(x-1) + fib(x-2)\n"}]
Should I ignore this?
You can probably ignore it with # doctest: +SKIP
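For example, the directive would be appended to the statement in the docstring that trips the check (placement shown here is only a sketch):

>>> generator(PROMPT, do_sample=False)  # doctest: +SKIP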
Also cc @Rocketknight1
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the detailed review, @ArthurZucker!
I also noticed that my Black auto-formatting settings made some unwanted changes to the text_generation.py file, so I am also reverting that file to the original branch version.
No worries, ping me whenever for another review!
@ArthurZucker About your suggestion to merge the CodeLlama FIM logic with the other model variants: it would complicate the pipeline code considerably, because the CodeLlama tokenizer handles much of the FIM-related work in tokenize() and other functions, and that code would have to be included directly in the pipeline. Keeping them separate (i.e. models that support a <FILL_ME> token out of the box vs. those that don't, like StarCoder) will be easier to debug and maintain and will remove redundancy.
FYI, I am thinking of something like this: the supported-model dict maps each model name to its FIM tokens and whether it supports out-of-the-box infilling (if it doesn't, None is stored in place of the infill token):
SUPPORTED_MODELS = {
    "bigcode/starcoder": ("<fim_prefix>", "<fim_middle>", "<fim_suffix>", None),
    "codellama/CodeLlama-7b-hf": ("▁<PRE>", "▁<MID>", "▁<SUF>", "<FILL_ME>"),
    # other models
}
...
prefix_token, middle_token, suffix_token, infill_token = self.SUPPORTED_MODELS[self.model.name_or_path]
if not infill_token:
    # No out-of-the-box infill token: split the prompt on the pipeline's <FILL_ME> placeholder
    # and rebuild it with the model's own FIM tokens.
    input_prefix, input_suffix = self.extract_prefix_suffix(prompt_text, "<FILL_ME>")
    if mode == "psm":
        prompt_text = prefix_token + input_prefix + suffix_token + input_suffix + middle_token
    else:  # "spm"
        prompt_text = suffix_token + input_suffix + prefix_token + input_prefix + middle_token
else:
    # Tokenize directly, since the infill token is supported out of the box
    # and doesn't need any token re-arrangement.
    ...
Please let me know your opinion on this!
IMO that is exactly the purpose of this pipeline. The functions should not necessarily have been part of the tokenizer, as they are only needed for the FIM task. So I am rather saying: let's not rely on any tokenizer, but handle the processing in the pipeline.
I see, thanks for clarifying, will push the updates soon!
@ArthurZucker I added the naming and init file changes as you requested. Working on the tests by taking the text-generation tests as a reference.
One doubt: the smallest model that supports FIM (that I know of) is codellama-7b, which is still a pretty large model and would need a GPU for testing. Is there a way to run the tests without such a big model?
You should be able to use 4bit quantization!
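A minimal sketch of what that could look like in a test (the "fill-in-middle" task string is hypothetical for this PR; the config values are only illustrative):

import torch
from transformers import BitsAndBytesConfig, pipeline

# Load the 7B checkpoint in 4-bit to keep the memory footprint manageable in tests.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
generator = pipeline(
    "fill-in-middle",  # hypothetical task name registered by this PR
    model="codellama/CodeLlama-7b-hf",
    model_kwargs={"quantization_config": quant_config},
)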
@ArthurZucker Once the generation config thing is clear, I will start adding tests.
Here are the tests I am thinking of adding:
- test_model - Tests a small codellama-7b model on the Fibonacci sequence problem
- test_infill_mode - Tests the tokenization for different infill modes
- test_missing_infill_token - Tests if the model throws an error when encountering a missing token in the prompt text
- test_multiple_infill_tokens - Tests if the model throws an error when multiple infill tokens are provided in the prompt text
- test_return_text - Tests if the model returns full text or only infilled text
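A rough sketch of what test_model could look like (the class name, task string, and assertions are placeholders, not the final tests):

import unittest

from transformers import pipeline
from transformers.testing_utils import require_torch_gpu, slow


class FIMPipelineTests(unittest.TestCase):
    @slow
    @require_torch_gpu
    def test_model(self):
        # Hypothetical task name; the prompt mirrors the docstring example above.
        generator = pipeline("fill-in-middle", model="codellama/CodeLlama-7b-hf")
        prompt = "def fib(x: int) -> int:\n    <FILL_ME>\n    return fib(x-1) + fib(x-2)\n"
        outputs = generator(prompt, do_sample=False)
        self.assertIsInstance(outputs, list)
        self.assertIn("generated_text", outputs[0])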
How does this sound?
Sounds good. No need for the generation config update. Tokens are strings, so they should be saved in the tokenizer_config.json IMO.
@ArthurZucker But for that, we need to save them to the tokenizer_config.json first for the tests to pass, right?
Yes, you can probably open a PR to the models and use the revision! WDYT?
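For example, a test could pin the tokenizer to that Hub PR's revision until it is merged (the revision ref below is a placeholder):

from transformers import AutoTokenizer

# Load the tokenizer from the Hub PR that adds the FIM tokens to tokenizer_config.json.
tokenizer = AutoTokenizer.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    revision="refs/pr/1",  # placeholder: the actual PR ref on the Hub would go here
)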
Hi @ArthurZucker, I was out during Easter and will work on this now. Sorry for the delay!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.