
Add "Fill-in-Middle" pipeline

Open tanaymeh opened this issue 1 year ago • 21 comments

What does this PR do?

This PR adds the Fill-in-Middle pipeline to 🤗 transformers.

The FIM objective was proposed in Efficient Training of Language Models to Fill in the Middle. The authors showed that autoregressive language models can learn to infill text after a straightforward transformation is applied to the dataset: a span of text is simply moved from the middle of a document to its end.
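For illustration, here is a minimal sketch of that transformation in PSM (prefix-suffix-middle) order, using generic <PRE>/<SUF>/<MID> sentinels rather than any particular model's tokens:

def fim_transform(document: str, start: int, end: int) -> str:
    # Move the span [start:end) from the middle of the document to its end,
    # marking each piece with a sentinel so the model learns to infill it.
    prefix, middle, suffix = document[:start], document[start:end], document[end:]
    return "<PRE>" + prefix + "<SUF>" + suffix + "<MID>" + middle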

As discussed in #27059

Who can review?

@sayakpaul @ArthurZucker

tanaymeh avatar Dec 04 '23 11:12 tanaymeh

This PR is currently WIP and the pipeline code is copied from the text_generation pipeline. Opened this PR for discussion on implementation details.

@ArthurZucker, since we don't have a Dict with all the compatible models (like MODEL_FOR_CAUSAL_LM_MAPPING_NAMES in text_generation), will a list (example below) be a good starting point?

if is_torch_available():
    FIM_SUPPORTED_MODELS_TORCH = [
        "codellama/CodeLlama-7b-hf",
        "codellama/CodeLlama-13b-hf",
        "codellama/CodeLlama-34b-hf",
        "codellama/CodeLlama-7b-Python-hf",
        "codellama/CodeLlama-13b-Python-hf",
        "codellama/CodeLlama-34b-Python-hf",
        ...
    ]

tanaymeh avatar Dec 04 '23 11:12 tanaymeh

Also, here's what I am doing in the implementation of this:

Instead of expecting the end user to pass three FIM tokens in the input, we can use a single token, <FILL_ME> (inspired by what CodeLlama does).

The prefix and suffix tokens will be loaded from the tokenizer, but the user's input will only contain the <FILL_ME> token; the text is split on that token to obtain the prefix and suffix text (the same way CodeLlama's tokenizer does it here).

I will also add an option in the pipeline to return either only the filled-in text or the full text (including the prompt).
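As a rough sketch of that splitting step (the function name and the single-occurrence check are assumptions about the implementation, not final API):

def split_on_infill_token(prompt, infill_token="<FILL_ME>"):
    # Split the user prompt into prefix and suffix around the single infill marker.
    parts = prompt.split(infill_token)
    if len(parts) != 2:
        raise ValueError(f"Expected exactly one {infill_token} token in the prompt, got {len(parts) - 1}.")
    return parts[0], parts[1]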

tanaymeh avatar Dec 04 '23 11:12 tanaymeh

Hi @ArthurZucker, could you please review this PR? FIMPipeline works and is compatible with CodeLlama family of models.

tanaymeh avatar Jan 20 '24 18:01 tanaymeh

Sure, could you make sure the CIs are green?

ArthurZucker avatar Jan 23 '24 15:01 ArthurZucker

@ArthurZucker The documentation CI is failing consistently with the following error:

ERROR src/transformers/pipelines/fill_in_middle.py - ValueError: line 14 of the docstring for transformers.pipelines.fill_in_middle.FIMPipeline has inconsistent leading whitespace: 'def fib(x: int) -> int:'

However, in the FIM pipeline example in the docstring, the spaces are part of the generated code output.

>>> from transformers import pipeline
>>> PROMPT = '''
        def fib(x: int) -> int:
            <FILL_ME>
            return fib(x-1) + fib(x-2)
    '''
>>> generator = pipeline(model="codellama/CodeLlama-7b-hf")
>>> generator(PROMPT, do_sample=False)
[{'generated_text': "\ndef fib(x: int) -> int:\n\tif x == 0:\n\t\treturn 0\n\tif x == 1:\n\t\treturn 1\n\telse:\n\t\treturn fib(x-1) + fib(x-2)\n"}]

Should I ignore this?

tanaymeh avatar Jan 26 '24 12:01 tanaymeh

you can probably ignore it with # doctest: +SKIP
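For reference, the directive goes at the end of the doctest line that should be skipped, e.g.:

>>> generator(PROMPT, do_sample=False)  # doctest: +SKIP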

ArthurZucker avatar Jan 30 '24 00:01 ArthurZucker

Also cc @Rocketknight1

ArthurZucker avatar Jan 30 '24 00:01 ArthurZucker

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Thanks for the detailed review, @ArthurZucker!

I also noticed that my Black auto-formatting settings made some unwanted changes to the text_generation.py file; I am reverting that file to the version on the original branch.

tanaymeh avatar Jan 30 '24 12:01 tanaymeh

No worries, ping me whenever for another review!

ArthurZucker avatar Jan 31 '24 01:01 ArthurZucker

@ArthurZucker About your suggestion to merge the CodeLlama FIM logic with the other model variants: it would complicate the pipeline code considerably, because a lot of code from the CodeLlama tokenizer would have to be included directly in the pipeline; the tokenizer currently handles much of the FIM-related work in tokenize() and other functions.

Keeping them separate (i.e., models that support a <FILL_ME> token out of the box vs. those that don't, like starcoder) will be easier to debug and maintain and will avoid redundancy.

FYI, I am thinking of something like this:

The supported-model dict will map each model name to its FIM tokens and indicate whether the model supports out-of-the-box infilling (if it doesn't, None is stored in place of the infill token):

SUPPORTED_MODELS = {
    "bigcode/starcoder": ("<fim_prefix>", "<fim_middle>", "<fim_suffix>", None),
    "codellama/CodeLlama-7b-hf": ("▁<PRE>", "▁<MID>", "▁<SUF>", "<FILL_ME>"),
    # other models
}

...

prefix_token, middle_token, suffix_token, infill_token = self.SUPPORTED_MODELS[self.model.name_or_path]

if infill_token is None:
    # The model has no out-of-the-box infill token: split the prompt on the
    # pipeline's <FILL_ME> marker and rebuild it with the model's FIM tokens.
    input_prefix, input_suffix = self.extract_prefix_suffix(prompt_text, "<FILL_ME>")

    if mode == "psm":
        prompt_text = (
            prefix_token
            + input_prefix
            + suffix_token
            + input_suffix
            + middle_token
        )
    else:  # "spm"
        prompt_text = (
            suffix_token
            + input_suffix
            + prefix_token
            + input_prefix
            + middle_token
        )
else:
    # Tokenize directly: the infill token (e.g. CodeLlama's <FILL_ME>) is
    # supported out of the box, so no token re-arrangement is needed.
    ...

Please let me know your opinion on this!

tanaymeh avatar Feb 07 '24 06:02 tanaymeh

IMO that is exactly the purpose of this pipeline. These functions never really needed to be part of the tokenizer, since they are only needed for the FIM task. So I am saying: let's not rely on any tokenizer, and handle the processing in the pipeline instead.

ArthurZucker avatar Feb 13 '24 03:02 ArthurZucker

IMO that is exactly the purpose of this pipeline. These functions never really needed to be part of the tokenizer, since they are only needed for the FIM task. So I am saying: let's not rely on any tokenizer, and handle the processing in the pipeline instead.

I see, thanks for clarifying, will push the updates soon!

tanaymeh avatar Feb 13 '24 07:02 tanaymeh

@ArthurZucker I added the naming and init file changes you requested. I am now working on the tests, using the text-generation tests as a reference.

One question: the smallest model that supports FIM (that I know of) is codellama-7b, which is still quite large and would need a GPU for testing. Is there a way to run the tests without such a big model?

tanaymeh avatar Feb 19 '24 14:02 tanaymeh

You should be able to use 4bit quantization!
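For example, a rough sketch of loading the checkpoint in 4-bit for the tests (assumes bitsandbytes is installed and a CUDA device is available):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# The quantized model and tokenizer can then be handed to the FIM pipeline under test.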

ArthurZucker avatar Feb 20 '24 04:02 ArthurZucker

@ArthurZucker Once the generation config thing is clear, I will start adding tests.

Here are the tests I am thinking of adding:

  1. test_model - Tests the smallest CodeLlama model (codellama-7b) on the Fibonacci sequence problem
  2. test_infill_mode - Tests the tokenization for the different infill modes
  3. test_missing_infill_token - Tests that the pipeline raises an error when the infill token is missing from the prompt text
  4. test_multiple_infill_tokens - Tests that the pipeline raises an error when multiple infill tokens are present in the prompt text
  5. test_return_text - Tests whether the pipeline returns the full text or only the infilled text

How does this sound?
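As a rough skeleton for one of these, say test_missing_infill_token (the "fill-in-middle" task string and the ValueError are assumptions about the pipeline's eventual API):

import pytest
from transformers import pipeline

def test_missing_infill_token():
    # A prompt with no infill marker should raise rather than silently generate.
    generator = pipeline("fill-in-middle", model="codellama/CodeLlama-7b-hf")  # hypothetical task name
    with pytest.raises(ValueError):  # assumed error type
        generator("def fib(x: int) -> int:")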

tanaymeh avatar Feb 27 '24 13:02 tanaymeh

Sounds good. No need for the generation config update. The tokens are strings, so they should be saved in the tokenizer_config.json IMO.

ArthurZucker avatar Feb 29 '24 10:02 ArthurZucker

Sounds good. No need for the generation config update. The tokens are strings, so they should be saved in the tokenizer_config.json IMO.

@ArthurZucker But for that, we first need to save them to the tokenizer_config.json for the tests to pass, right?

tanaymeh avatar Mar 11 '24 22:03 tanaymeh

Yes, you can probably open a PR to the models and use the revision! WDYT?
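For instance, the unmerged Hub PR can then be loaded via its revision ref (the PR number below is just a placeholder):

from transformers import AutoTokenizer

# Load the tokenizer from an unmerged Hub PR branch using its revision ref.
tokenizer = AutoTokenizer.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    revision="refs/pr/1",  # placeholder: the actual PR ref on the Hub
)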

ArthurZucker avatar Mar 25 '24 06:03 ArthurZucker

Hi @ArthurZucker, I was out during Easter and will work on this now. Sorry for the delay!

tanaymeh avatar Apr 05 '24 19:04 tanaymeh

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 30 '24 08:04 github-actions[bot]