
input_ids_seq_length is always 1

Open · ChrisSpraaklab opened this issue 1 year ago · 9 comments

System Info

  • transformers version: 4.26.1
  • Platform: Linux-5.4.0-113-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.12.1
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): 2.8.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.5.0 (gpu)
  • Jax version: 0.3.13
  • JaxLib version: 0.3.10
  • Using GPU in script?: yes

I am trying to generate output that is equal in length to the input (partly to avoid hallucinations and repetitions). In src/transformers/generation/utils.py I read how the input length is determined: if self.config.is_encoder_decoder (which is the case for me), input_ids_seq_length is computed from the input ids returned by _prepare_decoder_input_ids_for_generation, which builds a tensor of shape (batch_size, 1) filled with start tokens. This means input_ids_seq_length is always 1, making it useless for determining the input length (and for deriving the output length from it).
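For reference, here is a minimal sketch (not the actual implementation; the helper name prepare_decoder_input_ids is made up for illustration) of what _prepare_decoder_input_ids_for_generation effectively produces for encoder-decoder models:

import torch

# Simplified sketch: the decoder is seeded with a single start token per batch
# item, so the resulting tensor has shape (batch_size, 1) and its last
# dimension -- which is what input_ids_seq_length is read from -- is always 1.
def prepare_decoder_input_ids(batch_size: int, decoder_start_token_id: int) -> torch.Tensor:
    return torch.full((batch_size, 1), decoder_start_token_id, dtype=torch.long)

decoder_input_ids = prepare_decoder_input_ids(batch_size=2, decoder_start_token_id=0)
print(decoder_input_ids.shape[-1])  # 1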

Who can help?

@sgugger @muellerzr

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

The problem arises in a script of my own, but this example also highlights it. (The task I am working on is not summarization but grammar correction; that's why I want the output to be equal in length to the input.)

from transformers import AutoTokenizer, T5ForConditionalGeneration, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
config = GenerationConfig(max_new_tokens=0)

input_ids = tokenizer("summarize: My friends are cool but they eat too many carbs.", return_tensors="pt").input_ids
outputs = model.generate(input_ids, generation_config=config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Expected behavior

I would expect the output length to be determined by the input length + max_new_tokens: generation_config.max_length = generation_config.max_new_tokens + input_ids_seq_length

This is the case, but input_ids_seq_length is (wrongly) always 1, making the output length independent of the input and equal to max_new_tokens + 1.
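To make the reported behaviour concrete, here is a quick check (not part of the original report) comparing two inputs of different lengths:

from transformers import AutoTokenizer, T5ForConditionalGeneration, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
config = GenerationConfig(max_new_tokens=5)

short_ids = tokenizer("summarize: Hi.", return_tensors="pt").input_ids
long_ids = tokenizer("summarize: My friends are cool but they eat too many carbs.", return_tensors="pt").input_ids

# Both outputs contain at most max_new_tokens + 1 tokens (the +1 is the
# decoder start token), independent of the encoder input length.
print(model.generate(short_ids, generation_config=config).shape[1])
print(model.generate(long_ids, generation_config=config).shape[1])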

ChrisSpraaklab · Mar 10 '23 12:03

cc @gante

sgugger · Mar 10 '23 12:03

Hey @ChrisSpraaklab 👋 In both types of models, input_ids_seq_length is relative to the output of the model, which differs between encoder-decoder models (the output does not contain the prompt) and decoder-only models (the output contains the prompt). I agree that we might benefit from a rework there, for clarity :)

In any case, let's sort out your immediate issue! As the argument indicates, max_new_tokens will make the model generate up to max_new_tokens new tokens. As such, if you want to generate an output equal to the input, you'll have to set max_new_tokens=input_ids.shape[1].

Also, bear in mind that encoder-decoder models ALWAYS start the output with a BOS token. As such, the length of the output will be the length of the input + 1.
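To illustrate the difference between the two model families (a rough sketch, not part of the original reply; it assumes the gpt2 and t5-small checkpoints):

from transformers import AutoTokenizer, AutoModelForCausalLM, T5ForConditionalGeneration

# Decoder-only: the returned sequence contains the prompt, so its length is
# the prompt length plus up to max_new_tokens.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
prompt_ids = gpt2_tok("My friends are cool", return_tensors="pt").input_ids
print(prompt_ids.shape[1], gpt2.generate(prompt_ids, max_new_tokens=5).shape[1])

# Encoder-decoder: the returned sequence is decoder-side only and starts with
# the decoder start token, so its length is at most max_new_tokens + 1.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
enc_ids = t5_tok("summarize: My friends are cool but they eat too many carbs.", return_tensors="pt").input_ids
print(enc_ids.shape[1], t5.generate(enc_ids, max_new_tokens=5).shape[1])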

gante · Mar 10 '23 13:03

@gante Thanks for your quick response. However, what I mean is that when input_ids_seq_length is set to input_ids.shape[-1], this value is always equal to 1 (as it comes from _prepare_decoder_input_ids_for_generation).

        # 5. Prepare `input_ids` which will be used for auto-regressive generation
        if self.config.is_encoder_decoder:
            input_ids = self._prepare_decoder_input_ids_for_generation(
                batch_size,
                decoder_start_token_id=generation_config.decoder_start_token_id,
                bos_token_id=generation_config.bos_token_id,
                model_kwargs=model_kwargs,
                device=inputs_tensor.device,
            )
        else:
            input_ids = inputs_tensor if model_input_name == "input_ids" else model_kwargs.pop("input_ids")

        # 6. Prepare `max_length` depending on other stopping criteria.
        input_ids_seq_length = input_ids.shape[-1]
        has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None
        if has_default_max_length and generation_config.max_new_tokens is None:
            warnings.warn(
                f"Using `max_length`'s default ({generation_config.max_length}) to control the generation length. "
                "This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we"
                " recommend using `max_new_tokens` to control the maximum length of the generation.",
                UserWarning,
            )
        elif generation_config.max_new_tokens is not None:
            generation_config.max_length = generation_config.max_new_tokens + input_ids_seq_length
            if not has_default_max_length:
                logger.warn(
                    f"Both `max_new_tokens` (={generation_config.max_new_tokens}) and `max_length`(="
                    f"{generation_config.max_length}) seem to have been set. `max_new_tokens` will take precedence. "
                    "Please refer to the documentation for more information. "
                    "(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)",
                    UserWarning,
                )

In my understanding, doing as you suggested would make this line equivalent to 1 + 1, since max_new_tokens=input_ids.shape[1] (equal to 1) and input_ids_seq_length = input_ids.shape[-1] (equal to 1):

generation_config.max_length = generation_config.max_new_tokens + input_ids_seq_length

ChrisSpraaklab · Mar 10 '23 14:03

@ChrisSpraaklab inside generate, in encoder-decoder models like T5, input_ids is related to the decoder input ids. They are not the same as the input_ids you feed to .generate(), which will be used inside the encoder. Sadly, because .generate() is used with many types of models, we have this naming clash :)

Have you tried running

from transformers import AutoTokenizer, T5ForConditionalGeneration, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("summarize: My friends are cool but they eat too many carbs.", return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=input_ids.shape[1])
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

?

gante · Mar 10 '23 14:03

Thanks! Your solution does indeed produce the result I was looking for. I was just quite confused about the naming convention and documentation around max_new_tokens. I was under the impression that its value would be added to the length of the input of the encoder, not the decoder. However, I now understand why it doesn't behave as I expected it to.

ChrisSpraaklab · Mar 10 '23 15:03

So... even though we pass a token sequence input_ids to the generate() function, its length is irrelevant in the encoder-decoder model, and max_new_tokens in generate() is only added to the length of the decoder input, which, because of BOS, is always 1 in our case. Yes, this is somewhat confusing indeed.

Are there ways to motivate generate() to be more concise, but still run until EOS is generated, e.g., by setting a prior on the EOS?

davidavdav · Mar 10 '23 15:03

Hey @davidavdav -- yeah, you can try using Beam Search (i.e. num_beams>1) and passing a NEGATIVE length_penalty. This will nudge generation towards shorter outputs!
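A minimal sketch of that suggestion (not part of the original reply; the specific length_penalty and num_beams values are just illustrations and may need tuning):

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("summarize: My friends are cool but they eat too many carbs.", return_tensors="pt").input_ids

# num_beams > 1 enables beam search; a negative length_penalty penalises longer
# beams, so generation is nudged towards shorter outputs that still end in EOS.
outputs = model.generate(input_ids, num_beams=4, length_penalty=-0.5, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))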

gante · Mar 10 '23 15:03

BTW, if you come across better variable names, by all means, please suggest them :) We have so many features on our to-do list (including better docs) that every little help is precious!

gante · Mar 10 '23 15:03

Ah thanks, @gante -- I do appreciate the difficulty of choosing sensible parameter/variable names; the number of times I refactor names back and forth in my own code is quite scary!

davidavdav · Mar 10 '23 15:03

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Apr 09 '23 15:04