Question on model_max_length (DeBERTa-V3)
System Info
- `transformers` version: 4.18.0
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.3
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.5.1 (False)
- Tensorflow version (GPU?): 2.4.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: N/A
- Using distributed or parallel set-up in script?: N/A
Who can help?
@LysandreJik @SaulLu
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I'm interested in finding out the maximum sequence length that a model can be run with. After some code browsing, my current understanding is that this is stored in the tokenizer property model_max_length.
I wrote a simple script that loads a tokenizer for a pretrained model and prints the model max length. This is the important part:
# initialize the tokenizer to be able to print model_max_length
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=model_args.use_fast_tokenizer,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)
logger.info(f"Model max length {tokenizer.model_max_length}")
I used this to print the max sequence length for models such as BERT and RoBERTa, all with the expected results. For DeBERTa, however, I get confusing results.
If I run my script with DeBERTa-v3 as follows:
python check_model_max_len.py --model_name microsoft/deberta-v3-large --output_dir ./tmp --cache_dir ./tmp/cache
I get Model max length 1000000000000000019884624838656
If I understand correctly, this is a large integer used for models that can support "infinite" sequence lengths.
If I run my script with --model_name microsoft/deberta-v2-xlarge, I get Model max length 512
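Digging a bit further (my own guess, not something I've verified): the huge value matches int(1e30), which transformers seems to use as a "no limit configured" sentinel (VERY_LARGE_INTEGER in tokenization_utils_base). A minimal sketch, assuming that constant is importable as in current 4.x releases, to reproduce the comparison:

# Sketch: show where the huge number comes from and compare v2 vs v3.
# Assumes VERY_LARGE_INTEGER lives in transformers.tokenization_utils_base.
from transformers import AutoTokenizer
from transformers.tokenization_utils_base import VERY_LARGE_INTEGER

print(int(1e30))           # 1000000000000000019884624838656 (float rounding of 1e30)
print(VERY_LARGE_INTEGER)  # the same sentinel, used when no model_max_length is configured

for name in ["microsoft/deberta-v3-large", "microsoft/deberta-v2-xlarge"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.model_max_length)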
I don't understand if this is a bug or a feature :) My understanding is that the main difference between DeBERTa V2 and V3 is that V3 is pretrained with an ELECTRA-style replaced-token-detection discriminator instead of MLM. I don't understand why this difference would lead to different supported max sequence lengths between the two models.
I also don't understand why some properties are hardcoded in the Python files, e.g.:
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
"microsoft/deberta-v2-xlarge": 512,
"microsoft/deberta-v2-xxlarge": 512,
"microsoft/deberta-v2-xlarge-mnli": 512,
"microsoft/deberta-v2-xxlarge-mnli": 512,
}
I would expect these to be in the config files for the corresponding models.
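As a quick cross-check on my side (assuming the Hub configs for these checkpoints define the field), the model config does expose a positional limit via max_position_embeddings, which can be read without touching the tokenizer:

# Sketch: read max_position_embeddings from the model config rather than the tokenizer.
from transformers import AutoConfig

for name in ["microsoft/deberta-v2-xlarge", "microsoft/deberta-v3-large"]:
    config = AutoConfig.from_pretrained(name)
    print(name, getattr(config, "max_position_embeddings", None))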
Expected behavior
I would expect the max supported lengths for DeBERTa-V2 and DeBERTa-V3 models to be the same, unless I'm missing something. Thanks for your help!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
It's likely an error! Do you want to open a discussion on the model repo directly? https://huggingface.co/microsoft/deberta-v3-base/discussions/new
I get the same result: 1000000000000000019884624838656
I'm seeing the same for the 125m and 350m OPT tokenizers (haven't checked the larger ones):
>>> AutoTokenizer.from_pretrained("facebook/opt-350m")
PreTrainedTokenizer(name_or_path='facebook/opt-350m', vocab_size=50265, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True)})
>>> AutoTokenizer.from_pretrained("facebook/opt-125m")
PreTrainedTokenizer(name_or_path='facebook/opt-125m', vocab_size=50265, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True)})
Is this definitely a bug?
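One way to check whether a given checkpoint actually specifies the limit (a sketch, assuming both repos ship a tokenizer_config.json, which I believe they do):

# Sketch: inspect the raw tokenizer_config.json on the Hub; if model_max_length
# is missing there, the tokenizer falls back to the int(1e30) sentinel.
import json
from huggingface_hub import hf_hub_download

for repo in ["facebook/opt-350m", "microsoft/deberta-v3-large"]:
    path = hf_hub_download(repo, "tokenizer_config.json")
    with open(path) as f:
        print(repo, json.load(f))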
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
DeBERTa v3 uses relative position embeddings, which means it isn't limited to the typical 512-token limit.
As taken from section A.5 in their paper:
With relative position bias, we choose to truncate the maximum relative distance to k as in equation 3. Thus in each layer, each token can attend directly to at most (2k - 1) tokens and itself. By stacking Transformer layers, each token in the l-th layer can attend to at most (2k-1)*l tokens implicitly. Taking DeBERTa_large as an example, where k = 512, L = 24, in theory, the maximum sequence length that can be handled is 24,528.
That being said, it will start to slow down a ton once the sequence length gets bigger than 512.
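For illustration, here is a rough sketch of pushing past 512 tokens (my own example, not from the paper; it assumes microsoft/deberta-v3-base and enough memory for a 1024-token forward pass):

# Sketch: because the position information is relative, the forward pass does not
# index into a fixed 512-entry absolute position table, but cost grows quickly.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModel.from_pretrained("microsoft/deberta-v3-base")

long_text = "word " * 2000  # deliberately much longer than 512 tokens
inputs = tokenizer(long_text, truncation=True, max_length=1024, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
print(inputs["input_ids"].shape, outputs.last_hidden_state.shape)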
Yes, I thought this might be the case; however, the same is true for DeBERTa v2 if I remember correctly, and the answer for that one is different. What I was asking in the original post is why there is a difference between v2 and v3. Thanks for clarifying part of the question/answer.
I meant to add to my last post: the max length of 1000000000000000019884624838656 typically indicates an error where the max length is not specified in the tokenizer config file.
There was a discussion about it here: https://huggingface.co/google/muril-base-cased/discussions/1 And the solution was to modify the tokenizer config file: https://huggingface.co/google/muril-base-cased/discussions/2
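For anyone hitting this in the meantime, a workaround along the same lines (my suggestion, not an official fix): either add "model_max_length": 512 to the checkpoint's tokenizer_config.json, as was done for MuRIL, or override the value at load time:

# Workaround sketch: set the limit explicitly so padding/truncation logic does not
# see the int(1e30) sentinel. 512 is chosen here for illustration only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/deberta-v3-large",
    model_max_length=512,
)
print(tokenizer.model_max_length)  # 512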
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This is still an issue with the config file and/or config file parser.
@bcdarwin
What is the issue?