
Blank output at inference when using a custom-trained T5 model

Open gaurav21s opened this issue 2 years ago • 10 comments

Hi, when I use the standard versions of the T5 models, like 't5-small' or 't5-base', everything works fine and I get output. But when I try my custom-trained T5 base model on my data, the output after the kernl optimization is blank.

Can you please look into this, or do you have any idea what might cause it?

gaurav21s avatar Jan 02 '23 06:01 gaurav21s

Hi @gaurav21s

By blank, do you mean all the outputs are empty/zeroed tensors? Do you have any warning or error messages?

jonathlela avatar Jan 05 '23 10:01 jonathlela

Hi Jonathan, by blank I mean it was giving blank space in the output.

gaurav21s avatar Jan 05 '23 12:01 gaurav21s

I was running inference for testing and the tokenizer was working fine, but the output after decoding was just blank space.

gaurav21s avatar Jan 05 '23 12:01 gaurav21s

I can confirm this problem also exists in the e2e T5 tutorial: if one changes t5-small to t5-base, the output of .generate in the last cell (after optimization) is all zeros.

For the environment, I used this repo's Dockerfile to create a VS Code dev container inside Windows WSL2, with an RTX 3090 (24 GB).

wangjunhaoumich avatar Jan 07 '23 23:01 wangjunhaoumich

Thanks for your reports, I can confirm this bug and we're investigating it.

Simple code to reproduce it:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

from kernl.model_optimization import optimize_model


model_name = "t5-base"

model = AutoModelForSeq2SeqLM.from_pretrained(pretrained_model_name_or_path=model_name).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_ids = tokenizer(
    "translate English to French: The house in the woods is wonderful, can we buy it ?",
    return_tensors="pt",
    pad_to_multiple_of=8,
    padding=True,
).to("cuda")

# replace encoder and decoder kernels with the kernl implementations
optimize_model(model.encoder)
optimize_model(model.decoder)

# generate under fp16 autocast
with torch.inference_mode(), torch.cuda.amp.autocast(enabled=True, dtype=torch.float16, cache_enabled=True):
    output = model.generate(input_ids["input_ids"], min_length=22, max_length=22)
    print(output[0])
    print(tokenizer.decode(output[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))

displays:

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       device='cuda:0')

Disabling replace_layer_norm_rms fixes the output.

jonathlela avatar Jan 11 '23 08:01 jonathlela

@jonathlela can you share a reproduction at the rmsnorm kernel level?

Simple analysis (local machine, RTX 3090) seems to show that the input of the rmsnorm kernel contains NaN values in the large T5 flavour. It happens even WITHOUT Kernl optimizations. With the base model, the model seems to work (the output is correct). @jonathlela to print intermediate values, don't forget to remove CUDA graphs, which doesn't like print() or raise Exception (it will segfault).
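
For anyone repeating this analysis, here is a minimal sketch of such a NaN check, assuming plain PyTorch forward hooks on the Hugging Face modules (run it on the unoptimized model, or with CUDA graphs removed as noted above; the helper name is made up for illustration):

import torch

def add_nan_hooks(model):
    # register a forward hook on every submodule and report non-finite outputs
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, tuple) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite values in output of {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# usage: call add_nan_hooks(model) once, then run model.generate(...) as usual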

@gaurav21s how did you train T5? With fp16 or bf16/fp32? If it was trained in BF16, I would not be surprised if some tensors are outside of the FP16 range.
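
As a rough sanity check on that hypothesis (an illustrative sketch, not a confirmed diagnosis; overflow usually shows up in activations rather than weights, so this only catches the most obvious cases):

import torch

fp16_max = torch.finfo(torch.float16).max  # ~65504
for name, param in model.named_parameters():
    max_abs = param.detach().abs().max().item()
    if max_abs > fp16_max:
        print(f"{name}: max |value| {max_abs:.1f} exceeds the FP16 range")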

T5 base output:

/home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
tensor([    0,   325,  4053,   247,   110,  5912,   259, 27015,  1074,     6,
         3215,   106,     7,    18, 10529,     3,    40,    31, 13541,     3,
           58,     3], device='cuda:0')
La maison dans les bois est merveilleuse, pouvons-nous l'acheter? 

T5 large:

❯ python t5.py
Downloading: 100%|████| 1.21k/1.21k [00:00<00:00, 911kB/s]
/home/geantvert/.local/share/virtualenvs/kernl/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
(warning repeated three times)
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       device='cuda:0')

pommedeterresautee avatar Jan 14 '23 21:01 pommedeterresautee

@jonathlela I have the same issue with optimized t5-base. You mentioned that disabling "replace_layer_norm_rms" fixes the issue. How can replace_layer_norm_rms be disabled?

I've tried commenting out this line (https://github.com/ELS-RD/kernl/blob/91530e66e72c023c475e6f02ae2cbc5f7122d0e2/src/kernl/optimizer/dynamo_backend.py#L33) and re-running the code, but still got empty output.
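
One way to narrow this down further (a diagnostic sketch, not a fix; it reuses the t5-base prompt from the repro above and no Kernl optimization at all) is to compare generation with and without fp16 autocast on the plain model:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# plain (unoptimized) model, same prompt as the repro above
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base").eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer(
    "translate English to French: The house in the woods is wonderful, can we buy it ?",
    return_tensors="pt",
).to("cuda")

with torch.inference_mode():
    out_fp32 = model.generate(inputs["input_ids"], min_length=22, max_length=22)

with torch.inference_mode(), torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):
    out_fp16 = model.generate(inputs["input_ids"], min_length=22, max_length=22)

print(tokenizer.decode(out_fp32[0], skip_special_tokens=True))
print(tokenizer.decode(out_fp16[0], skip_special_tokens=True))

If the fp16 run already degrades on a custom checkpoint while fp32 is fine, the numerics of the checkpoint under half precision are suspect rather than the kernel replacement itself.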

Pavelrst avatar Jan 24 '23 09:01 Pavelrst

T5 weights are in BF16, and triton 2.0 does not fully support BF16. We're waiting for the fix to be propagated: https://github.com/openai/triton/pull/1306
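
A quick way to see which Triton build is installed locally (whether a given release actually contains that PR is an assumption to verify against the Triton changelog):

import triton

print(triton.__version__)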

jonathlela avatar Mar 23 '23 08:03 jonathlela

Hello @jonathlela

Now that https://github.com/openai/triton/pull/1306 is merged, would using the most recent OpenAI Triton with kernl resolve this issue?

hchoi-moveworks avatar Mar 31 '23 03:03 hchoi-moveworks

Hello @jonathlela

Would there be other large models we can try with Kernl?

It seems like larger versions of the T5 model do not work due to this issue. GPT models also seem to have errors: https://github.com/ELS-RD/kernl/issues/146

Could we try a GPT model? Is the above issue fixed?

hchoi-moveworks avatar Apr 11 '23 20:04 hchoi-moveworks