FasterTransformer
T5 Beam Search Answer wrong
Branch/Tag/Commit: main
Docker Image Version: main
GPU name: A100-80G
CUDA Driver: 11.4
Reproduced Steps
Hi, I want to test the inference time of the FT version of MT5-small. With greedy search all the answers are correct, but after I switch to beam search (my code is shown further below) I ran into some questions:
1) beam_width cannot be 5; when I change it to 4 it works. Does the beam size have to be even? And what exactly does beam_width mean?
2) The beam search answers of the FT version of MT5-small are wrong. With beam_width = 4, the printed results look like this (each line is a separate answer):
'''
'', 'Extract key', 'Extract key phrase', 'Extract key phrases: rolling hills of the palouse elk river', '', 'Extract key', 'Extract key phrases: pampered', 'Extract key phrases: pampered chef',
'', 'Extract key', 'Extract key phrase', 'Extract key phrases: pingdom canary string',
'', 'Extract key', 'Extract key phrases: panama city', 'Extract key phrases: panama city beach',
'', 'Extract key', 'Extract key phrases: panda express', 'Extract key phrases: panda restaurant group',
'', 'Extract key', 'Extract key phrases: pala casino', 'Extract key phrases: pala interactive'
'''
The output generated by the HF model is correct; it looks like this (each line is a separate answer):
'''
'Extract key phrases: اعدادات الراوتر', 'Extract key phrases: صفحه اعدادات الراوتر', 'Extract key phrases: الدخول الي الراوتر', 'Extract key phrases: واجهه الراوتر',
'Extract key phrases: fina isl', 'Extract key phrases: fina swimming world cup', 'Extract key phrases: international swimming league', 'Extract key phrases: fina isl dispute',
'Extract key phrases: diaminopyridine', 'Extract key phrases: diaminopyridine phosphate', 'Extract key phrases: amifampridine phosphate', 'Extract key phrases: pyridinediamine',
'Extract key phrases: diaminopyridine', 'Extract key phrases: in patients with orthostatic hypotension and postural tachycardia syndrome', 'Extract key phrases: orthostatic hypotension and postural tachycardia syndrome', 'Extract key phrases: diaminopyridine 3',
'''
Could you please help me find out where the problem is?
The code below is based on summarization.py; I added a DataLoader, time-cost measurement, etc., so it is slightly different from the original, but they are roughly the same:
beam_width = 4
for i, (x1, x2) in enumerate(eval_loader):
    with torch.no_grad():
        # FT beam search: returns token ids and the length of each generated sequence
        output, ft_output_len = ft_t5(x1,
                                      None,
                                      beam_width,
                                      output_len,
                                      top_k,
                                      0.0,
                                      beam_search_diversity_rate=0.0,
                                      is_return_output_log_probs=False,
                                      is_return_cum_log_probs=False)
    if beam_width != 1:
        # flatten [bs, beam_width, output_len] into [bs * beam_width, 1, output_len]
        output = np.expand_dims(output.reshape(output.shape[0] * beam_width, output_len), axis=1)
        ft_output_len = np.expand_dims(ft_output_len.reshape(-1), axis=1)
    output_lines = tokenizer.batch_decode(output[:, 0], skip_special_tokens=True)
    print(output_lines)
    # output_lines = []
    # for j in range(len(output[:, 0])):
    #     output_lines.append(tokenizer.decode(output[j][0][:ft_output_len[j][0]], skip_special_tokens=True))
    b = datetime.datetime.now()
    time_con = (b - a).microseconds / 1000  # `a` is set with datetime.datetime.now() before the loop (timing code not shown)
    time_all.append(time_con)
    res = np.array([r.strip() for r in output_lines])
    if beam_width > 1:
        # regroup the flattened answers back into [num_samples, beam_width]
        res = res.reshape(-1, beam_width).tolist()
Sorry, here top_k = 1.
- You can add the case you need at https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/kernels/online_softmax_beamsearch_kernels.cu#L579.
- Can you print the shape of output directly? It should be [bs, beam_width, output length].
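In case it helps, here is a minimal sketch of such a check, assuming output and ft_output_len are the numpy arrays returned by ft_t5 in your script and tokenizer is the Hugging Face tokenizer you already use (it decodes every beam instead of only beam 0):

print(output.shape)         # expected: (batch_size, beam_width, output_len)
print(ft_output_len.shape)  # expected: (batch_size, beam_width)

for b in range(output.shape[0]):
    for k in range(output.shape[1]):
        # decode only the valid tokens of beam k for sample b
        length = int(ft_output_len[b][k])
        text = tokenizer.decode(output[b][k][:length], skip_special_tokens=True)
        print(f"sample {b}, beam {k}: {text}")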
Sure, for case 1 I'll look at the link, thanks. For case 2, I printed the shapes of output and ft_output_len; they are (32, 4, 32) and (32, 4) respectively. I set batch_size=32, beam_width=4, and output_len = 4, so it seems there is no problem here.
For 2, is it still a problem now?
2) The beam search answers of the FT version of MT5-small are wrong. With beam_width = 4, the printed results look like this (each line is a separate answer)
Yes. The beam search parameters are the same as in translate_example; is there a similar phenomenon in tasks such as translation? It still seems that the beam search answers of the FT version of MT5-small are wrong, and no reason has been found yet. Can you help me? Thanks a lot!
So, do you mean that the shape is correct but the results are wrong?
Can you provide an example to run a benchmark on both FT and HF so that we can see the accuracy gap between them?
Yeah, sure, I'll prepare a comparison, just a moment!
Hi, I added a comparative experiment between the FT and HF versions. The code is shown below:
N = args.batch_size
a = datetime.datetime.now()
for i, (x1, x2) in enumerate(eval_loader):
    with torch.no_grad():
        output, ft_output_len = ft_t5(x1,
                                      None,
                                      beam_width,
                                      output_len,
                                      top_k,
                                      0.0,
                                      beam_search_diversity_rate=0.0,
                                      is_return_output_log_probs=False,
                                      is_return_cum_log_probs=False)
    if beam_width != 1:
        output = np.expand_dims(output.reshape(output.shape[0] * beam_width, output_len), axis=1)
        ft_output_len = np.expand_dims(ft_output_len.reshape(-1), axis=1)
    output_lines = tokenizer.batch_decode(output[:, 0], skip_special_tokens=True)
    print(output_lines)
    print("-------------------------")
    with torch.no_grad():
        output = model.generate(x1["input_ids"].cuda(),
                                max_length=output_len + 1,
                                do_sample=False,
                                top_k=top_k,
                                num_beams=beam_width,
                                eos_token_id=tokenizer.eos_token_id,
                                pad_token_id=tokenizer.pad_token_id,
                                num_return_sequences=beam_width)
    decode_labels = tokenizer.batch_decode(output, skip_special_tokens=True)
    print(decode_labels)
    print("++++++++++++++++++++++++")
The answers are shown below; the first block is the FT version and the second block is the HF version. The outputs of the two versions are indeed inconsistent. Please help me check whether it is a problem with my call or with the underlying beam search implementation.
++++++++++++++++++++++++ ['', 'Extract key', 'Extract key phrases: اعدادات الراوتر', 'Extract key phrases: الدخول الي الراوتر']
['Extract key phrases: اعدادات الراوتر', 'Extract key phrases: صفحه اعدادات الراوتر', 'Extract key phrases: الدخول الي الراوتر', 'Extract key phrases: الدخول الي صفحه الراوتر'] ++++++++++++++++++++++++
We don't guarantee the results will be exactly the same; we only check that they provide the same quality, because generation models are easily affected by cumulative numerical differences.
Yeah, I know. With greedy search the answers of FT and HF are almost the same, but with beam search there is a big inconsistency between them, and the FT results show a clear regularity (a containment relationship: each later answer contains the previous one). If the difference were caused by cumulative numerical differences in the generation model, greedy search should also show a big difference between FT and HF, and I did not observe that. Can you check whether my way of calling the FT version is correct for plain beam search (no top_k/top_p sampling)? Here top_k and top_p are set to 0 and 0.0 respectively, and I don't see a big difference compared to other task calls, so I suspect there is a problem with the beam search implementation in online_softmax_beamsearch_kernels.cu. Can you help me? Thanks a lot.
If the difference were caused by cumulative numerical differences in the generation model, greedy search should also show a big difference between FT and HF.
How do you get this conclusion?
Can you check whether my way of calling the FT version is correct for plain beam search (no top_k/top_p sampling)?
I don't see any issue.
so I suspect there is a problem with the beam search implementation in online_softmax_beamsearch_kernels.cu, can you help me? Thanks a lot.
We will check again when we have bandwidth. But because all components share the same functions, we think the program should work correctly.
Thanks for your quick reply. Regarding the comment "If the difference were caused by cumulative numerical differences in the generation model, greedy search should also show a big difference between FT and HF", I think there are two reasons behind it.
Reason 1: the only difference between beam search and greedy search is the decoding strategy, and with greedy search FT and HF are almost the same. If cumulative numerical differences in the generation model caused the gap, they should also cause a big difference with greedy search.
Reason 2: the beam search decoding results of FT look like this (here top_k is set to 8):
prediction: ['', 'Extract', 'Extract key', 'Extract key phrase', 'Extract key phrases: what is', 'Extract key phrases: what is lambda', 'Extract key phrases: what is lambda control', 'Extract key phrases: what is lambda control in ships'] label: ['lambda control']
It seems the FT beam search results have an inclusion relationship: each later result contains the former one. This is very regular; if it were caused by cumulative numerical differences in the generation model, the differences between the beam search results should not be this regular.
To sum up, I suspect that the beam search implementation is not quite right. Would you help me check it? Thanks.
Hi, did you encounter the same situation during your testing? This is important for our project. If possible, please check the code or test the beam search path of FasterTransformer against Hugging Face Transformers again, thanks a lot.
We still have not found any bug yet.
Sorry, I want to know whether, when you use the beam search decoding strategy, FasterTransformer and Hugging Face Transformers give the same results. If you did not encounter this problem, could you post your calling code (for both FasterTransformer and Hugging Face Transformers)? Thank you very much.
I think that if you don't hit this issue, maybe my calling method is wrong. I would like to refer to your calling code, thanks!
Hi, would you like to post your calling code (for both FasterTransformer and Hugging Face Transformers)? I'll refer to it and check my script again, thank you very much!
Can you try to evaluate the scores of the different beam_idx values of FT? We find that the score of beam_idx = beam_width - 1 is better than the score of beam_idx = 0 in this task, while in other tasks beam_idx = 0 gives the best score. We are still investigating the reason.
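For reference, a rough sketch of how such per-beam scores could be computed, assuming the rouge_score package is available and using hypothetical names: ft_outputs as a list of per-sample lists of beam_width decoded strings, and labels as the reference strings.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True)

for beam_idx in range(beam_width):
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0, "rougeLsum": 0.0}
    for beams, ref in zip(ft_outputs, labels):
        # rouge_score signature is score(target, prediction)
        scores = scorer.score(ref, beams[beam_idx])
        for key in totals:
            totals[key] += scores[key].fmeasure
    print(f"beam_id: {beam_idx}")
    for key, value in totals.items():
        print(f"  {key} : {100.0 * value / len(labels)}")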
In FT, beam search is only supported for the early_stopping = True case (in HF's terms). So, if you want to compare HF and FT, you should set early_stopping=True in HF.
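For illustration, the HF call from the comparison script above could be adjusted like this, with early_stopping=True added so that HF's beam search also stops as soon as num_beams finished hypotheses are found (a sketch; the other arguments are kept as in the earlier snippet):

with torch.no_grad():
    output = model.generate(x1["input_ids"].cuda(),
                            max_length=output_len + 1,
                            do_sample=False,
                            num_beams=beam_width,
                            early_stopping=True,  # stop once num_beams finished hypotheses are found
                            eos_token_id=tokenizer.eos_token_id,
                            pad_token_id=tokenizer.pad_token_id,
                            num_return_sequences=beam_width)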
The following are comparisons between the different cases under beam_width = 4.
- The values of the four beams on HF when early_stopping=False:
Hugging Face (total latency: 12.587063 sec)
beam_id: 0
rouge1 : 4.219197793278645
rouge2 : 0.38353459406090984
rougeL : 3.8142499443068063
rougeLsum : 3.82312965217383
beam_id: 1
rouge1 : 5.022445044753288
rouge2 : 0.4987118145012882
rougeL : 4.587195717767148
rougeLsum : 4.594502569574494
beam_id: 2
rouge1 : 4.949093721394631
rouge2 : 0.4987118145012882
rougeL : 4.531974941532189
rougeLsum : 4.541734696830083
beam_id: 3
rouge1 : 3.4808567069593046
rouge2 : 0.38353459406090984
rougeL : 3.1922376760400972
rougeLsum : 3.2133808983009398
- The values of the four beams on HF when early_stopping=True:
Hugging Face (total latency: 6.968048 sec)
beam_id: 0
rouge1 : 3.2582749635995683
rouge2 : 0.43071847507331373
rougeL : 2.999982637116015
rougeLsum : 2.951565697359031
beam_id: 1
rouge1 : 3.9398344978800486
rouge2 : 0.5102639296187683
rougeL : 3.7955226264201865
rougeLsum : 3.7348855735014452
beam_id: 2
rouge1 : 3.2820790558219377
rouge2 : 0.5176850798970615
rougeL : 3.157003761136054
rougeLsum : 3.160429616605878
beam_id: 3
rouge1 : 2.6857632712956874
rouge2 : 0.5225726654298083
rougeL : 2.522481435313913
rougeLsum : 2.541316445556804
- The values of the four beams of FT:
Faster Transformers (total latency: 2.7281880000000003 sec)
beam_id: 0
rouge1 : 2.9727503947465617
rouge2 : 0.5176850798970615
rougeL : 2.819453438730627
rougeLsum : 2.8042071542664275
beam_id: 1
rouge1 : 3.2943592112742546
rouge2 : 0.5151515151515151
rougeL : 3.0032731135505633
rougeLsum : 2.9964696270762348
beam_id: 2
rouge1 : 4.660539782049569
rouge2 : 0.5153576582148012
rougeL : 4.263813957813543
rougeLsum : 4.228680118654764
beam_id: 3
rouge1 : 4.598939851056436
rouge2 : 0.6754138481933445
rougeL : 3.9422075185488894
rougeLsum : 3.8983985809254715
Has this been resolved?
Has this been solved? We are trying to use FasterTransformer for inference on T5-like models, and we find that even with early_stopping = True we still get different results between HF and FasterTransformer.
How large is the model? It seems that generative models just give different results between HF and FT. Do you see many differing tokens?
The model has about 100 million parameters. Limited by our usage scenario, we measure sentence-level accuracy, which is about 5-7% worse.
> In FT, beam search is only supported for the early_stopping = True case (in HF's terms). So, if you want to compare HF and FT, you should set early_stopping=True in HF.
Hi, it would be really helpful if you could elaborate a bit on this. Why does FT only support the early_stopping = True case? Is there any way we can adapt that? Having it set to False is crucial for our model.
Hi, I also encountered the same issue. With beam_size=1, the results of HF and FT are exactly the same. However, with beam_size=5, about 10% of the sentences differ in terms of sentence-level accuracy. I suspect the beam search implementation differs slightly between HF and FT.