FasterTransformer
masked_multihead_attention_kernel with size_per_head=32 is faster than with size_per_head=64
Description
GPT-2 with size_per_head * head_num == 1024; the input token count is 1792 and the output token count is 2048.
decoder layers | size_per_head=32 | size_per_head=64
12             | 550 ms           | 630 ms
24             | 980 ms           | 1130 ms
48             | 1900 ms          | 2200 ms
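For reference, dividing the numbers above by the layer count gives a roughly constant per-layer gap, which is what you would expect if the extra cost comes from a per-layer kernel. A minimal sketch of that arithmetic, using only the figures from the table above:

```python
# Per-layer latency implied by the measurements in the table above (ms).
timings = {  # decoder layers: (size_per_head=32, size_per_head=64)
    12: (550, 630),
    24: (980, 1130),
    48: (1900, 2200),
}

for layers, (t32, t64) in timings.items():
    gap = (t64 - t32) / layers
    print(f"{layers:2d} layers: {t32 / layers:5.1f} vs {t64 / layers:5.1f} ms/layer, "
          f"gap ~{gap:.1f} ms/layer")
# The gap stays around 6-7 ms per layer at every depth.
```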
Reproduced Steps
[ft_instance_hyperparameter]
max_batch_size=1               ; Use for allocate the buffer
max_seq_len=2048               ; The sequence length of position embedding table, should move to model hyper-parameter
beam_width=4                   ; beam width for beam search
top_k=4                        ; k value for top k sampling
top_p=1                        ; p value for top p sampling
temperature=1.0                ; Use for sampling
repetition_penalty=2.0         ; Use for sampling
len_penalty=0.0
beam_search_diversity_rate=0.0
data_type=fp16
sparse=0
model_name=self_defined
model_dir=models/openai-gpt-models/c-model/124m/1-gpu/
shared_contexts_ratio=1.0

[request]
request_batch_size=1           ; determine by the request
request_output_len=256         ; determine by the request
return_log_probs=false         ; return the output log probs and cumulative log probs.
context_log_probs=false        ; include input contexts in the cumulative log probability computation.

[gpt_124M]
head_num=12
size_per_head=64
vocab_size=50257
decoder_layers=12

[self_defined]
head_num=16
size_per_head=64
vocab_size=30000
decoder_layers=12
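The config above only shows the size_per_head=64 run. A hedged sketch of how the [self_defined] section would differ between the two runs, assuming head_num is scaled so that head_num * size_per_head stays at 1024 (head_num=32 for the 32 case is inferred from that constraint, not taken from the issue):

```python
# Hypothetical helper that emits the [self_defined] section for one run of the
# comparison, keeping head_num * size_per_head fixed at 1024.
def self_defined_section(size_per_head: int, decoder_layers: int = 12,
                         hidden: int = 1024) -> str:
    head_num = hidden // size_per_head   # 32 heads at 32, 16 heads at 64
    return (
        "[self_defined]\n"
        f"head_num={head_num}\n"
        f"size_per_head={size_per_head}\n"
        "vocab_size=30000\n"
        f"decoder_layers={decoder_layers}\n"   # 12/24/48 in the table above
    )

print(self_defined_section(32))   # variant for the size_per_head=32 run
print(self_defined_section(64))   # matches the section shown above
```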
If you mean that, under the constraint size_per_head * head_num = 1024, the latencies for size_per_head = 32 and size_per_head = 64 differ slightly, this is expected behavior because the GEMM shapes of their attention are different.
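As a rough illustration of that point (a sketch of the per-step arithmetic, not FasterTransformer's actual kernel code; the 1792-token cache length is taken from the reported input size): with the hidden size fixed at 1024, each generation step does the same total attention work in both settings, but it is split across a different number of heads, so the shape of the kernel's work changes.

```python
# Per-step decode attention work under hidden = head_num * size_per_head = 1024,
# with a KV cache of 1792 past tokens (the reported input length).
hidden, past_len = 1024, 1792

for size_per_head in (32, 64):
    head_num = hidden // size_per_head
    # Per head: q.K^T is (1, size_per_head) x (size_per_head, past_len),
    # then attn.V is (1, past_len) x (past_len, size_per_head).
    flops_per_head = 4 * past_len * size_per_head
    kv_elems_per_head = 2 * past_len * size_per_head   # K and V rows read
    print(f"size_per_head={size_per_head:2d}: {head_num:2d} heads, "
          f"total step FLOPs={flops_per_head * head_num:,}, "
          f"KV elements read={kv_elems_per_head * head_num:,}")
# Totals are identical for both settings; only the per-head shape (and with it
# the kernel's efficiency) differs.
```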
This only happens in masked_multihead_attention_kernel; for the other kernels, 64 is faster than 32.
For the other kernels, if you keep size_per_head * head_num = 1024, they should perform the same. For the MHA kernel, this is an expected result.
Closing this bug because it is inactive. Feel free to re-open it if you still have any problems.