FasterTransformer
masked_multihead_attention_kernel with size_per_head=32 is faster than with size_per_head=64
Description
GPT-2 with size_per_head * head_num == 1024; the input token count is 1792 and the output token count is 2048.
decoder layers | size_per_head=32 | size_per_head=64
12             | 550 ms           | 630 ms
24             | 980 ms           | 1130 ms
48             | 1900 ms          | 2200 ms
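For reference, dividing the numbers above by the layer count gives a roughly constant per-layer gap, which is what you would expect if the extra cost comes from a per-layer kernel. A minimal sketch of that arithmetic, using only the figures from the table above:

```python
# Per-layer latency implied by the measurements in the table above (ms).
timings = {  # decoder layers: (size_per_head=32, size_per_head=64)
    12: (550, 630),
    24: (980, 1130),
    48: (1900, 2200),
}

for layers, (t32, t64) in timings.items():
    gap = (t64 - t32) / layers
    print(f"{layers:2d} layers: {t32 / layers:5.1f} vs {t64 / layers:5.1f} ms/layer, "
          f"gap ~{gap:.1f} ms/layer")
# The gap stays around 6-7 ms per layer at every depth.
```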
Reproduced Steps
[ft_instance_hyperparameter]
max_batch_size=1               ; Use for allocate the buffer
max_seq_len=2048               ; The sequence length of position embedding table, should move to model hyper-parameter
beam_width=4                   ; beam width for beam search
top_k=4                        ; k value for top k sampling
top_p=1                        ; p value for top p sampling
temperature=1.0                ; Use for sampling
repetition_penalty=2.0         ; Use for sampling
len_penalty=0.0
beam_search_diversity_rate=0.0
data_type=fp16
sparse=0
model_name=self_defined
model_dir=models/openai-gpt-models/c-model/124m/1-gpu/
shared_contexts_ratio=1.0

[request]
request_batch_size=1           ; determine by the request
request_output_len=256         ; determine by the request
return_log_probs=false         ; return the output log probs and cumulative log probs.
context_log_probs=false        ; include input contexts in the cumulative log probability computation.

[gpt_124M]
head_num=12
size_per_head=64
vocab_size=50257
decoder_layers=12

[self_defined]
head_num=16
size_per_head=64
vocab_size=30000
decoder_layers=12
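The config above only shows the size_per_head=64 run. A hedged sketch of how the [self_defined] section would differ between the two runs, assuming head_num is scaled so that head_num * size_per_head stays at 1024 (head_num=32 for the 32 case is inferred from that constraint, not taken from the issue):

```python
# Hypothetical helper that emits the [self_defined] section for one run of the
# comparison, keeping head_num * size_per_head fixed at 1024.
def self_defined_section(size_per_head: int, decoder_layers: int = 12,
                         hidden: int = 1024) -> str:
    head_num = hidden // size_per_head   # 32 heads at 32, 16 heads at 64
    return (
        "[self_defined]\n"
        f"head_num={head_num}\n"
        f"size_per_head={size_per_head}\n"
        "vocab_size=30000\n"
        f"decoder_layers={decoder_layers}\n"   # 12/24/48 in the table above
    )

print(self_defined_section(32))   # variant for the size_per_head=32 run
print(self_defined_section(64))   # matches the section shown above
```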
If you mean that, under the constraint size_per_head * head_num = 1024, the latencies for size_per_head = 32 and size_per_head = 64 differ slightly, this is expected behavior because the GEMM shapes of their attention are different.
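As a rough illustration of that point (a sketch of the per-step arithmetic, not FasterTransformer's actual kernel code; the 1792-token cache length is taken from the reported input size): with the hidden size fixed at 1024, each generation step does the same total attention work in both settings, but it is split across a different number of heads, so the shape of the kernel's work changes.

```python
# Per-step decode attention work under hidden = head_num * size_per_head = 1024,
# with a KV cache of 1792 past tokens (the reported input length).
hidden, past_len = 1024, 1792

for size_per_head in (32, 64):
    head_num = hidden // size_per_head
    # Per head: q.K^T is (1, size_per_head) x (size_per_head, past_len),
    # then attn.V is (1, past_len) x (past_len, size_per_head).
    flops_per_head = 4 * past_len * size_per_head
    kv_elems_per_head = 2 * past_len * size_per_head   # K and V rows read
    print(f"size_per_head={size_per_head:2d}: {head_num:2d} heads, "
          f"total step FLOPs={flops_per_head * head_num:,}, "
          f"KV elements read={kv_elems_per_head * head_num:,}")
# Totals are identical for both settings; only the per-head shape (and with it
# the kernel's efficiency) differs.
```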
This only happens in masked_multihead_attention_kernel; for the other kernels, 64 is faster than 32.
For the other kernels, if you keep size_per_head * head_num = 1024, they should perform the same. For the MHA kernel, this is an expected result.
Closing this bug because it is inactive. Feel free to re-open it if you still have any problems.