WuJinxuan
Completed the ARM part of LayerNorm:
1. neon: pack1, pack4
2. fp16s: pack1, pack4, pack8
3. fp16sa: pack1, pack4, pack8
4. bf16s: pack1, pack4

(PS: this work was done as part of the Rhino-Bird program / 犀牛鸟计划)
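For reference, the computation that these NEON/fp16/bf16 kernels vectorize is ordinary layer normalization over the last axis. A minimal NumPy sketch (shapes and `eps` chosen here purely for illustration, not taken from the ncnn source):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # normalize each row over the last axis, then apply affine scale/shift
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

x = np.random.randn(4, 8).astype(np.float32)
out = layernorm(x, np.ones(8, np.float32), np.zeros(8, np.float32))
```

The packN variants compute exactly this, but with N channel elements interleaved per SIMD lane.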
The current ARM MultiHeadAttention only has the neon fp32 pack4 implementation from my PR several months ago; this PR fills in the rest:
1. fp32 pack1
2. fp16s pack1/4/8
3. fp16sa pack1/4/8
4. bf16s pack1/4 & naive
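As a reference point for what the packed kernels compute, the core of multi-head attention can be sketched in NumPy. This is a hedged sketch of scaled dot-product attention split across heads only; it omits the Q/K/V and output projections, and the head count and shapes here are illustrative, not ncnn's API:

```python
import numpy as np

def multihead_attention(q, k, v, num_heads):
    # q, k, v: (seq, embed); split embed into heads, scaled dot-product per head
    seq, embed = q.shape
    d = embed // num_heads
    out = np.empty_like(q)
    for h in range(num_heads):
        qs, ks, vs = (m[:, h * d:(h + 1) * d] for m in (q, k, v))
        scores = qs @ ks.T / np.sqrt(d)
        # numerically stable softmax over the key axis
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = scores / scores.sum(axis=-1, keepdims=True)
        out[:, h * d:(h + 1) * d] = attn @ vs
    return out

q = np.random.randn(5, 8).astype(np.float32)
# with v all ones, each output row is the attention-weighted sum of ones, i.e. ones
out = multihead_attention(q, q, np.ones((5, 8), np.float32), num_heads=2)
```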
## detail | 详细描述 | 詳細な説明

As in the following code:

```python
import torch
import torch.nn.functional as F

unfold_a = F.unfold(input_a, kernel_size=4, stride=4).permute(0, 2, 1)
unfold_b = F.unfold(input_b, kernel_size=4, stride=4)
output = torch.matmul(unfold_a, unfold_b)