torchscale
torchscale copied to clipboard
About running speed
Thanks for your excellent work! I have mentioned that torchscale serially executes the operation of mapping x to q, k, and v, in line 84~86 in file torchscale/component/multihead_attention.py. Will this be slower in your approach compared to doing it in parallel? For example, self.qkv_proj=nn.Linear(embed_dim, 3 * embed_dim)