volo
Some thoughts about volo
Thanks for your great work! After reading the paper, I have a question: can I think of volo as a "pixel-wise conditional conv" network?
The reasons are:
- Taken together, the weighted average and fold operations in Fig. 2 are effectively a conv operation, except that the "conv kernel" is generated by the outlook attention.
- The outlook attention, i.e. the `C -> k**4` operation, can be viewed as generating a "conv kernel" for all `HxW` pixels.
Combining these two points, I think volo is really like a "pixel-wise conditional conv" network.
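
To make the two points above concrete, here is a minimal NumPy sketch I put together (not the official implementation). It assumes a single head, stride 1, odd `k`, zero padding, no scaling factor, and random matrices `w_v` and `w_a` standing in for the learned value and `C -> k**4` projections; all function and variable names are my own. It computes the output once as unfold, weighted average, fold, and once as an explicit per-pixel conv whose kernel is assembled from the attention weights. Under these assumptions the two agree, with the per-pixel kernel covering `(2k-1) x (2k-1)` because fold sums the overlapping windows.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def outlook_unfold_fold(x, w_v, w_a, k):
    """Unfold -> weighted average -> fold (simplified, single head, stride 1)."""
    h, w, c = x.shape
    p = k // 2
    v = x @ w_v                                               # values, (h, w, c)
    a = softmax((x @ w_a).reshape(h, w, k * k, k * k), axis=-1)  # C -> k**4 per pixel
    v_pad = np.pad(v, ((p, p), (p, p), (0, 0)))
    out_pad = np.zeros_like(v_pad)
    for i in range(h):
        for j in range(w):
            win = v_pad[i:i + k, j:j + k].reshape(k * k, c)   # unfold: k x k window around (i, j)
            y = a[i, j] @ win                                 # weighted average, (k*k, c)
            out_pad[i:i + k, j:j + k] += y.reshape(k, k, c)   # fold: sum overlapping windows
    return out_pad[p:p + h, p:p + w]                          # drop the padded border

def outlook_as_pixelwise_conv(x, w_v, w_a, k):
    """Same output, written as a conv with a different kernel at every pixel."""
    h, w, c = x.shape
    p = k // 2
    v = x @ w_v
    # a[i, j, out_row, out_col, in_row, in_col]: attention of the window centred at (i, j)
    a = softmax((x @ w_a).reshape(h, w, k * k, k * k), axis=-1).reshape(h, w, k, k, k, k)
    v_pad = np.pad(v, ((2 * p, 2 * p), (2 * p, 2 * p), (0, 0)))
    out = np.zeros_like(v)
    for r in range(h):
        for s in range(w):
            kernel = np.zeros((2 * k - 1, 2 * k - 1))         # per-pixel "conv kernel"
            for i in range(max(r - p, 0), min(r + p, h - 1) + 1):
                for j in range(max(s - p, 0), min(s + p, w - 1) + 1):
                    # Pixel (r, s) is output slot (r-i+p, s-j+p) of the window at (i, j).
                    top, left = i - r + p, j - s + p
                    kernel[top:top + k, left:left + k] += a[i, j, r - i + p, s - j + p]
            patch = v_pad[r:r + 2 * k - 1, s:s + 2 * k - 1]   # neighbourhood centred at (r, s)
            out[r, s] = np.einsum("ab,abc->c", kernel, patch)
    return out

# Small random check that both formulations give the same result.
rng = np.random.default_rng(0)
H, W, C, K = 5, 6, 4, 3
x = rng.standard_normal((H, W, C))
w_v = rng.standard_normal((C, C))
w_a = rng.standard_normal((C, K ** 4))
assert np.allclose(outlook_unfold_fold(x, w_v, w_a, K),
                   outlook_as_pixelwise_conv(x, w_v, w_a, K))
```

In this sketch the kernel is shared across channels and depends on the input at each location, which is what I mean by "pixel-wise conditional conv". The actual model has details (multiple heads, pooling before the attention projection, scaling) that this toy version leaves out.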