Oleg Zabluda

Results: 47 comments by Oleg Zabluda

For one 4x4 block, F(2x2, 3x3), standard direct convolution uses (3 \* 3) \* (2 \* 2) = 36 multiplications and 6 \* 4 = 24 additions, for a total of 36+24=60 FLOP. Ignoring amortized filter [1],...
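
As a sanity check on the multiplication count (a minimal sketch of my own, not code from the thread), a 3x3 filter slid over one 4x4 tile produces a 2x2 output and performs 4 \* 9 = 36 multiplications:

```python
import numpy as np

def direct_conv_tile(d, g):
    """Direct (sliding-window) correlation of a 4x4 input tile d with a 3x3 filter g.
    Each of the 2x2 = 4 outputs needs 3x3 = 9 multiplications -> 36 per tile."""
    y = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            y[i, j] = np.sum(d[i:i+3, j:j+3] * g)  # 9 multiplications per output
    return y

d = np.random.randn(4, 4)
g = np.random.randn(3, 3)
print(direct_conv_tile(d, g))
```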

@andravin Aha! This is why you keep referring only to the number of multiplications in "arithmetic complexity reduction".

@soumith @rsdubtso
> The numbers are not as impressive anymore (as expected).
> Caffe + MKL - ~5100 ms
> IntelCaffe + MKL - ~3052 ms
> Speedup:...

@soumith, does it really matter? Typically only the first iteration differs much from the others.

@scott-gray:
> In related news, I just finished the first winograd fprop/bprop fp32 kernel. It is fully fused and requires no additional memory. But the big news is that it...

@rsdubtso,
> Hi, I'm one of the developers who worked on this package. [...] Looking at the logs, I see that the speedups for the convolution layers are not as...

@andravin
> F(2x2,3x3) has a maximum speedup of (2x2x3x3)/(4x4) = 2.25

This is the number in the paper, just below equation (9). But that counts multiplications only. What about additions? Standard...
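
For reference, the 2.25 figure is just the ratio of multiplication counts per output tile; a generic F(m x m, r x r) version of that arithmetic (function name is mine) looks like:

```python
def mult_ratio(m, r):
    """Multiplication-count ratio of direct vs. Winograd F(m x m, r x r):
    direct needs m*m outputs * r*r multiplies each; the Winograd element-wise
    stage needs (m + r - 1)^2 multiplies."""
    return (m * m * r * r) / ((m + r - 1) ** 2)

print(mult_ratio(2, 3))  # 2.25 for F(2x2,3x3), multiplications only
```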

I see. It would totally make sense to introduce FLIP, AKA "arithmetic complexity" from your paper. Direct uses 36 FLIP and Winograd F(2x2,3x3) uses 16 FLIP. We can't use "FLOP",...
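
To ground the 16-FLIP count, here is a minimal F(2x2,3x3) sketch (my own illustration) using the transform matrices from the Lavin-Gray paper; the element-wise product is the only multiplication stage and touches exactly 4x4 = 16 values:

```python
import numpy as np

# Winograd F(2x2,3x3) transforms (Lavin & Gray, arXiv:1509.09308).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """One 4x4 input tile d and one 3x3 filter g -> 2x2 output tile."""
    U = G @ g @ G.T       # filter transform (amortizable across all tiles)
    V = BT @ d @ BT.T     # input transform
    M = U * V             # element-wise product: the 16 multiplications
    return AT @ M @ AT.T  # output transform

def direct_conv_tile(d, g):
    """Reference direct correlation: 36 multiplications."""
    return np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                     for i in range(2)])

d = np.random.randn(4, 4)
g = np.random.randn(3, 3)
print(np.allclose(winograd_f2x2_3x3(d, g), direct_conv_tile(d, g)))  # True
```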

@scott-gray, I think your 10 vTFLOP/s, **163%** utilization for F(2x2,3x3), K32xN8 is more impressive than you modestly describe. For N>>1, K>>1, the max theoretical utilization is 36/16=**2.25**. But for N=8, you...
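
One way to read a >100% utilization (my own back-of-the-envelope; the fp32 peak below is an assumed Maxwell Titan X class figure, not stated in the thread): vTFLOP/s counts the FLOPs that direct convolution would have needed, so a Winograd kernel can exceed the device peak by up to the 36/16 multiplication ratio:

```python
# Assumed numbers -- the fp32 peak is a hypothetical Maxwell Titan X class value.
peak_tflops = 6.14        # assumed device fp32 peak
virtual_tflops = 10.0     # direct-convolution-equivalent throughput quoted above

print(f"utilization: {virtual_tflops / peak_tflops:.0%}")  # ~163%
print(f"ceiling for F(2x2,3x3): {36 / 16:.0%}")            # 225%
```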

@scott-gray on reddit:
> Once I have these kernels well tuned I'll move onto the much more complicated F(4x4,3x3) transform where as much as a 4x speedup may be possible...
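
The 4x ceiling follows from the same multiplication-only arithmetic as for F(2x2,3x3) (a quick check of my own, not from the quote):

```python
m, r = 4, 3                            # F(4x4,3x3)
direct_mults = m * m * r * r           # 16 outputs * 9 multiplies = 144
winograd_mults = (m + r - 1) ** 2      # 6x6 element-wise product = 36
print(direct_mults / winograd_mults)   # 4.0 -> "as much as a 4x speedup"
```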