Oleg Zabluda
> I clarified the tiling I was using in a later post here. It's actually 32x32, not 32x8. The 32x8 is what is visible to the user, but 32x32 is...
Aha! I get it now. For F(4x4,3x3), the correct formula for K32xN8, X2xY2(=4) is: for C>>3: 144/(36+156/32+72/4/8)=**3.3**; for C==3: 144/(36+156/32+72/4/8+90/4/3)=**2.8**. For the overlapped data transform, the correct number of FLIP is...
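The two ratios above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the numerator 144 is the 4x4 outputs times 3x3 multiplies of direct convolution, 36 is the 6x6 elementwise multiplies of F(4x4,3x3), and the remaining terms are copied verbatim from the comment without assuming what each one amortizes. The function name `speedup` and the variable names are hypothetical.

```python
# Reproduce the arithmetic quoted in the comment for F(4x4,3x3), K32xN8, X2xY2.
# Term values are taken verbatim from the comment; their names here are hypothetical.

def speedup(direct_mults, cost_terms):
    """Ratio of direct-convolution multiplies to the summed per-tile cost terms."""
    return direct_mults / sum(cost_terms)

DIRECT = 4 * 4 * 3 * 3                    # 144 multiplies per 4x4 output tile
large_c_terms = [36, 156 / 32, 72 / 4 / 8]
small_c_terms = large_c_terms + [90 / 4 / 3]   # extra term quoted for C == 3

print(round(speedup(DIRECT, large_c_terms), 1))  # 3.3
print(round(speedup(DIRECT, small_c_terms), 1))  # 2.8
```

The unrounded values are about 3.34 and 2.84, so both figures in the comment match to one decimal place.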
@scott-gray > The tile size is K32xN8 so it should be pretty versatile over a wide range of dimensions. Even C=3 performance is pretty good at a 1.7 vTflops. initially,...
@rsdubtso, thank you, these are great. I see great scalability to 2 sockets, with ~50% utilization (either one or two sockets), exactly opposite of my earlier guesses. Next natural experiments...
@rsdubtso: > I could not find the frequency fused for the i7 CPU mentioned above, but you can find it out using prime95 v27.9 or later, for example. However, current-related...
@emfomenk, thanks for the excellent table. Looking only at conv and fc layers, I see excellent scalability 2=>4=>8=>14=>28 in conv layers (except 2=>4=>8=14 in conv1 forward and 14=>28 in conv1...
@emfomenk, thank you for the summary. Sorry, I don't understand what you mean by > The behavior of Multinode Caffe almost duplicates the behavior of Singlenode Caffe. This puts some...
> I mean the algorithm in Multinode Caffe (underlying math) is the same as in Singlenode Caffe: Forward, Backward, SGD, the same parameters and so on. In particular it means...
> Article says """reached 80% top-5 accuracy in just over 5 hours on a 64-node""". Is that 90 epochs with minibatch=1024? AlexNet from the original paper reached 81.8% after 90...
> We always ran Caffe (singlenode and multinode versions) for 90 epochs Great. I think the web article should say that explicitly, especially since it is actually faster than what...