whole-matrix einsum is slower than hand-written code

Open dsharlet opened this issue 5 years ago • 2 comments

The current timing I see for the matrix tests is:

reduce_matrix time: 10.6356 ms
einsum time: 13.4737 ms
reduce_tiles time: 0.890935 ms
einsum_tiles time: 0.929729 ms

The einsum_tiles version is close enough to reduce_tiles to count as zero-cost, but einsum is about 27% slower than reduce_matrix, so something is wrong there. This is strange because it is the easier case.
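For anyone reproducing these numbers, here is a minimal sketch of how a per-version time in milliseconds could be measured. `time_ms` is a hypothetical helper written for illustration, not the harness the tests actually use; it assumes that taking the fastest of a few runs is an acceptable way to reduce noise.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper (not part of the library): calls `fn` `runs` times
// and returns the fastest wall-clock time of a single call, in milliseconds.
template <class Fn>
double time_ms(Fn&& fn, int runs = 5) {
  assert(runs > 0);
  using clock = std::chrono::steady_clock;
  double best = std::numeric_limits<double>::infinity();
  for (int r = 0; r < runs; ++r) {
    auto t0 = clock::now();
    fn();
    auto t1 = clock::now();
    std::chrono::duration<double, std::milli> dt = t1 - t0;
    best = std::min(best, dt.count());
  }
  return best;
}
```

Usage would look like `double ms = time_ms([&] { multiply(a, b, c); });`, where `multiply` stands in for whichever version is being timed.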

dsharlet avatar Jul 29 '20 05:07 dsharlet

I just added an einsum version of all 3 kinds of matrix multiply:

reference time: 37.1274 ms
reduce_cols time: 36.2295 ms
einsum_cols time: 36.356 ms
reduce_rows time: 2.61265 ms
einsum_rows time: 2.79075 ms
reduce_matrix time: 11.3214 ms
einsum_matrix time: 12.8849 ms
reduce_tiles time: 0.907943 ms
einsum_tiles time: 0.944206 ms

All the other einsum usages are close to their hand-written counterparts; only einsum_matrix seems a bit slow.

dsharlet avatar Jul 29 '20 05:07 dsharlet

On the compiler Travis uses, the results are really interesting:

reference time: 67.4285 ms
reduce_cols time: 57.6136 ms
einsum_cols time: 52.8567 ms
reduce_rows time: 4.55528 ms
einsum_rows time: 3.27354 ms
reduce_matrix time: 4.84482 ms
einsum_matrix time: 15.1919 ms
reduce_tiles time: 12.926 ms
einsum_tiles time: 2.00451 ms
  • The plain C reference is the slowest version!
  • einsum is faster in every case except einsum_matrix, which is much slower.
  • `einsum_tiles` apparently succeeds in vectorizing the tile reduction, where `reduce_tiles` does not.

These times appear to be consistent across Travis runs too, so I don't think this is noise.
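As background on why loop structure matters so much to these numbers, the sketch below contrasts two loop orders for a plain matrix multiply. Both functions and their names are illustrative assumptions, not the code from the tests. The first order reads `B` with a stride of `N` in its inner loop, which compilers often fail to vectorize; the second keeps the inner loop unit-stride over `B` and `C`. Whether this is what separates the versions timed above has not been verified.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major, flat storage: A is M x K, B is K x N, C is M x N.

// Order (i, j, k): each output is a dot product; the inner loop
// steps through B with stride N.
void matmul_ijk(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t K,
                std::size_t N) {
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      float sum = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        sum += A[i * K + k] * B[k * N + j];
      }
      C[i * N + j] = sum;
    }
  }
}

// Order (i, k, j): the inner loop steps through a row of B and a row
// of C with unit stride, accumulating into C.
void matmul_ikj(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t K,
                std::size_t N) {
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) C[i * N + j] = 0.0f;
    for (std::size_t k = 0; k < K; ++k) {
      const float a = A[i * K + k];
      for (std::size_t j = 0; j < N; ++j) {
        C[i * N + j] += a * B[k * N + j];
      }
    }
  }
}
```

Both functions compute the same product; only the memory access pattern of the innermost loop differs, which is the kind of difference that can change whether a given compiler vectorizes the reduction.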

dsharlet avatar Jul 29 '20 05:07 dsharlet