Richard Janis Goldschmidt

Results 54 comments of Richard Janis Goldschmidt

@bluss Here is the documentation for the `cblas_gemm_*` routines: https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm-1#2A58B860-609A-44CC-9812-E47BD01810CC The implementation details are at the bottom of that page. One of the few documents that discusses it is this one: http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/Batched-BLAS-2017/talk12-gurney.pdf ~~Two relevant...

Note the comment at the bottom of the API doc (emphasis mine): > After computing these four multiplication terms separately, they are summed from left to right. The results from...

I have found mention of integer gemm in the context of BLIS, but it looks like nothing came of it: https://groups.google.com/forum/#!topic/blis-devel/qA00lB2yGY0

This discussion is revealing in terms of how to determine optimal kernel parameters: https://github.com/flame/blis/issues/253 In particular, [this](https://github.com/flame/blis/issues/253#issuecomment-423369745) states: > @VirtualEarth Turn your attention to [Eq. 1](http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf): > > ``` >...

@bluss: I have adjusted `gemm_packed` to work with kernels of shapes other than `8x8` and `8x4`. However, Rust seems to have issues finding the correct associated types and constants of...
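To make the idea of kernel shapes carried as associated items concrete, here is a minimal sketch. The trait `Kernel`, the structs `K8x8` and `K16x32`, and the helpers `tile_len` and `scratch_tile` are hypothetical names chosen for illustration; they are not the crate's actual definitions. The sketch only shows that generic code can read a kernel's `MR`, `NR` and element type, and it sizes the scratch tile with a `Vec` because the length depends on the kernel.

```rust
/// Hypothetical description of a micro-kernel (illustration only).
trait Kernel {
    /// Element type the kernel operates on.
    type Elem: Copy + Default;
    /// Rows of the output tile computed by one kernel call.
    const MR: usize;
    /// Columns of the output tile computed by one kernel call.
    const NR: usize;
}

/// An 8x8 kernel over f32 (shape taken from the comment above).
struct K8x8;
impl Kernel for K8x8 {
    type Elem = f32;
    const MR: usize = 8;
    const NR: usize = 8;
}

/// A 16x32 kernel over i16 (an assumed example shape).
struct K16x32;
impl Kernel for K16x32 {
    type Elem = i16;
    const MR: usize = 16;
    const NR: usize = 32;
}

/// Number of elements in one MR x NR output tile of kernel `K`.
fn tile_len<K: Kernel>() -> usize {
    K::MR * K::NR
}

/// Heap-allocated, zero-initialised scratch tile for kernel `K`.
/// Its length is computed from the kernel's associated constants at run time.
fn scratch_tile<K: Kernel>() -> Vec<K::Elem> {
    vec![<K::Elem as Default>::default(); tile_len::<K>()]
}
```

For example, `scratch_tile::<K16x32>()` yields a `Vec<i16>` with 512 elements, while `scratch_tile::<K8x8>()` yields a `Vec<f32>` with 64 elements.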

@bluss Thanks for the note, was looking through RFCs. That's another use for const generics, I guess. I wonder if replacing the buffer by a `Vec` would be very bad....

Wow, for small matrices using a `Vec` leads to some serious slowdowns: ``` name buf_mask ns/iter vec_mask ns/iter diff ns/iter diff % speedup layout_f64_032::ccc 2,187 2,244 57 2.61% x 0.97...

Blowing up the masked buffer to 1024 (16*32*2, kernel is 16x32, i16 takes 2 bytes) elements at least doesn't seem to affect performance: ``` running 16 tests test layout_f64_032::ccc ......

This certainly needs more tuning. This is some terrible performance, as of now: ``` cargo bench i8 running 18 tests test layout_i8_128::ccc ... bench: 86,794 ns/iter (+/- 3,997) test layout_i8_128::ccf...

The reason for those atrocious numbers is probably that I did not keep the number of available vector registers (`ymm*`) in mind. `avx` has 16 vector registers in total. This issue...
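The register budget can be checked with a little arithmetic. The sketch below assumes that the whole `MR x NR` accumulator tile is meant to live in `ymm` registers, that each register holds 32 bytes, and that 16 of them are available; the function names `ymm_needed` and `fits_in_ymm` are hypothetical. It ignores the extra registers a real kernel needs for loading operands, so it is an optimistic lower bound rather than a tuning rule.

```rust
/// Width of one `ymm` register in bytes (256 bits).
const YMM_BYTES: usize = 32;
/// Number of `ymm` registers assumed to be available.
const YMM_COUNT: usize = 16;

/// Registers needed to hold an `mr x nr` accumulator tile whose
/// elements are `elem_bytes` wide, rounded up to whole registers.
fn ymm_needed(mr: usize, nr: usize, elem_bytes: usize) -> usize {
    let tile_bytes = mr * nr * elem_bytes;
    (tile_bytes + YMM_BYTES - 1) / YMM_BYTES
}

/// True if the accumulator tile alone fits into the register file.
fn fits_in_ymm(mr: usize, nr: usize, elem_bytes: usize) -> bool {
    ymm_needed(mr, nr, elem_bytes) <= YMM_COUNT
}
```

Under these assumptions an `8x8` tile of 4-byte elements needs 8 registers and fits, whereas a `16x32` tile of 2-byte elements needs 32 registers, twice what is available, so part of it would have to spill to memory.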