Devin Matthews

264 comments by Devin Matthews

> I recommend that we insert a comment somewhere (in the `.appveyor.yml` file?) that will remind future readers of this issue.

Done.

Apparently *all* test failures are ignored in the Makefile (the test commands are prefixed with `-`). @fgvanzee, can you remind me what the rationale for this is? It makes it hard for other people...

No, L1 and L2 operations are still single-threaded.

The `generic` implementation will have better cache behavior than netlib BLAS, but it also does packing, which will slow things down for small and medium-sized matrices. It's not totally clear...
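To make the packing cost concrete, here is a simplified sketch (not BLIS's actual packing code) of copying an m x k column-major block of A into micro-panels of MR rows each, zero-padding the last one. The copy touches every element of the block once, i.e. O(mk) extra work; for large problems that is amortized over the O(mnk) multiply, but for small and medium sizes it can be a noticeable fraction of the total. The function `pack_a` and the value `MR = 4` are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MR 4 /* hypothetical micro-panel height */

/* Pack an m x k column-major block of A (leading dimension lda) into
   ceil(m / MR) micro-panels. Within each micro-panel, column p is stored
   as MR contiguous values, and rows past m are zero-padded.
   a_packed must hold at least ceil(m / MR) * MR * k doubles.
   Returns the number of doubles written. */
static size_t pack_a(size_t m, size_t k, const double *a, size_t lda,
                     double *a_packed)
{
    size_t written = 0;
    for (size_t ir = 0; ir < m; ir += MR) {
        /* Number of real (non-padding) rows in this micro-panel. */
        size_t mr = (m - ir < MR) ? m - ir : MR;
        for (size_t p = 0; p < k; ++p) {
            for (size_t i = 0; i < MR; ++i) {
                a_packed[written++] = (i < mr) ? a[(ir + i) + p * lda] : 0.0;
            }
        }
    }
    return written;
}
```

For a 5 x 2 block this produces two micro-panels (16 values), the second holding a single real row followed by zeros; those padding zeros are part of the overhead that matters for small sizes.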

OK, I guess I'm not really clear on why you care about the performance of the BLIS `generic` configuration. Even with cache blocking, it will never be "high performance".

@loveshack Returning to the original question: I think one way to make the "generic" implementation faster would be to add a fully unrolled branch and temporary storage of C to the...
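The comment above is cut off, so the following is only one possible reading of the idea, sketched under assumptions: a reference microkernel that accumulates into a fixed-size temporary MR x NR tile (which the compiler can keep in registers), then writes back to C through a branch with compile-time trip counts when the whole tile is in bounds, so the compiler can fully unroll that path. The name `ref_gemm_ukr`, its signature, and `MR = NR = 4` are hypothetical and do not match BLIS's real microkernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define MR 4 /* hypothetical register-block sizes for double */
#define NR 4

/* C(0:m, 0:n) := beta * C + alpha * A * B, where a is a packed MR x k
   micro-panel (column p stored as MR contiguous values) and b is a packed
   k x NR micro-panel (row p stored as NR contiguous values). C has general
   row/column strides rs_c and cs_c. For brevity beta == 0 is not special-cased;
   a real kernel must not read C in that case. */
static void ref_gemm_ukr(size_t m, size_t n, size_t k, double alpha,
                         const double *a, const double *b, double beta,
                         double *c, ptrdiff_t rs_c, ptrdiff_t cs_c)
{
    assert(m <= MR && n <= NR);

    /* Temporary storage for the full tile; fixed size lets the compiler
       keep it in registers and unroll the i/j loops. */
    double ab[MR][NR] = {{0.0}};

    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i][j] += a[p * MR + i] * b[p * NR + j];

    if (m == MR && n == NR) {
        /* Fast path: constant trip counts, so this can be fully unrolled. */
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j) {
                double *cij = &c[(ptrdiff_t)i * rs_c + (ptrdiff_t)j * cs_c];
                *cij = beta * *cij + alpha * ab[i][j];
            }
    } else {
        /* Edge case: write back only the m x n part of the tile. */
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j) {
                double *cij = &c[(ptrdiff_t)i * rs_c + (ptrdiff_t)j * cs_c];
                *cij = beta * *cij + alpha * ab[i][j];
            }
    }
}
```

Keeping the tile in a local array also decouples the inner loops from C's strides, so the same code serves row- and column-stored C.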

@loveshack What architectures in particular are you having a problem with?

@fgvanzee I was mostly talking about the actual `generic` configuration vs. the reference kernel being used in a particular configuration.

> i686, ppc64, ppc64le, and s390x

@loveshack For which of those architectures can we assume vectorization with the default flags?

@fgvanzee I would suggest:

1. Changing the default MR and NR to 4x16, 4x8, 4x8, and 4x4 (for the s, d, c, and z datatypes, respectively).
2. Rewriting the reference gemm kernel to:
   - be row-major,
   - be fully...
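One property of the suggested MR x NR values, worked out here by arithmetic rather than stated in the comment: with a row-major C tile, each row spans NR x sizeof(element) bytes, which comes to 64 bytes for all four datatypes (16 x 4, 8 x 8, 8 x 8, 4 x 16), so every tile is 4 x 64 = 256 bytes. The compile-time checks below make that explicit, assuming 4-byte `float` and 8-byte `double` (C lays out a complex value as two of its real type). The enum and macro names are hypothetical, not BLIS's configuration macros.

```c
#include <assert.h>
#include <stddef.h>

/* Suggested default register blocking (MR x NR) per datatype. */
enum { S_MR = 4, S_NR = 16, D_MR = 4, D_NR = 8,
       C_MR = 4, C_NR = 8,  Z_MR = 4, Z_NR = 4 };

/* Bytes occupied by one row of a row-major MR x NR tile of C. */
#define ROW_BYTES(type, nr) ((size_t)(nr) * sizeof(type))

/* Every datatype ends up with the same 64-byte row. */
_Static_assert(ROW_BYTES(float, S_NR) == 64, "s row is 64 bytes");
_Static_assert(ROW_BYTES(double, D_NR) == 64, "d row is 64 bytes");
_Static_assert(ROW_BYTES(float _Complex, C_NR) == 64, "c row is 64 bytes");
_Static_assert(ROW_BYTES(double _Complex, Z_NR) == 64, "z row is 64 bytes");
```

Because the inner loop of a row-major kernel runs along those NR contiguous elements, a uniform row width means the same vector-friendly access pattern in every datatype.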