Devin Matthews
Yes, TBLIS has TBB support and it is almost as performant as optimized OpenMP. The idea is to port this to BLIS at some point but there is no ETA...
@pjknowles can you attach the config.log from src/external/tci?
@pjknowles on macOS it looks like this was a race condition in the pthreads code. I pushed a fix to the `develop` branch. I'm testing all the threading backends now...
@xrq-phys Thanks for this. `armv8a` is fine as long as it is generic enough across those uarchs (e.g. do Cortex-A53 and ThunderX2 share blocking parameters?). The only confusion is with...
> Noticed that TBLIS requires block sizes to be compile-time constants (i.e. constexprs).

Yes, although it would be possible to kludge runtime numbers in there. MR/NR do actually have to...
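To illustrate what compile-time block sizes look like in practice, here is a minimal sketch. It is not TBLIS code: `BlockConfig` and `micro_kernel` are hypothetical names, and the packed layout is an assumption made for the example. It only shows one place where constant MR/NR are convenient: the accumulator tile is a fixed-size array, and the inner loops have constant trip counts that the compiler can fully unroll.

```cpp
#include <cstddef>

// Hypothetical configuration type (not a TBLIS name): the register block
// sizes MR and NR are template arguments, so they are compile-time constants.
template <std::size_t MR_, std::size_t NR_>
struct BlockConfig {
    static constexpr std::size_t MR = MR_;
    static constexpr std::size_t NR = NR_;
};

// Reference micro-kernel: C[MR x NR] += A_panel * B_panel.
// Assumed layout: a holds k packed columns of length MR, b holds k packed
// rows of length NR, and c is row-major with leading dimension ldc.
template <typename Config, typename T>
void micro_kernel(std::size_t k, const T* a, const T* b, T* c, std::size_t ldc) {
    constexpr std::size_t MR = Config::MR;
    constexpr std::size_t NR = Config::NR;

    // The array bounds must be constant expressions, which is why MR and NR
    // cannot simply be runtime values here.
    T acc[MR][NR] = {};

    for (std::size_t p = 0; p < k; ++p)
        for (std::size_t i = 0; i < MR; ++i)
            for (std::size_t j = 0; j < NR; ++j)
                acc[i][j] += a[p * MR + i] * b[p * NR + j];

    // Accumulate the finished tile into C.
    for (std::size_t i = 0; i < MR; ++i)
        for (std::size_t j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}
```

A call such as `micro_kernel<BlockConfig<6, 8>>(k, a, b, c, ldc)` fixes the tile shape at instantiation time; supporting runtime values would need a dispatch layer on top of instantiations like this one.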
I had no idea it was possible to cross-compile for macOS... the atomic functions come from libatomic on Linux and libSystem on macOS (or are builtins perhaps?), so maybe that...
The matrix multiplication primitives are essentially the same as in [BLIS](https://github.com/flame/blis); you can find lots of performance graphs for BLIS [here](https://github.com/flame/blis/blob/master/docs/Performance.md). It is typically as fast as or faster than OpenBLAS...
The problem is that TBLIS doesn't currently include optimized complex microkernels for most architectures. Basically you are getting a slightly fancy triple loop for these cases. The approach that I...
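For readers unfamiliar with what an unoptimized fallback looks like, here is a rough sketch of a plain triple-loop complex matrix multiply. This is not the TBLIS reference kernel: `ref_zgemm` is a hypothetical name, row-major storage is an assumption, and the real fallback is described above as somewhat fancier. The point is only that nothing in a loop nest like this is hand-vectorized or tuned for a particular architecture.

```cpp
#include <complex>
#include <cstddef>

using cplx = std::complex<double>;

// Hypothetical reference routine: C (m x n) += A (m x k) * B (k x n).
// All matrices are assumed to be dense and row-major.
void ref_zgemm(std::size_t m, std::size_t n, std::size_t k,
               const cplx* a, const cplx* b, cplx* c) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            cplx sum(0.0, 0.0);
            // Dot product of row i of A with column j of B.
            for (std::size_t p = 0; p < k; ++p)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] += sum;
        }
    }
}
```

Each complex multiply-add here expands to several real multiplies and adds, and the compiler is left to schedule them on its own, which is the kind of gap an optimized complex microkernel would close.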
Yes, unfortunately complex conjugation does not work at the moment. I should get time to fix this in a month or so. As for the conjugation flag, it is an...
Thanks for the suggestion @marcinz. I will make a simple patch for now (same comment applies as in #45), but I'll raise this suggestion in BLIS.