Devin Matthews
@jeffhammond suggestions for any specific architectures?
@hominhquan thanks for your analysis. In practice, `BLIS_IR_NT` is always 1, as threading this loop just doesn't make sense on any architecture we've seen. Without diving into the details on...
I should also add that realistically `BLIS_JR_NT
OK. Do you have any performance numbers? It sounds like we can improve general parallel performance then.
@hominhquan is the general issue that `BLIS_JR_NT` should be a divisor of `BLIS_NC/BLIS_NR`? As I mentioned before, `BLIS_JR_NT` also shouldn't be very large, maybe 4 at most.
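The divisibility question above can be made concrete with a small sketch. It assumes the JR loop runs over `BLIS_NC/BLIS_NR` micro-panels and that those iterations are dealt out to `BLIS_JR_NT` threads in near-equal contiguous chunks; the real partitioning inside BLIS may differ, and the function names and block sizes used here are hypothetical, chosen only for illustration.

```python
def jr_iterations(nc: int, nr: int) -> int:
    """Number of NR-wide micro-panels in one NC-wide block (ceiling division)."""
    return -(-nc // nr)


def jr_nt_is_divisor(nc: int, nr: int, jr_nt: int) -> bool:
    """True if jr_nt evenly divides the JR iteration count."""
    return jr_iterations(nc, nr) % jr_nt == 0


def per_thread_counts(nc: int, nr: int, jr_nt: int) -> list[int]:
    """Assumed near-even split of JR iterations across jr_nt threads.

    This is an illustrative model, not BLIS's actual work-partitioning code.
    """
    base, extra = divmod(jr_iterations(nc, nr), jr_nt)
    # The first `extra` threads each take one additional iteration.
    return [base + (1 if t < extra else 0) for t in range(jr_nt)]


if __name__ == "__main__":
    # Hypothetical block sizes, not taken from any real BLIS configuration.
    nc, nr = 4096, 8
    for jr_nt in (3, 4):
        print(jr_nt, jr_nt_is_divisor(nc, nr, jr_nt), per_thread_counts(nc, nr, jr_nt))
```

Under this model, a `BLIS_JR_NT` that is not a divisor leaves some threads with one more micro-panel than others. For large iteration counts that imbalance is small, so whether it explains any observed slowdown would still need to be measured.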
No, BLIS can't override the user's choices, so I guess it's up to the user not to do that. I just added some notes to the Multithreading documentation.
@hominhquan I just had an interesting conversation where some cases were pointed out where `BLIS_IR_NT > 1` makes sense performance-wise. Does `BLIS_IR_NT = BLIS_JR_NT = 4` give you reasonable performance and...
@jeffhammond I'm not sure what you mean by this? I gather there is still a bug? but I'm not sure what
Comments on #351
> up until now we knew all desktop Skylakes to be sans AVX-512

Except for Skylake-X i7 and i9! And Cannon Lake! And Cascade Lake!