Devin Matthews
You aren't doing any threading along the M dimension (`BLIS_IC_NT`)?
Is this also a memory thing? Parallelizing along the IC loop would definitely be preferable. Alternatively, since you are currently just collapsing the IR/JR loops, why not set IR_NT=4 and...
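To make the loop-level knobs above concrete, here is a minimal sketch of setting them from a launcher script. It assumes BLIS's documented per-loop environment variables (`BLIS_JC_NT`, `BLIS_IC_NT`, `BLIS_JR_NT`, `BLIS_IR_NT`) and that the total thread count is their product. The helper names and the example factors are hypothetical and are not a recommendation from this thread.

```python
import os

# Per-loop thread-count variables that BLIS reads from the environment.
BLIS_LOOP_VARS = ("BLIS_JC_NT", "BLIS_IC_NT", "BLIS_JR_NT", "BLIS_IR_NT")


def blis_thread_env(jc=1, ic=1, jr=1, ir=1, base_env=None):
    """Return a copy of the environment with the BLIS loop factors set.

    Hypothetical helper: the factor values are illustrative only.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in zip(BLIS_LOOP_VARS, (jc, ic, jr, ir)):
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")
        env[name] = str(value)
    return env


def total_threads(env):
    """Total threads requested: the product of the per-loop factors."""
    total = 1
    for name in BLIS_LOOP_VARS:
        total *= int(env.get(name, "1"))
    return total
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching a benchmark, which makes it easy to compare parallelizing the IC loop against the JR/IR loops without editing the shell profile.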
@fgvanzee what might happen if the collapsed version were used all the time?
> But if we can find a more elegant way of expressing the logic that doesn't involve so much code duplication, I'm open to considering it. This was my concern...
@decandia50 if you configure using e.g. `configure intel` then it will compile in all the Intel architectures and select the proper one at runtime. While this isn't exactly the feature...
Oh, I didn't see that it's *reproducibility* that is the main issue. I think this feature should be relatively easy to add, but I can't hazard a guess on...
Although, on Linux `perf` is a much better tool.
@drew-parsons can you please extract the BLAS/LAPACK operation and parameters this corresponds to and construct a minimal reproducer (Fortran or C/CBLAS)?
There's definitely a mismatch in the Fortran arguments. Here are the docs if you want to take a stab at fixing it (I can't take a look until at least...
This is always the problem with Python wrappers... how feasible is it to try and get a backtrace of the segfault when called from Python?
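One low-effort option, sketched here under the assumption that the failing call can be reproduced from a short Python snippet, is the standard-library `faulthandler` module. With it enabled, the interpreter dumps the Python-level traceback to stderr when it receives a fatal signal such as SIGSEGV. The helper name below is hypothetical.

```python
import subprocess
import sys


def run_with_faulthandler(code):
    """Run `code` in a child interpreter with faulthandler enabled.

    Hypothetical helper: if the child crashes with a fatal signal, the
    Python traceback of the crashing thread is written to its stderr,
    which is captured and returned here.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )
```

This only identifies the Python line that triggered the crash. For the native frames inside the BLAS/LAPACK library, running the same script under a debugger (for example `gdb --args python script.py`, then `run` and `bt`) is still needed.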