X. Sherry Li
This looks good. However, the internal BLAS is not fast, so you need to link with a high-performance BLAS. The one you were using in Xcode does not have single...
A good open-source BLAS is OpenBLAS: https://www.openblas.net/
If you use GPU offload, the GEMM time is not collected. See the code segment here: https://github.com/xiaoyeli/superlu_dist/blob/b88e53497be29627eedf56c28e4808ef9d234c5b/SRC/double/pdgstrf.c#L1735 We need to add a timer in dSchCompUdt-gpu.c. We will fix this later.
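In case it helps to see what "adding a timer" around a GEMM call amounts to, here is a minimal host-side sketch. It is not the SuperLU_DIST code: `wall_seconds`, `naive_dgemm`, `gemm_stats_t`, and `timed_gemm` are hypothetical names made up for illustration, and the naive triple loop only stands in for a real BLAS call. On a GPU path the GEMM is typically launched asynchronously, so a real fix would presumably also need a synchronization (or device-side events) before the timer is stopped.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict C modes */
#include <stddef.h>
#include <time.h>

/* Wall-clock time in seconds, read from a monotonic clock. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Hypothetical stand-in for a BLAS dgemm: C = A * B for n-by-n
 * column-major matrices. A tuned BLAS would replace this loop nest. */
static void naive_dgemm(int n, const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int k = 0; k < n; ++k)
                s += A[i + (size_t)k * n] * B[k + (size_t)j * n];
            C[i + (size_t)j * n] = s;
        }
    }
}

/* Hypothetical accumulator for GEMM time and flop count. */
typedef struct {
    double gemm_time;   /* accumulated seconds spent in GEMM */
    double gemm_flops;  /* accumulated floating-point operations */
} gemm_stats_t;

/* Run one GEMM and add its elapsed time and flops (2*n^3) to the stats. */
static void timed_gemm(int n, const double *A, const double *B, double *C,
                       gemm_stats_t *st)
{
    double t0 = wall_seconds();
    naive_dgemm(n, A, B, C);
    st->gemm_time  += wall_seconds() - t0;
    st->gemm_flops += 2.0 * n * n * n;
}
```

Calling `timed_gemm` in place of the bare GEMM call accumulates both numbers, so `gemm_flops / gemm_time` gives a rough flop rate. That ratio is also a quick way to compare the internal BLAS against an optimized one.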