Locking when using OpenBLAS with OpenMP
While implementing an algorithm, I ran into a question about how best to use OpenBLAS. I have several matrix multiplications inside an omp parallel section, and the resulting matrices need to be summed. Each thread therefore computes $C = C + A * B$ (i.e., the usual dgemm routine, with $C$ shared and $A$ and $B$ private to each thread).

Could you clarify whether OpenBLAS handles the synchronization on $C$ optimally here when the library is built with USE_OPENMP=1 USE_LOCKING=1? By "optimally" I mean that the sum in $C = C + A * B$ should be applied only after a block of $A * B$ has been computed, so that elements of $C$ are not updated more often than necessary. Is my understanding of the OpenBLAS implementation correct, or do I need to handle this synchronization myself? And if it would be better to study your code instead of asking such questions directly, just say so!
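To make the pattern in question concrete, here is a minimal sketch of doing the synchronization on the caller's side: each thread accumulates its products into a private buffer, and the shared $C$ is updated once per thread inside a critical section. It assumes, without confirmation from the OpenBLAS sources, that concurrent dgemm calls writing to the same $C$ are not serialized by the library. The names `sum_of_products` and `gemm_accumulate` are hypothetical, and the naive loop only stands in for `cblas_dgemm` with alpha = 1 and beta = 1 so that the example builds without linking OpenBLAS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for cblas_dgemm with alpha = 1, beta = 1: acc += A * B
   (row-major, n x n). Replace with the real call when linking OpenBLAS. */
static void gemm_accumulate(int n, const double *A, const double *B,
                            double *acc)
{
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            double a = A[(size_t)i * n + k];
            for (int j = 0; j < n; ++j)
                acc[(size_t)i * n + j] += a * B[(size_t)k * n + j];
        }
}

/* Hypothetical helper: C += sum over p of As[p] * Bs[p].
   As and Bs each hold num_pairs contiguous n x n matrices.
   Returns 0 on success, or -1 if a thread-local buffer could not be
   allocated (C is left untouched in that case). */
int sum_of_products(int n, int num_pairs, const double *As,
                    const double *Bs, double *C)
{
    assert(n > 0 && num_pairs >= 0);
    const size_t len = (size_t)n * (size_t)n;
    int failed = 0;

    #pragma omp parallel shared(failed)
    {
        /* Private, zero-initialized accumulator for this thread. */
        double *C_local = calloc(len, sizeof *C_local);
        if (C_local == NULL) {
            #pragma omp atomic write
            failed = 1;
        }
        /* Make sure every thread sees the same value of 'failed'. */
        #pragma omp barrier

        if (!failed) {
            /* Each thread multiplies its share of the pairs without
               touching the shared C. */
            #pragma omp for schedule(static)
            for (int p = 0; p < num_pairs; ++p)
                gemm_accumulate(n, As + p * len, Bs + p * len, C_local);

            /* One synchronized update of the shared C per thread. */
            #pragma omp critical
            for (size_t i = 0; i < len; ++i)
                C[i] += C_local[i];
        }
        free(C_local);
    }
    return failed ? -1 : 0;
}
```

With this layout the shared $C$ is written once per thread rather than once per product, at the cost of one extra n x n buffer per thread. The code also compiles without OpenMP enabled, in which case the pragmas are ignored and it runs sequentially.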