OpenBLAS
Error handling is poor
Error handling in OpenBLAS is currently poor and inconsistent. Some routines, such as gemm_driver, call exit when memory cannot be allocated; that's a no-no for library code. Other routines are even worse: they fail to check whether memory has been allocated and just continue, causing segfaults. Error messages may be printed, but sometimes they go to stdout and sometimes to stderr.
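For illustration, here is a minimal sketch of the pattern being asked for (this is not OpenBLAS code; the function name and sizes are made up): check the allocation, report consistently on stderr, and return an error status instead of calling exit so the caller can decide what to do.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical driver routine, loosely analogous to gemm_driver:
 * it allocates a per-thread job array and must not kill the process
 * if that allocation fails. */
static int run_driver(size_t nthreads)
{
    void *jobs = malloc(nthreads * 64);   /* 64 bytes per thread, invented size */
    if (jobs == NULL) {
        /* report once, consistently on stderr ... */
        fprintf(stderr, "run_driver: malloc of %zu bytes failed\n", nthreads * 64);
        return -1;                        /* ... and return an error, no exit() */
    }

    /* ... do the threaded work ... */

    free(jobs);
    return 0;                             /* success */
}
```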
Apple's BLAS has a SetBLASParamErrorProc function that allows installing a custom error handler. Implementing this API in OpenBLAS might be a good idea.
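For context, a rough sketch of what such a hook could look like. The callback typedef below mirrors Apple's documented BLASParamErrorProc shape; the openblas_-prefixed names are hypothetical and do not exist in OpenBLAS today.

```c
#include <stdio.h>
#include <stdlib.h>

/* Callback shape modeled on Apple's BLASParamErrorProc. */
typedef void (*blas_error_proc)(const char *func_name, const char *param_name,
                                const int *param_pos, const int *param_value);

/* Default behaviour: print to stderr and abort, much like xerbla does today. */
static void default_error_proc(const char *func, const char *param,
                               const int *pos, const int *value)
{
    fprintf(stderr, "%s: parameter %s (argument %d) had an illegal value (%d)\n",
            func, param, *pos, *value);
    exit(EXIT_FAILURE);
}

static blas_error_proc current_error_proc = default_error_proc;

/* Hypothetical public setter, analogous to SetBLASParamErrorProc. */
void openblas_set_param_error_proc(blas_error_proc proc)
{
    current_error_proc = proc ? proc : default_error_proc;
}

/* Internal helper the library would call instead of exiting directly. */
void openblas_param_error(const char *func, const char *param, int pos, int value)
{
    current_error_proc(func, param, &pos, &value);
}
```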
Thank you for the suggestion.
I tried fixing some unchecked malloc calls, but I got stuck because most functions are just declared as CNAME and the actual name is generated by the build system. Any hints on how to find out which function is which? (This was specifically in the threaded L3 BLAS code.)
In Makefile.system: -DCNAME=$(*F)

For example, for interface/gemm.c the rule in interface/Makefile is

cblas_sgemm.$(SUFFIX) cblas_sgemm.$(PSUFFIX) : gemm.c ../param.h
	$(CC) -DCBLAS -c $(CFLAGS) $< -o $(@F)

so here CNAME=cblas_sgemm.
Xianyi
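For anyone hitting the same wall: the name substitution happens entirely in the preprocessor. Below is a small self-contained sketch of the pattern (the file and symbol names are invented for the demo, not taken from OpenBLAS).

```c
/* cname_demo.c - illustrates the -DCNAME pattern: one source file is compiled
 * many times, each time with a different -DCNAME=<symbol>, so the same code
 * body ends up as sgemm_, cblas_sgemm, and so on.
 *
 * Build, for example, with:   cc -DCNAME=cblas_sgemm_demo cname_demo.c
 */
#include <stdio.h>

#ifndef CNAME
#define CNAME default_name           /* only used if no -DCNAME is passed */
#endif

/* Helper macros to turn the chosen symbol name into a printable string. */
#define STR_(x) #x
#define STR(x)  STR_(x)

void CNAME(void)                     /* expands to e.g. cblas_sgemm_demo(void) */
{
    printf("this function was compiled as %s\n", STR(CNAME));
}

int main(void)
{
    CNAME();                         /* the call site expands the same way */
    return 0;
}
```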
I got OpenBLAS: malloc failed in gemm_driver when using 80 threads. What is the maximum number of threads I can use without running into these malloc issues?
Strange - what is your hardware, and how much memory do you have? The job array in gemm_driver is only on the order of 20 bytes per thread, and I do not recall this ever having been an issue.
This is my hardware (it's a cluster in a university).
Type: Silicon Mechanics Rackform R2504.V6
OS: Ubuntu 18.04 LTS
Processor: 2x 20-core Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
RAM: 512 GB
Hmm, that hardware looks normal enough (though I'm not sure I have anything similar enough to reproduce the problem). Does it happen with a reasonably recent OpenBLAS, and with some publicly available source?
I use Julia 1.9 with Flux 0.14.6; I don't actually know what version of OpenBLAS that uses. I don't have an MWE because I have only gotten this error once, but the code I was running was training a deep NN, if that helps at all.
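As an aside, if the linked OpenBLAS exports the usual query functions (a reasonable assumption for recent builds, but an assumption nonetheless), its build string and thread settings can be checked with a few lines of C linked against it (e.g. with -lopenblas):

```c
#include <stdio.h>

/* OpenBLAS extension API, declared in the cblas.h shipped with OpenBLAS. */
extern char *openblas_get_config(void);       /* build options and version string */
extern int   openblas_get_num_threads(void);  /* threads OpenBLAS will use */
extern int   openblas_get_num_procs(void);    /* cores detected at runtime */

int main(void)
{
    printf("config : %s\n", openblas_get_config());
    printf("threads: %d (of %d detected cores)\n",
           openblas_get_num_threads(), openblas_get_num_procs());
    return 0;
}
```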
I do not know Flux, but both version numbers appear to be current, so I'm assuming your OpenBLAS is too. The best I could do at this point is have the error message report exactly how much memory it was trying to allocate - but it cannot have been much, and I see no practical way of continuing with the GEMM call once that happens. The vast majority of OpenBLAS' memory requirement is in the memory buffer, which is sized at build time according to the maximum number of threads supported by that build. Could it be that your ML code itself was running in parallel and making concurrent calls into OpenBLAS at the time (so effectively N*80 threads)?
I really doubt that. When I set the number of threads to 8 (I have not tried a higher number), I do not get this error. I will report here if I see it again.