
Error handling is poor

Open larsmans opened this issue 12 years ago • 10 comments

Error handling in OpenBLAS is currently poor and inconsistent. Some routines, such as gemm_driver, call exit when memory cannot be allocated; that's a no-no for library code. Other routines are even worse: they fail to check whether memory has been allocated and just continue, causing segfaults. Error messages may be printed, but sometimes they go to stdout and sometimes to stderr.
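
To make the complaint concrete, here is a minimal sketch of the pattern I would expect from library code: check the allocation, report on stderr, and return a status to the caller instead of calling exit. All names here (demo_driver, DEMO_OK, DEMO_ENOMEM) are hypothetical and are not part of OpenBLAS.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes; OpenBLAS does not define these. */
enum { DEMO_OK = 0, DEMO_ENOMEM = -1 };

/* Hypothetical driver: allocates a per-thread job array and reports
 * failure to the caller rather than calling exit() or using a NULL
 * pointer. */
int demo_driver(size_t nthreads, size_t bytes_per_job)
{
    /* Reject sizes whose product would overflow size_t. */
    if (bytes_per_job != 0 && nthreads > SIZE_MAX / bytes_per_job) {
        fprintf(stderr, "demo_driver: allocation size overflows\n");
        return DEMO_ENOMEM;
    }

    size_t total = nthreads * bytes_per_job;
    void *jobs = malloc(total);
    if (jobs == NULL && total != 0) {
        /* Always stderr, never stdout, and say how much was requested. */
        fprintf(stderr, "demo_driver: malloc of %zu bytes failed\n", total);
        return DEMO_ENOMEM;
    }

    /* ... the actual computation would use the job array here ... */

    free(jobs);
    return DEMO_OK;
}
```

The caller can then decide whether to abort, retry with fewer threads, or propagate the error; the library itself never terminates the process.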

Apple's BLAS has a SetBLASParamErrorProc function that allows installing a custom error handler. Implementing this API in OpenBLAS might be a good idea.
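
As a rough illustration of what an installable error handler could look like, here is a sketch. It is only loosely modeled on the idea behind SetBLASParamErrorProc and does not reproduce Apple's signature; every name in it (demo_error_proc, demo_set_error_proc, demo_report_error, recording_proc) is an assumption made for this example.

```c
#include <stdio.h>

/* Hypothetical callback type: receives the routine name and a message. */
typedef void (*demo_error_proc)(const char *routine, const char *message);

/* Default behaviour: print to stderr and let the caller carry on. */
static void default_error_proc(const char *routine, const char *message)
{
    fprintf(stderr, "%s: %s\n", routine, message);
}

static demo_error_proc current_proc = default_error_proc;

/* Install a handler; passing NULL restores the default.
 * Returns the previously installed handler. */
demo_error_proc demo_set_error_proc(demo_error_proc proc)
{
    demo_error_proc previous = current_proc;
    current_proc = proc ? proc : default_error_proc;
    return previous;
}

/* What library routines would call instead of printf() or exit(). */
void demo_report_error(const char *routine, const char *message)
{
    current_proc(routine, message);
}

/* Example application-side handler: records the last error in a buffer. */
char last_error[128];

void recording_proc(const char *routine, const char *message)
{
    snprintf(last_error, sizeof last_error, "%s: %s", routine, message);
}
```

An application would call demo_set_error_proc(recording_proc) once at startup. Note that this sketch keeps the handler in a plain global, so it is not thread-safe as written; a real implementation would need to decide how handlers interact with threaded BLAS calls.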

larsmans avatar Sep 05 '13 13:09 larsmans

Thank you for the suggestion.

xianyi avatar Sep 05 '13 15:09 xianyi

I tried fixing some unchecked malloc calls, but I got stuck in the code as most functions are just declared CNAME and the name is generated by the build system. Any hints on how to find out which function is which? (This was specifically in the threaded L3 BLAS code.)

larsmans avatar Sep 05 '13 15:09 larsmans

Makefile.system passes -DCNAME=$(*F) to the compiler, so CNAME expands to the name of the object being built.

For example, take interface/gemm.c.

In interface/Makefile:

cblas_sgemm.$(SUFFIX) cblas_sgemm.$(PSUFFIX) : gemm.c ../param.h                   
        $(CC) -DCBLAS -c $(CFLAGS) $< -o $(@F) 

Here, CNAME=cblas_sgemm.
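
The mechanism can be reproduced in a small standalone file. This is a sketch, not code from OpenBLAS: the fallback name demo_sgemm and the STR helper macros are assumptions made so the file compiles without the build system.

```c
#include <stdio.h>

/* The build system would normally pass -DCNAME=<name> on the command
 * line; this fallback only exists so the sketch compiles on its own. */
#ifndef CNAME
#define CNAME demo_sgemm
#endif

/* Two-level stringification so that CNAME is expanded before quoting. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* After preprocessing, this defines a function called demo_sgemm
 * (or whatever name was supplied via -DCNAME). */
const char *CNAME(void)
{
    return STR(CNAME);
}
```

To see which function a given source file becomes, running only the preprocessor with the same define (for example cc -E -DCNAME=cblas_sgemm on the file) or listing the symbols of the resulting object file with nm should show the generated name.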

Xianyi

xianyi avatar Sep 05 '13 15:09 xianyi

I got OpenBLAS: malloc failed in gemm_driver when using 80 threads. What is the maximum number of threads I can use without running into these malloc issues?

Vilin97 avatar Oct 20 '23 18:10 Vilin97

Strange - what is your hardware, and how much memory do you have? The job array in the gemm driver is only on the order of 20 bytes per thread, and I do not recall this ever having been an issue.

martin-frbg avatar Oct 20 '23 20:10 martin-frbg

This is my hardware (it's a cluster in a university).

Type: Silicon Mechanics Rackform R2504.V6
OS: Linux Ubuntu 18.04 LTS
Processor: x2 20-core Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
RAM: 512 GB

Vilin97 avatar Oct 25 '23 00:10 Vilin97

Hmm, that hardware looks normal enough (though I'm not sure if I have something similar enough to reproduce the problem). Does it happen with a reasonably recent OpenBLAS, and with some publicly available source?

martin-frbg avatar Oct 25 '23 02:10 martin-frbg

I use Julia 1.9 with Flux 0.14.6. I don't actually know which version of OpenBLAS I have. I don't have an MWE because I have only gotten this error once, but the code I was running was training a deep NN, if that helps at all.

Vilin97 avatar Oct 30 '23 22:10 Vilin97

I do not know Flux, but both version numbers appear to be current so I'm assuming your OpenBLAS is too. The best I could do at this point is have the error message report exactly how much memory it was trying to allocate at that point - but it cannot have been much, and I see no practical way of continuing with the GEMM call once it happens. The vast majority of OpenBLAS' memory requirement is in the memory buffer, sized at build time according to the maximum number of threads supported by that build. Would it be possible that your ML code itself was running in parallel, and making concurrent calls into OpenBLAS at the time (so effectively N*80 threads)?

martin-frbg avatar Oct 31 '23 08:10 martin-frbg

I really doubt that. When I set the number of threads to 8 (I have not tried a higher number) I do not get this error. I will report here if I see this error again.

Vilin97 avatar Nov 02 '23 05:11 Vilin97