Feature request for BLIS error handling...

Open jdiamondGitHub opened this issue 4 years ago • 3 comments

BLIS currently handles user errors by printing an error message to the console and then signaling an abort. This is reasonable when a programmer is calling BLIS directly. However, we use BLIS inside our own libraries, which are then part of applications, some of which are called from other languages and frameworks as behind-the-scenes acceleration. So when a BLIS error happens, (1) there is no console on which to see the error, and (2) we can't hook the abort, so not only does BLIS go down, but the entire application framework crashes, since it is all part of the same user process. Since this is a mission-critical business application, we can't allow the server to potentially crash.

Our feature request is for an alternate error mode in BLIS that returns an error to the code calling BLIS, so that the higher-level code can deal with it and the system can continue running. A poor man's approach would be to provide an error callback that we could use to implement our own global error checking system. This would put the most work on us, but would have the advantage that everyone could define whatever error reporting worked best for them.
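To make the callback idea concrete, here is a minimal sketch of what such a hook could look like. Every name in it (`hook_set_error_callback`, `hook_scal`, `app_on_error`, and so on) is hypothetical and made up for illustration; none of it is existing BLIS API. It also assumes a single process-wide handler, which would need more care in multithreaded code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical names throughout; nothing here is an existing BLIS function. */
typedef void (*hook_error_cb_t)(int code, const char *msg, void *user);

static hook_error_cb_t hook_cb = NULL;   /* NULL keeps the print-and-abort behavior */
static void *hook_user = NULL;

/* Install (or, with cb == NULL, remove) the application's error handler. */
void hook_set_error_callback(hook_error_cb_t cb, void *user)
{
    hook_cb = cb;
    hook_user = user;
}

/* Library-internal error path: call the handler if one is installed,
   otherwise print a message and abort as before. */
static void hook_report_error(int code, const char *msg)
{
    if (hook_cb != NULL) {
        hook_cb(code, msg, hook_user);
        return;
    }
    fprintf(stderr, "error %d: %s\n", code, msg);
    abort();
}

/* Toy stand-in for a library routine: scales x[0..n-1] by alpha. */
void hook_scal(int n, double alpha, double *x)
{
    if (n < 0 || (n > 0 && x == NULL)) {
        hook_report_error(-1, "hook_scal: invalid argument");
        return;   /* skip the operation instead of taking the process down */
    }
    for (int i = 0; i < n; ++i)
        x[i] *= alpha;
}

/* Application side: record errors in a struct owned by the application. */
typedef struct {
    int count;
    int last_code;
} app_error_log_t;

void app_on_error(int code, const char *msg, void *user)
{
    app_error_log_t *errs = (app_error_log_t *)user;
    (void)msg;   /* a real handler would probably log the message somewhere */
    errs->count += 1;
    errs->last_code = code;
}
```

The application would call `hook_set_error_callback(app_on_error, &errs)` once at startup and inspect `errs` after each call. The `user` pointer is there so the handler does not need its own global state.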

An even better solution (from our point of view) would be to provide some kind of standard, non-blocking way for an application to check after a BLIS call whether an error has occurred and, if so, what the error was. The cleanest way would be to return an error code from the BLIS call, but that would change the API, so a more awkward version would be to provide a second function that you call to determine whether the BLIS call succeeded, and if not, what went wrong.
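The "second function" variant could look roughly like the `errno`-style sketch below. All names (`chk_last_error`, `chk_error_string`, `chk_dot`, the `CHK_*` codes) are hypothetical placeholders, not BLIS API, and the per-thread status variable assumes a C11 compiler with `_Thread_local` support.

```c
#include <stddef.h>

/* Hypothetical error codes and functions; not part of BLIS. */
typedef enum {
    CHK_SUCCESS = 0,
    CHK_INVALID_DIM = 1,
    CHK_NULL_POINTER = 2
} chk_err_t;

/* One status slot per thread, so concurrent callers do not overwrite
   each other's result. */
static _Thread_local chk_err_t chk_last = CHK_SUCCESS;

/* Query the outcome of the most recent library call on this thread. */
chk_err_t chk_last_error(void)
{
    return chk_last;
}

/* Translate an error code into a human-readable string. */
const char *chk_error_string(chk_err_t e)
{
    switch (e) {
    case CHK_SUCCESS:      return "success";
    case CHK_INVALID_DIM:  return "invalid dimension";
    case CHK_NULL_POINTER: return "null pointer argument";
    }
    return "unknown error";
}

/* Toy stand-in for a library routine: dot product of x and y.
   The signature carries no error code; failures are recorded instead. */
double chk_dot(int n, const double *x, const double *y)
{
    chk_last = CHK_SUCCESS;
    if (n < 0) {
        chk_last = CHK_INVALID_DIM;
        return 0.0;
    }
    if (n > 0 && (x == NULL || y == NULL)) {
        chk_last = CHK_NULL_POINTER;
        return 0.0;
    }
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

A caller would invoke `chk_dot(...)` and then test `chk_last_error() != CHK_SUCCESS` before trusting the result. The existing function signatures stay untouched, at the cost of the caller having to remember the extra check.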

Thanks for your thoughts on what kind of alternate error system would be most generally useful to the community.

- Jeff

jdiamondGitHub avatar Jan 29 '21 19:01 jdiamondGitHub

Thanks for your input Jeff. I'll start thinking seriously about how best to accommodate your application's needs.

So far, I'm partial to using proper error return codes from user-level functions, even though (a) it will take more work and (b) it will break the API. The breakage doesn't scare me because it only affects the return values. But practically speaking, this may not be the best route. I'll begin assessing how much work it would take.

@devinamatthews What happens when a program that calls func() is compiled according to a prototype that suggests it returns void, but is then linked to an implementation that actually returns an integer? Is the integer return value merely ignored?

fgvanzee avatar Feb 01 '21 23:02 fgvanzee

@fgvanzee it's harmless. The return value will either be in a (callee-owned) register or on the callee stack, which the calling code will ignore either way. The ABI is also unaffected.

devinamatthews avatar Feb 01 '21 23:02 devinamatthews

I think I've come up with a feasible plan to overhaul the way errors are handled that will both preserve the status quo (as an option) as well as provide Jeff with his preferred solution.

Under the changes I envision, BLIS will provide two options, both of which can be changed at runtime (with the initial default for each set at configure-time):

  • error handling mode. This is how BLIS reacts to an error. When the handling mode is set to "return," BLIS will return an error code up the stack to the caller. This error code can then be interpreted with the help of one or more additional BLIS functions that can, for example, return a string describing the error given the error code as input. When the mode is set to "abort," BLIS will behave as it currently does, aborting immediately upon detecting any error.
  • error checking level. This determines the scope of the errors BLIS checks for. This feature is more or less already implemented in BLIS, although currently there are only two levels: "full" error checking and no error checking. Notice that this is orthogonal to the handling mode, which only kicks in when errors are detected. If the error checking level is set to "none," then functions that return error codes will generally return BLIS_SUCCESS regardless of whether any errors would have been detected (since BLIS didn't bother to actually test any error conditions), and similarly under those conditions BLIS should never abort.
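As a rough illustration of how the two settings would interact, here is a small sketch. The names (`mode_set_handling_mode`, `mode_set_checking_level`, `mode_copy`, the `MODE_*` constants) are invented for this example and are not the proposed BLIS API; only the behavior follows the description above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical types and names; a sketch of the two runtime options. */
typedef enum { MODE_SUCCESS = 0, MODE_FAILURE = -1 } mode_err_t;
typedef enum { MODE_ON_ERROR_ABORT, MODE_ON_ERROR_RETURN } mode_handling_t;
typedef enum { MODE_CHECK_NONE, MODE_CHECK_FULL } mode_checking_t;

/* Initial defaults; in the plan above these would be chosen at configure-time. */
static mode_handling_t mode_handling = MODE_ON_ERROR_ABORT;
static mode_checking_t mode_checking = MODE_CHECK_FULL;

void mode_set_handling_mode(mode_handling_t m)  { mode_handling = m; }
void mode_set_checking_level(mode_checking_t l) { mode_checking = l; }

/* Reached only after a check has detected an error. */
static mode_err_t mode_handle_error(mode_err_t e, const char *msg)
{
    if (mode_handling == MODE_ON_ERROR_ABORT) {
        fprintf(stderr, "%s\n", msg);
        abort();
    }
    return e;   /* "return" mode: pass the code up the stack */
}

/* Toy stand-in for a library routine: copies x[0..n-1] into y. */
mode_err_t mode_copy(int n, const double *x, double *y)
{
    /* The checking level decides whether error conditions are tested at all. */
    if (mode_checking == MODE_CHECK_FULL) {
        if (n < 0)
            return mode_handle_error(MODE_FAILURE, "mode_copy: negative length");
    }
    for (int i = 0; i < n; ++i)
        y[i] = x[i];
    return MODE_SUCCESS;   /* also what a skipped check ends up reporting */
}
```

With checking set to `MODE_CHECK_NONE`, `mode_copy(-1, x, y)` reports `MODE_SUCCESS` and never aborts, because the condition is never tested. With full checking, the same call either aborts or returns `MODE_FAILURE`, depending on the handling mode.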

fgvanzee avatar Feb 04 '21 20:02 fgvanzee