caffeine
Implement graceful error handling for allocation failures
In the near term, Caffeine will likely treat most errors as immediately fatal (ideally with a high-quality diagnostic message when it crashes).
However, one particularly important error that IMO should not get this treatment is memory allocation failure. Unlike hardware failure, running out of memory is a common condition when scaling up problems in real production science, and a production-quality runtime needs to handle it robustly. It's even plausible that some applications might perform non-trivial recovery from an allocation failure.
`prif_allocate` and `prif_allocate_non_symmetric` currently ignore the possibility of errors, and I suspect they crash in obscure ways upon memory exhaustion. IMO these two calls should be fixed to strictly adhere to Fortran error-handling semantics, specifically with respect to returning a meaningful `stat` and `errmsg` (when provided) or crashing with a useful console message (when not provided). Ideally, the error message in either case should include status information about the initial and current state of the shared heaps, along with recommendations to the end user on how to resolve the problem.
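The requested behavior (report a status when `stat` is present, otherwise terminate with a diagnostic that describes the heap) can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `sketch_heap`, `sketch_allocate`, and the `SKETCH_STAT_*` codes are hypothetical names, not the PRIF signatures or Caffeine internals, and a simple bump allocator stands in for the real shared-heap allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes; real values would come from PRIF/Caffeine. */
enum { SKETCH_STAT_OK = 0, SKETCH_STAT_OUT_OF_MEMORY = 1 };

/* Hypothetical model of one image's shared heap: a fixed-capacity bump
   allocator (no alignment handling) standing in for the real allocator. */
typedef struct {
    unsigned char *base;  /* start of the shared segment */
    size_t capacity;      /* heap size at startup, in bytes (initial state) */
    size_t used;          /* bytes currently allocated (current state) */
} sketch_heap;

/* Allocate `size` bytes from `heap`.
   - Success: returns a pointer; sets *stat to 0 if stat is non-NULL.
     errmsg is left untouched, as Fortran requires when no error occurs.
   - Failure with stat present: returns NULL, sets *stat to a nonzero code,
     and writes a diagnostic into errmsg if it is non-NULL (truncated to fit).
   - Failure with stat absent: prints the diagnostic and terminates,
     mimicking Fortran error termination when no STAT= is supplied. */
void *sketch_allocate(sketch_heap *heap, size_t size,
                      int *stat, char *errmsg, size_t errmsg_len) {
    assert(heap != NULL);

    /* Invariant: used <= capacity, so this subtraction cannot underflow. */
    if (size <= heap->capacity - heap->used) {
        void *p = heap->base + heap->used;
        heap->used += size;
        if (stat != NULL) *stat = SKETCH_STAT_OK;
        return p;
    }

    /* Build a diagnostic that reports the initial and current heap state
       plus a generic remediation hint. */
    char msg[256];
    snprintf(msg, sizeof msg,
             "shared heap exhausted: requested %zu bytes, "
             "%zu of %zu bytes in use (%zu free). "
             "Consider increasing the shared heap size or reducing "
             "coarray allocations.",
             size, heap->used, heap->capacity, heap->capacity - heap->used);

    if (stat == NULL) {
        fprintf(stderr, "ERROR: %s\n", msg);
        exit(EXIT_FAILURE);  /* stand-in for error termination */
    }

    *stat = SKETCH_STAT_OUT_OF_MEMORY;
    if (errmsg != NULL && errmsg_len > 0) {
        snprintf(errmsg, errmsg_len, "%s", msg);
    }
    return NULL;
}
```

A failed request leaves the heap unchanged, so a caller that checked `stat` can free memory and retry. A real implementation would additionally need to terminate all images (not just the calling process) in the no-`stat` case, and to blank-pad `errmsg` as Fortran character variables expect rather than NUL-terminating it.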