gflownet
Make multiprocessing terminate gracefully
The current multiprocessing/threading routines are never explicitly stopped; they only stop when the objects that own them are garbage collected. This sometimes produces aesthetically displeasing logs in which every thread reports errors at shutdown.
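For illustration, here is a minimal sketch of what "explicitly stopped" could look like for a background thread. `StoppableWorker`, its `tasks` queue, and the doubling "work" are all made up for this example and are not taken from the gflownet code; the only point is the stop flag plus `join()` pattern.

```python
import queue
import threading


class StoppableWorker:
    """Hypothetical worker that is stopped explicitly instead of relying on GC."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.results = []
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Poll with a short timeout so the loop notices the stop flag promptly.
        while not self._stop_event.is_set():
            try:
                item = self.tasks.get(timeout=0.05)
            except queue.Empty:
                continue
            self.results.append(item * 2)  # placeholder for real work
            self.tasks.task_done()

    def stop(self, timeout=5.0):
        # Ask the loop to exit, then wait for the thread to actually finish.
        self._stop_event.set()
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

The owner would call `stop()` from its own shutdown path (or a `try/finally`), so the thread exits before the interpreter starts tearing objects down.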
Addressed by #117 and #116. I will leave this up as a reminder to run tests but most problems on this front are presumably solved.
I did my best to flush all the queues, but I still think it freezes on rare occasions on Beluga (Compute Canada/Calcul Québec). I do not understand why it happens only on Beluga and not on Cedar, Narval, or Mila's cluster. I have this snippet of code that I'd use if running jobs on the clusters :upside_down_face:
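import os
import signal

def haragiri(signum, frame):
    os.kill(os.getpid(), signal.SIGTERM)  # kill this process if it is still running

signal.signal(signal.SIGALRM, haragiri)
signal.alarm(10 * 60)  # deliver SIGALRM after 10 minutes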
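If the alarm is only meant to guard against a hung shutdown (that is my assumption here, not something stated above), the same idea can be wrapped so the alarm is cancelled when the guarded step finishes normally. `watchdog` is a hypothetical helper name; like the snippet above it relies on `SIGALRM`, so it is Unix-only and must be used from the main thread.

```python
import contextlib
import os
import signal


@contextlib.contextmanager
def watchdog(seconds):
    """Send SIGTERM to this process if the guarded block runs longer than `seconds`."""

    def _on_alarm(signum, frame):
        os.kill(os.getpid(), signal.SIGTERM)

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

Usage would be something like `with watchdog(10 * 60): ...` around whatever cleanup step tends to freeze.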
Another thing to consider is making the code work with other process start methods. AFAIK calling set_start_method with spawn or forkserver does not currently work, even though those are the "recommended" ways of starting new processes.
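As a rough sketch of how one could try a different start method without changing the global default, `multiprocessing.get_context` returns a context object with its own `Process` and `Queue`. The helper names below (`pick_context`, `_double`, `run_in_child`) are invented for this example and say nothing about why the gflownet code fails under spawn.

```python
import multiprocessing as mp


def pick_context(method="spawn"):
    """Return a multiprocessing context for `method`, or raise if unavailable."""
    if method not in mp.get_all_start_methods():
        raise ValueError(f"start method {method!r} is not available on this platform")
    return mp.get_context(method)


def _double(x, out):
    # Module-level function so it can be pickled and sent to a spawned child.
    out.put(2 * x)


def run_in_child(x, method="spawn"):
    """Run `_double(x)` in a child process started with the given method."""
    ctx = pick_context(method)
    out = ctx.Queue()
    proc = ctx.Process(target=_double, args=(x, out))
    proc.start()
    result = out.get(timeout=30)  # read before join so the queue is drained
    proc.join()
    return result
```

With spawn or forkserver, `run_in_child(21)` has to be called from under an `if __name__ == "__main__":` guard in a real script, because the child re-imports the main module.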