
BrokenProcessPool on run.cpu_bound

Open gotev opened this issue 2 years ago • 4 comments

Description

I'm trying to execute multiple parallel tasks which perform CPU-bound work, so I'm using run.cpu_bound.

I expect a failed task not to cause the others to fail as well. What actually happens is that if a task makes its worker process crash (not infrequent when launching C/C++ apps that error out or run into OOM issues), a BrokenProcessPool exception is raised for all subsequent tasks. The pool becomes unusable for the rest of the NiceGUI app's execution, until the next restart. That's just how Python's standard ProcessPoolExecutor works. Excerpt from the docs: https://docs.python.org/3/library/concurrent.futures.html

initializer is an optional callable that is called at the start of each worker process; initargs is a tuple of arguments passed to the initializer. Should initializer raise an exception, all currently pending jobs will raise a BrokenProcessPool, as well as any attempt to submit more jobs to the pool.
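The breakage is independent of NiceGUI and can be reproduced with the standard library alone. A minimal sketch (crash_worker is a made-up name for illustration):

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def crash_worker() -> None:
    os._exit(1)  # kill the worker process abruptly, bypassing cleanup

if __name__ == '__main__':
    pool = ProcessPoolExecutor(max_workers=1)
    try:
        pool.submit(crash_worker).result()
    except BrokenProcessPool as e:
        print(f'first task: {e}')
    # Every later submission now fails too; the pool never recovers.
    try:
        pool.submit(print, 'never runs').result()
    except BrokenProcessPool as e:
        print(f'second task: {e}')
```

The first failure marks the executor as broken, and every subsequent submit raises BrokenProcessPool immediately, which is exactly the behavior observed through run.cpu_bound below.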

import os

from nicegui import ui, run

def crash_process():
    # Cause the process to exit with a non-zero exit status
    os._exit(1)

def fine_process():
    print('Hey this should be printed')

async def on_long_operations():
    try:
        print('First task')
        await run.cpu_bound(crash_process)
    except Exception as e:
        print(f'First task error: {e}')

    try:
        print('Second task')
        await run.cpu_bound(fine_process)
    except Exception as e:
        print(f'Second task error: {e}')

    try:
        print('Third task')
        await run.cpu_bound(crash_process)
    except Exception as e:
        print(f'Third task error: {e}')

ui.button('Do the Job', on_click=on_long_operations)

ui.run()

which outputs:

NiceGUI ready to go on http://localhost:8080
First task
First task error: A process in the process pool was terminated abruptly while the future was running or pending.
Second task
Second task error: A child process terminated abruptly, the process pool is not usable anymore
Third task
Third task error: A child process terminated abruptly, the process pool is not usable anymore

From a bit of searching, I've seen some projects employ custom logic and others completely re-implement the pool.

  • One tactic is to intercept the BrokenProcessPool exception and then recreate the pool. It's more of a workaround, and there are edge cases to handle, such as tasks launched while the pool is restarting, and tasks already running or pending in the pool when one of its processes crashes.
  • Another one is to use a library like deadpool.
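The restart tactic could look roughly like the following sketch (resilient_cpu_bound and _get_pool are hypothetical names, not NiceGUI API; it deliberately ignores the concurrent-restart edge cases mentioned above):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

_pool = None  # lazily created, replaced when broken

def _get_pool() -> ProcessPoolExecutor:
    global _pool
    if _pool is None:
        _pool = ProcessPoolExecutor()
    return _pool

async def resilient_cpu_bound(func, *args):
    """Run `func` in a worker process; recreate the pool if it breaks."""
    global _pool
    loop = asyncio.get_running_loop()
    try:
        return await loop.run_in_executor(_get_pool(), func, *args)
    except BrokenProcessPool:
        # The crashed task still fails, but the pool is discarded so that
        # later calls get a fresh one instead of inheriting the broken state.
        if _pool is not None:
            _pool.shutdown(wait=False)
        _pool = None
        raise
```

Note that `func` must still be a picklable top-level function, as usual with ProcessPoolExecutor, and tasks that were already pending when the pool broke fail along with the crashed one.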

Note: One can always employ a custom solution in a case like this, but I thought it would be useful to share this problem with the community and decide what to do about scenarios like this. The framework is pretty solid and the process pools are well integrated with the app lifecycle, so IMHO finding a solution to this can only improve the quality of the framework and the apps built with it.

gotev avatar Mar 27 '24 19:03 gotev

Yes, a more robust solution would be awesome. But I'm not sure what the best way forward is...

rodja avatar Mar 28 '24 06:03 rodja

@rodja it's a tough one, so I propose we start by reasoning about the use cases and only think about a solution once we are confident about the cases we want to cover.

Some thoughts and ideas to start with:

  • when a single running process crashes, it should not crash or stop the others, whether scheduled or running, and should not prevent new ones from being scheduled
  • a way to cancel a single running process, i.e. getting its "handle" when a cpu_bound task is scheduled
  • a way to define a group of processes and be able to also cancel the whole group at once. One way could be introducing an additional parameter, like run.cpu_bound('groupId', function, args)
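To illustrate the handle/group idea, here is a rough sketch (cpu_bound_in_group and cancel_group are hypothetical names, not NiceGUI API). It uses one multiprocessing.Process per task instead of a shared pool, since ProcessPoolExecutor offers no way to kill an individual running task:

```python
import asyncio
import multiprocessing as mp

_groups = {}  # group id -> set of live Process handles

def _worker(conn, func, args):
    # Runs in the child process; the result travels back over the pipe.
    conn.send(func(*args))

async def cpu_bound_in_group(group_id, func, *args):
    """Run `func` in its own process, tracked under `group_id`."""
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=_worker, args=(child_conn, func, args))
    proc.start()
    _groups.setdefault(group_id, set()).add(proc)
    try:
        # Poll without blocking the event loop; a crashed or terminated
        # worker simply leaves no result in the pipe.
        while proc.is_alive():
            await asyncio.sleep(0.01)
        if not parent_conn.poll():
            raise RuntimeError(f'worker exited with code {proc.exitcode}')
        return parent_conn.recv()
    finally:
        _groups[group_id].discard(proc)

def cancel_group(group_id):
    """Terminate every process still running under `group_id`."""
    for proc in _groups.get(group_id, set()).copy():
        proc.terminate()
```

Because each task has its own process, a crash only affects that task, and terminate() gives both the per-task handle and the group cancellation proposed above. The trade-off is losing the pool's worker reuse, so per-task startup cost goes up.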

gotev avatar Mar 28 '24 09:03 gotev

Sounds like a good approach. I think we should first get #2234 to work, so we have a basis for testing and verifying that everything works.

rodja avatar Apr 08 '24 09:04 rodja