
Do not join threads during process exit on Windows

Open zcbenz opened this issue 11 months ago • 14 comments

On Windows the process hangs on exit after running some ops. After some debugging I found that it was stuck at the synchronize call in ~StreamThread, because the thread had already been terminated by the OS on process exit.

C++ does not seem to specify when threads are terminated during process exit, but at least on Windows the destructors of static objects are called after the stream threads have been terminated, so posting tasks in ~StreamThread no longer works and the process hangs.

A simple fix is to just let the OS collect the threads, which is what this PR does. If you prefer a clean exit, we can have the language bindings destroy the Scheduler in their exit handlers, which are called before the actual process exit.

zcbenz avatar Dec 30 '24 04:12 zcbenz

but at least on Windows the destructors of static objects are called after the stream threads get terminated

You're saying StreamThread is deleted before Scheduler is deleted. Then in the Scheduler destructor it tries to delete the StreamThread, which joins on the thread, and that causes the bug? What if we check thread.joinable in the destructor of StreamThread? Would it be false?

awni avatar Dec 31 '24 14:12 awni

Yeah, the native OS thread got terminated on exit before the std::thread got destroyed. In my tests, once the main() function returned, all threads behaved as if they had received exit(0), and when I used the win32 API to check their liveness, they were reported as exited.

    // Ask the OS for the thread's actual status; it reports as already exited.
    DWORD code = -1;
    GetExitCodeThread(thread.native_handle(), &code);
    fprintf(stderr, "exit code: %d\n", (int)code);

thread.joinable() still returns true in this case; I think it is just a soft flag and does not reflect the actual status of the thread.

zcbenz avatar Dec 31 '24 23:12 zcbenz

I accidentally closed the PR by deleting the branch. Reopening, since it is still an issue and it blocks the Windows build from being actually useful: the process always hangs on exit.

zcbenz avatar Feb 11 '25 23:02 zcbenz

@awni I'm also seeing the same problem in the cuda and webgpu backends.

The scheduler synchronizes on exit, which implies the device is still alive when ~Scheduler is called. That only happens to be true in the metal backend because of the order of the device()/scheduler() calls.

For backends where the device happens to get destroyed before the scheduler, a crash happens on exit, and the only way to work around it is to leak the device on exit. Interestingly, PyTorch does this for similar reasons, but I think leaking the scheduler would be a cleaner solution.

A more robust way to solve this is to introduce a single instance that manages the lifetimes of global objects like the device and the scheduler, though I think that is overkill here.

zcbenz avatar Mar 21 '25 02:03 zcbenz

Ok.. let's try and fix this.

I'm pretty sure the scheduler's destructor has the synchronize in there for a reason. It may not be relevant anymore.. but I remember adding that synchronize because we were having issues exiting Python programs cleanly. Let me see if I can dig up the history a bit.

awni avatar Mar 21 '25 02:03 awni

I tried running a few jobs with this change and so far so good. I can't quite remember why I put the synchronize in the stream destructors (https://github.com/ml-explore/mlx/pull/1006). Idk.. maybe @angeloskath you remember about that?

Either way things have changed a lot since then.. we do a lot more of the task submission on the main thread now so possibly that makes the synchronize unnecessary. I don't have any reason not to land this.

awni avatar Mar 21 '25 03:03 awni

Hmm, I think a possible crash is when running a slow computation and the user presses Ctrl-C; then the device could be destroyed while a kernel is still running.

We can probably call synchronize() in the Py_AtExit hook to solve this, which ensures all tasks finish running before the process exits. What do you think?

zcbenz avatar Mar 21 '25 05:03 zcbenz

I'm not so keen on doing this in the Python exit hooks. Every other front-end would have the same issue and then we have to add custom deinits for every front-end which is both tedious and brittle. I would much prefer to solve this cleanly in the C++ core for all front-ends.

awni avatar Mar 21 '25 13:03 awni

The scheduler synchronizes on exit, and it implies the device is still alive when ~Scheduler is called, which only happens to be true in metal backend because of the sequence of device()/scheduler() calls.

Maybe it is best to ensure the device is always destroyed after the scheduler.. if you use a function-local static like we do elsewhere, it should be sufficient to ensure it gets constructed before the scheduler.

awni avatar Mar 21 '25 13:03 awni

Btw the reason the device is always initialized before the scheduler is a bit subtle (and may also need fixing).

Right now we provide a default PRNG key in MLX. The default key is initialized when you start MLX, before doing any ops -> the key is an array -> the array has data, which means it makes a call to the allocator -> the allocator creates the device.

You can see in Python where it's constructed.

It may make sense to force the device to be active before the scheduler since it seems like we rely on this.

awni avatar Mar 21 '25 14:03 awni

In fact, in MLX with the metal backend the device is never destroyed before the scheduler, because when the scheduler is constructed it creates the static device if it does not exist. Maybe we can do the same with the CUDA backend?

See e.g. https://github.com/ml-explore/mlx/blob/main/mlx/scheduler.h#L70-L72

awni avatar Mar 21 '25 14:03 awni

In fact, in MLX with the metal backend the device is never destroyed before the scheduler, because when the scheduler is constructed it creates the static device if it does not exist. Maybe we can do the same with the CUDA backend?

Yeah, we can do the same in the cuda backend to work around this problem. But the original Windows issue remains unsolved: joining threads on process exit hangs.

(Actually I found a detailed report of this issue, and the reason for the hang is different from what I assumed above. Technically it is a bug in MSVC, but I wouldn't count on Microsoft to ever fix it.)

A possible compromise is to add an #ifdef to leak the scheduler only on Windows, with a comment saying we can remove it once the bug is fixed.

zcbenz avatar Mar 22 '25 00:03 zcbenz

Your solution sounds reasonable:

  • Force the initialization order for CUDA
  • #ifdef the heap allocation for Windows

awni avatar Mar 22 '25 00:03 awni

I have updated this PR.

zcbenz avatar Mar 22 '25 06:03 zcbenz