When the app crashes in the background, the exception cannot be caught.
This is thrown on the Metal worker thread -- it is possible that nothing is set up to deliver it back to the calling thread (the one calling `eval`).
I think this is the same as ml-explore/mlx-swift#274 -- the suggestion to try and catch the error didn't work because the error wasn't surfaced there.
> This is thrown on the metal worker thread -- it is possible that this is not set up to deliver this back to the calling thread (the one calling eval).

I think so too. Is there any way to catch this exception? I've been blocked for 3 or 4 days, and I don't have any ideas.
> I think this is the same as #274 -- the suggestion to try and catch the error didn't work because the error wasn't surfaced there.

Right, the exception still cannot be caught.
It would have to be a change on the mx::core side (mlx project).
Let me see about transferring this issue.
Specifically the request is:
- if there is an uncaught exception on the worker thread, surface that or a proxy back in the eval
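To illustrate the request, here is a minimal sketch in standard C++ of how an exception thrown on a worker thread can be handed back to the caller. This is not MLX code and does not reflect how MLX schedules work internally; `run_on_worker` and `eval_with_propagation` are hypothetical names, and the error message is made up for the example. It only shows the general `std::promise` / `std::exception_ptr` pattern.

```cpp
#include <exception>
#include <future>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for work that may fail on a worker thread.
int run_on_worker(bool fail) {
  if (fail) {
    throw std::runtime_error("GPU became unavailable");
  }
  return 42;
}

// Hypothetical eval-like call: runs the work on another thread and
// rethrows any exception from that thread on the calling thread.
int eval_with_propagation(bool fail) {
  std::promise<int> result;
  std::future<int> future = result.get_future();

  std::thread worker([&result, fail] {
    try {
      result.set_value(run_on_worker(fail));
    } catch (...) {
      // Capture the in-flight exception instead of letting it escape
      // the thread, which would call std::terminate.
      result.set_exception(std::current_exception());
    }
  });

  worker.join();
  // get() returns the value, or rethrows the stored exception here.
  return future.get();
}
```

With this pattern the caller can wrap `eval_with_propagation` in an ordinary `try`/`catch`. It says nothing about whether the library state is still usable after the failure.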
Any updates, boss?
@awni I think this requires a change on the mlx (core) side -- do you agree? Can you move it to the mlx repo? Thanks!
Yea I can move it... but it's not likely we are going to add this feature in the near future. The reason is that we don't have guarantees on the state being in a reasonable condition if there is an exception during eval.
It's better to treat eval as something which shouldn't crash in your application. If it is crashing, then we should fix that (if it's an MLX issue) or you should fix it in the calling code if it's an issue there.
I think this might be a race -- the GPU could become unavailable after eval is called, e.g. if an app goes into the background on iOS.
My model is quite time-consuming when performing eval, probably taking about 5 seconds. If I switch to the background while eval is in progress, it crashes, and currently, I have no way to prevent it.
Related: https://github.com/ml-explore/mlx/issues/2106, https://github.com/ml-explore/mlx/issues/1231, https://github.com/ml-explore/mlx/issues/1363. We should probably merge them into one issue.