
Segfault during inference

Open maxlund opened this issue 6 months ago • 4 comments

Crashed Thread:        23

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000700
Exception Codes:       0x0000000000000001, 0x0000000000000700

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [49912]


Thread 23 Crashed:
0   AGXMetalG13X                  	       0x32128d734 -[AGXG13XFamilyCommandBuffer tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:] + 180
1   AGXMetalG13X                  	       0x32128d618 -[AGXG13XFamilyCommandBuffer computeCommandEncoderWithConfig:] + 84
2   AGXMetalG13X                  	       0x32128d544 -[AGXG13XFamilyCommandBuffer computeCommandEncoderWithDispatchType:] + 136
3   libmlx.dylib                  	       0x32351dda8 mlx::core::metal::CommandEncoder::CommandEncoder(mlx::core::metal::DeviceStream&) + 140
4   libmlx.dylib                  	       0x3235203a8 mlx::core::metal::Device::get_command_encoder(int) + 284
5   libmlx.dylib                  	       0x323559640 mlx::core::RandomBits::eval_gpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 484
6   libmlx.dylib                  	       0x32355605c mlx::core::metal::eval(mlx::core::array&) + 192
7   libmlx.dylib                  	       0x322b0608c mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool) + 4736
8   libmlx.dylib                  	       0x322b06c58 mlx::core::async_eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>) + 112
9   core.cpython-310-darwin.so    	       0x320219d78 0x320180000 + 630136

Environment:

import numpy as np
import mlx.core as mx
import mlx_whisper
import platform

print(f"numpy: {np.__version__}")
print(f"mlx: {mx.__version__}")
print(f"mlx_whisper: {mlx_whisper.__version__}")
print(f"macOS {platform.mac_ver()}")

Prints:

numpy: 1.26.4
mlx: 0.24.1
mlx_whisper: 0.4.1
macOS ('15.0.1', ('', '', ''), 'arm64')

I tried to reproduce it by letting it run overnight, transcribing many hours of audio, but had no luck. I've seen it happen a few times now, though.

Thanks for all your great work!

maxlund avatar Jun 04 '25 07:06 maxlund

Thanks for the crash report. That's pretty odd. Are you able to share the code that caused the crash? Even if the crash isn't consistent, that would be a great help in pinning this down.

awni avatar Jun 04 '25 13:06 awni

Sure, although it's not very interesting. I'm not sure which file was being transcribed when it happened. I can let it run overnight and see if I can reproduce it. Either way, it worked the next time I ran it, so it's not a corrupt file.

Something like:

import mlx_whisper

mlx_whisper_result = mlx_whisper.transcribe(
    path_or_hf_repo=path_to_model,
    audio=audio_file,
    language=language,
    verbose=False
)

That's it as far as the mlx_whisper code is concerned.

It also seems that memory usage climbs to very high levels when transcribing for many hours, with a lot of swap being used. This seems especially prevalent when transcribing very long audio files. I tried profiling the Python code to find possible memory leaks but couldn't see anything. I guess it's too deep in native code to be visible from Python? macOS must be doing something smart, though, because I often don't notice any slowdown of my system even when swap is at 20GB on my 16GB M1 MBP.
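Python-level profilers such as tracemalloc only see allocations made through Python's own allocator, which may be why nothing showed up. As a rough sketch (not something from this thread), the process-wide peak resident set size can be read with the standard library's resource module and logged after each file. The helper names and the process callback here are hypothetical, and this figure may not fully reflect Metal buffer allocations.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_with_memory_log(files, process):
    """Call process(path) for each path and record peak RSS after each call.

    `process` is a hypothetical callback, e.g. a small wrapper around
    mlx_whisper.transcribe for a single audio file.
    """
    readings = []
    for path in files:
        process(path)
        readings.append((path, peak_rss_mb()))
        print(f"{path}: peak RSS {readings[-1][1]:.1f} MiB", flush=True)
    return readings
```

If the reading keeps climbing from file to file, that points at growth outside what Python profilers can see. If it plateaus, the high swap usage probably has another cause.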

One thing that is probably interesting to note is that I'm running mlx_whisper from within a Python binary application built with PyInstaller.

maxlund avatar Jun 04 '25 18:06 maxlund

It would be useful to know if there is a pattern to the segfault, for example whether it always reproduces on a certain file or on files of a certain size.
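One way to find such a pattern, sketched here under assumptions, is to write a breadcrumb to disk before each file is processed, so that after a crash the last START line without a matching DONE names the file and its size. The function name, the log file name, and the transcribe callback are hypothetical. The callback could wrap the mlx_whisper.transcribe call for one file.

```python
import os
import time


def transcribe_with_breadcrumbs(files, transcribe, log_path="breadcrumbs.log"):
    """Run transcribe(path) per file, logging START/DONE lines to log_path."""
    results = {}
    with open(log_path, "a", encoding="utf-8") as log:
        for path in files:
            size = os.path.getsize(path)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} START {path} bytes={size}\n")
            # Flush and fsync so the line is on disk even if the process
            # dies with a segfault during the call below.
            log.flush()
            os.fsync(log.fileno())

            results[path] = transcribe(path)

            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} DONE {path}\n")
            log.flush()
    return results
```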

Another option if you are feeling a bit bolder is to run the program in a debugger and see if we can get the stack trace for when it segfaults.
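A lighter-weight complement to a debugger, offered as a sketch rather than something tried in this thread, is Python's built-in faulthandler module. Once enabled, it writes the Python-level traceback of every thread when the process receives a fatal signal such as SIGSEGV. It does not show native frames, so it supplements the macOS crash report rather than replacing it. The function name and log file name are hypothetical.

```python
import faulthandler


def enable_crash_traces(log_path="crash_trace.log"):
    """Dump Python tracebacks for all threads to log_path on a fatal signal."""
    # The file object must stay open for the life of the process, because
    # faulthandler writes to its file descriptor from the signal handler.
    log_file = open(log_path, "a", encoding="utf-8")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this once at startup, before any transcription begins, and keeping the returned file object referenced should be enough. Because it is enabled in code rather than on the command line, it should also work inside a frozen application.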

Any info you can provide would help a lot to narrow this down. Otherwise it's quite unlikely we'll be able to do much here until we happen to find a more consistent repro.

awni avatar Jun 06 '25 22:06 awni

Yep, I realize this is very difficult to act on without anything more to go on than what I've provided so far. I'll keep trying to reproduce it and get back to you once I have something.

maxlund avatar Jun 07 '25 08:06 maxlund