
[Bug] Crash in `item()`.

Open tqtifnypmb opened this issue 1 year ago • 16 comments

Code:

let tokenId: Int = logits[idx].item()

It occurs occasionally and is not 100% reproducible.

Crash Stacks:

0   libsystem_kernel.dylib        	0x00000001e9a86974 __pthread_kill + 8 (:-1)
1   libsystem_pthread.dylib       	0x00000001fd5090ec pthread_kill + 268 (pthread.c:1717)
2   libsystem_c.dylib             	0x00000001a94d7b80 abort + 180 (abort.c:118)
3   libc++abi.dylib               	0x00000001fd428ccc abort_message + 132 (abort_message.cpp:78)
4   libc++abi.dylib               	0x00000001fd418e68 demangling_terminate_handler() + 320 (cxa_default_handlers.cpp:72)
5   libobjc.A.dylib               	0x00000001994860cc _objc_terminate() + 160 (objc-exception.mm:499)
6   libc++abi.dylib               	0x00000001fd428090 std::__terminate(void (*)()) + 16 (cxa_handlers.cpp:59)
7   libc++abi.dylib               	0x00000001fd42b2e4 __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88 (cxa_exception.cpp:152)
8   libc++abi.dylib               	0x00000001fd42b228 __cxa_throw + 308 (cxa_exception.cpp:283)
9  Demo              	                 0x000000010360acf0 mlx::core::reshape(mlx::core::array const&, std::__1::vector<int, std::__1::allocator<int>>, std::__1::variant<std::__1::monostate, mlx::core::Stream, mlx::core::Device>) + 916 (ops.cpp:328)
10  Demo              	                 0x00000001031cb78c mlx_reshape + 108 (ops.cpp:633)
11  Demo              	                 0x00000001036c2cd8 MLXArray.reshaped(_:stream:) + 256 (MLXArray+Ops.swift:1974)
12  Demo              	                 0x00000001036b2610 getItemND(src:operations:stream:) + 2404 (MLXArray+Indexing.swift:722)
13  Demo              	                 0x00000001036b5190 MLXArray.subscript.getter + 436

Any hints about what is happening behind the scenes or how to fix it?

Thanks

tqtifnypmb avatar May 04 '24 04:05 tqtifnypmb

Sure, there may be more to the message -- it is throwing an exception here:

6   libc++abi.dylib               	0x00000001fd428090 std::__terminate(void (*)()) + 16 (cxa_handlers.cpp:59)
7   libc++abi.dylib               	0x00000001fd42b2e4 __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88 (cxa_exception.cpp:152)
8   libc++abi.dylib               	0x00000001fd42b228 __cxa_throw + 308 (cxa_exception.cpp:283)
9   AI Transcription              	0x000000010360acf0 mlx::core::reshape(mlx::core::array const&, std::__1::vector<int, std::__1::allocator<int>>, std::__1::variant<std::__1::monostate, mlx::core::Stream, mlx::core::Device>) + 916 (ops.cpp:328)

That is this line:

https://github.com/ml-explore/mlx/blob/main/mlx/ops.cpp#L328

    msg << "[reshape] Cannot reshape array of size " << a.size()
        << " into shape " << shape << ".";
    throw std::invalid_argument(msg.str());

You are doing an array index operation, not a reshape, but there are reshapes inside the indexing code. It would be good to see what the exception message is -- if there is a crashlog it might be in there.

If you get this in Xcode you can go to frame 9 and:

p msg.str()

(I think that is the right call, but it might be msg.c_str().)

davidkoski avatar May 04 '24 04:05 davidkoski

Another two cases.

They appear to be due to the same issue (a crash during eval).

I wonder if running it on the CPU will make a difference?


Case 1:

#0	0x00000001084a0428 in pthread_mutex_lock ()
#1	0x000000018c7f0300 in std::__1::mutex::lock() ()
#2	0x00000001035be5c8 in std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/usr/include/c++/v1/__mutex/unique_lock.h:41
#3	0x00000001035be504 in std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/usr/include/c++/v1/__mutex/unique_lock.h:40
#4	0x000000010425fd10 in void mlx::core::scheduler::StreamThread::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:61
#5	0x000000010425fce0 in void mlx::core::scheduler::Scheduler::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::Stream const&, mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /gsebzzaktrijzzdiyqoncozizhfk/SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:150
#6	0x000000010424d908 in void mlx::core::scheduler::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::Stream const&, mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:157
#7	0x000000010424d2cc in mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/transforms.cpp:165
#8	0x000000010424da74 in mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/transforms.cpp:176
#9	0x00000001035c0034 in mlx::core::array::eval() at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/array.cpp:101
#10	0x0000000103554b84 in ::mlx_array_eval(mlx_array) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/array.cpp:189
#11	0x00000001042f3578 in MLXArray.asArray<τ_0_0>(_:) at /SourcePackages/checkouts/mlx-swift/Source/MLX/MLXArray.swift:264

Case 2:

0   libsystem_pthread.dylib       	0x00000001ffd2f400 pthread_mutex_lock + 12 (pthread_mutex.c:1709)
1   libc++.1.dylib                	0x00000001a5e3ad20 std::__1::mutex::lock() + 16 (mutex.cpp:38)
2   Demo              	0x0000000100f9505c std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) + 4 (unique_lock.h:41)
3   Demo              	0x0000000100f9505c std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) + 4 (unique_lock.h:40)
4   Demo              	0x0000000100f9505c void mlx::core::scheduler::StreamThread::enqueue<std::__1::function<void ()>>(std::__1::function<void ()>&&) + 28 (scheduler.h:61)
5   Demo              	0x0000000100f958b0 void mlx::core::scheduler::Scheduler::enqueue<std::__1::function<void ()>>(mlx::core::Stream const&, std::__1::function<void ()>&&) + 16 (scheduler.h:150)
6   Demo              	0x0000000100f958b0 void mlx::core::scheduler::enqueue<std::__1::function<void ()>>(mlx::core::Stream const&, std::__1::function<void ()>&&) + 20 (scheduler.h:157)
7   Demo              	0x0000000100f958b0 mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool) + 1788 (transforms.cpp:144)
8   Demo              	0x0000000100f95e54 mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>) + 56 (transforms.cpp:176)
9   Demo             	0x0000000100b0567c mlx::core::array::eval() + 196 (array.cpp:101)
10  Demo              	0x0000000101003de8 MLXArray.item<A>(_:) + 236 (MLXArray.swift:169)
11  Demo              	0x00000001009b8408 specialized MLXArray.item<A>() + 24 (<compiler-generated>:0)

tqtifnypmb avatar May 05 '24 03:05 tqtifnypmb

These look a little different -- I think it might be an assertion failure inside the pthread mutex. Are you using threads or async?

davidkoski avatar May 05 '24 04:05 davidkoski

If you have a snippet that can reproduce the error I can investigate it.

davidkoski avatar May 05 '24 04:05 davidkoski

All three cases above occur in a thread.

I can't provide meaningful snippets. I simply call item() and it occasionally crashes; I have found no pattern that reproduces it.

tqtifnypmb avatar May 05 '24 04:05 tqtifnypmb

MLXArray is not thread safe -- you can't create one on one thread and evaluate it (e.g. by calling item()) on another. This includes loading a model and running inference on different threads.

Try it without threads and see how it fares, if you can.
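As a sketch of one way to satisfy that constraint (the queue name `mlxQueue` and the `sampleToken` helper are illustrative, not part of MLX), confine all MLXArray creation and evaluation to a single serial queue:

```swift
import Dispatch
import MLX

// One serial queue owns every MLXArray, so array creation and the
// implicit eval inside item() always happen on the same thread.
let mlxQueue = DispatchQueue(label: "mlx.serial")

func sampleToken(_ logits: MLXArray, at idx: Int) -> Int {
    // Both the subscript and item() run on mlxQueue,
    // never on the caller's thread.
    mlxQueue.sync { logits[idx].item(Int.self) }
}
```

The same idea applies to model loading: do it on `mlxQueue` as well, so the arrays never cross threads.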

davidkoski avatar May 06 '24 01:05 davidkoski

            throw std::invalid_argument(
                "[eval] Attempting to eval an array without a primitive.");

and I can reproduce that in Python -- it looks like an issue in mlx.core.

davidkoski avatar May 06 '24 15:05 davidkoski

Filed https://github.com/ml-explore/mlx/issues/1083.

davidkoski avatar May 06 '24 15:05 davidkoski

Strange, there used to be a snippet of code in here! Anyway, per the discussion in the other issue, this is the problem:

func foo(x: MLXArray) -> MLXArray {
    let y = MLX.softmax(x, axis: -1)
    let logits = y[0,0,0].item(Float.self)
    return MLX.full([1, 100], values: MLXArray(logits))
}

The call to item() inside this function is illegal inside a compile (and it doesn't give a clear message) -- compile requires that the code produce an unbroken graph. Change it like this:

func foo(x: MLXArray) -> MLXArray {
    let y = MLX.softmax(x, axis: -1)
    let logits = y[0,0,0]
    return MLX.full([1, 100], values: logits)
}

davidkoski avatar May 06 '24 16:05 davidkoski

Oops! I deleted the snippet because it didn't accurately reflect my question.

tqtifnypmb avatar May 07 '24 04:05 tqtifnypmb

More crash info:

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x00005fff69a438b0
Exception Codes:       0x0000000000000001, 0x00005fff69a438b0

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [3477]

VM Region Info: 0x5fff69a438b0 is not in any region.  Bytes after previous region: 105069557332145  Bytes before following region: 2522597200
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      commpage (reserved)        1000000000-7000000000   [384.0G] ---/--- SM=NUL  ...(unallocated)
--->  GAP OF 0x5f9000000000 BYTES
      MALLOC_NANO              600000000000-600020000000 [512.0M] rw-/rwx SM=PRV  

tqtifnypmb avatar May 08 '24 08:05 tqtifnypmb

OK, that is trying to access memory that is not mapped. The most likely cause is out of bounds indexing: the underlying mlx::core::array does not do bounds checking.

What does idx look like here and what is the shape of your array?
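While debugging, one defensive option (sketch only; the `safeItem` helper is illustrative, not an MLX API) is to validate the index against the array's shape before subscripting, since the core array does no bounds checking of its own:

```swift
import MLX

// mlx::core::array does not bounds-check, so guard before indexing.
func safeItem(_ array: MLXArray, _ idx: Int) -> Int? {
    guard let dim = array.shape.first, idx >= 0, idx < dim else {
        return nil  // would otherwise read unmapped memory
    }
    return array[idx].item(Int.self)
}
```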

davidkoski avatar May 08 '24 15:05 davidkoski

I am pretty sure that idx is logically within bounds (disregarding the possibility that MLX's array operations themselves produce incorrect results).

I think there are some data race issues, and changing the default device seems to trigger some synchronization mechanism. After testing for a few days, I have found the following workaround (so far there have been no further crashes):

// run on gpu
let logits = model(x)

MLX.Device.setDefault(device: .cpu)

// all array operations on `logits` that call `eval` under the hood now run on the cpu

tqtifnypmb avatar May 13 '24 08:05 tqtifnypmb

Well, running everything on the cpu probably isn't getting all the advantages of mlx :-) If you ever have a repro case I would be happy to investigate further.

If not, are we ok to close this?

davidkoski avatar May 20 '24 15:05 davidkoski

It might be an issue in core as well -- hard to tell because there is no repro code here. But @tqtifnypmb, if you are able to provide the Swift repro code, or similar code in Python, that would be useful.

awni avatar May 20 '24 15:05 awni

I would very much like to help, and I've tried to extract a minimal repro. In the process I found that the logic is the same as whisper-mlx, so I guess running whisper should reproduce it as well. It's worth noting that most of the crashes occur on iOS; the frequency on macOS is very low.

tqtifnypmb avatar May 23 '24 06:05 tqtifnypmb