Unit test Floating Point Exception
Building from source and running on an M1 I am getting a floating point exception from the Negatie slice and ascending bounds portion of the test_array.TestArray.test_slice_negative_step unit test. All tests pass when commenting the following lines out:
# Negatie slice and ascending bounds
b_np = a_np[0:20:-3]
b_mx = a_mx[0:20:-3]
self.assertTrue(np.array_equal(b_np, b_mx))
Excerpted exception details:
Exception Type: EXC_ARITHMETIC (SIGFPE)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Termination Reason: Namespace SIGNAL, Code 8 Floating point exception: 8
Thread 0:: Dispatch queue: com.apple.main-thread
0 ??? 0x7ff89ac8e9a8 ???
1 libsystem_kernel.dylib 0x7ff80b0430fe __psynch_cvwait + 10
2 libsystem_pthread.dylib 0x7ff80b07f758 _pthread_cond_wait + 1242
3 libc++.1.dylib 0x7ff80afbc1e2 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
4 libc++.1.dylib 0x7ff80afbca51 std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock<std::__1::mutex>&) + 45
5 libc++.1.dylib 0x7ff80afbcaba std::__1::__assoc_sub_state::wait() + 46
6 libmlx.dylib 0x10afa5648 mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, bool) + 4520
Thread 11 Crashed:
0 AGXMetal13_3 0x7ffa243eec5b AGX::ComputeService<AGX::G13::Encoders, AGX::G13::Classes, AGX::G13::ObjClasses, AGX::G13::EncoderComputeServiceClasses>::executeKernelWithThreadsPerGridImpl(eAGXDataBufferPools, MTLSize, MTLSize) + 363
1 AGXMetal13_3 0x7ffa243e90d5 -[AGXG13GFamilyComputeContext dispatchThreads:threadsPerThreadgroup:] + 197
2 libmlx.dylib 0x10b7876ae mlx::core::Arange::eval_gpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 1742
3 libmlx.dylib 0x10b7833cf std::__1::__function::__func<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2>, void ()>::operator()() + 143
4 libmlx.dylib 0x10afa2706 mlx::core::scheduler::StreamThread::thread_fn() + 518
5 libmlx.dylib 0x10afa2953 void* std::__1::__thread_proxy[abi:v160006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*>>(void*) + 67
6 libsystem_pthread.dylib 0x7ff80b07f1d3 _pthread_start + 125
7 libsystem_pthread.dylib 0x7ff80b07abd3 thread_start + 15
Thread 11 crashed with X86 Thread State (64-bit):
rax: 0x000000007fffffff rbx: 0x0000000000000000 rcx: 0x0000000000000010 rdx: 0x0000000000000000
rdi: 0x0000000000000000 rsi: 0x0000000000000001 rbp: 0x00000003028c6c60 rsp: 0x00000003028c6bf0
r8: 0x0000000000000002 r9: 0x0000600003da15f8 r10: 0x00007ffa4dc030d8 r11: 0x00007ff814d09672
r12: 0x00007fc888350000 r13: 0x0000000000000000 r14: 0x00000003028c6c70 r15: 0x0000000000000000
rip: <unavailable> rfl: 0x0000000000000203
tmp0: 0x0000000000000000 tmp1: 0x000000007fffffff tmp2: 0x0000000000000000
That doesn't look good. Which commit are you building on?
Also can you give some more details on your environment? E.g. OS, python version, ...
Sure, this is on the latest commit on main, python 3.11.6, macOS 13.5 on an M1 MacBook Air.
Let me know if there's any other info that might be useful. Happy to try running any test code to help narrow things down.
I'm not able to repro on my M1 but my situation is a bit different :. I'm on 14.2 and M1 Max.
Does running the test standalone cause the issue as well?
e.g. just running with TestArray.test_slice_negative_step
Digging deeper, I am also getting the same exception from a few of the the C++ tests, including autograd_tests, creations_tests, and most interestingly this simple test from metal_tests:
a = array(std::initializer_list<bool>{});
CHECK_EQ(all(a, Device::gpu).item<bool>(), true);
The first two made it seem like a logic issue with empty slices, but the common thread seems to be array::eval on an empty array. Something that basic should be easily reproducible which points to something specific to my machine, but I have no idea what it could be.
Wow, that's interesting..
Can you try running some of those tests standalon? Like just the test that fails? (Comment out the rest or something). I'm really curious about:
- Is it reproducible consistenly
- Does it involve some state being corrupted from other tests or is it really just that test having issues..
Debugging just the test metal reduce test case it is reliably hitting a divide-by-zero (not floating-point as I saw in the others) exception in reduce.cpp due to thread_group_size (as well as in_size and mod_in_size) being 0.
This may be a separate issue than the floating-point exception.
Cool, thanks for finding that! That looks like an easy fix now that you narrowed it down. Interested in sending a PR for it?
I added an early-exit to reduce which allowed the metal tests to pass, but now I'm seeing more failures from ops_tests (in addition to autograd and creations mentioned above). That, plus the fact that these tests are passing on your machine, makes me think there's some deeper issue here.
The common thread is that all of these failures are from operations on empty arrays. It seems surprising the answer would be adding checks for length == 0 all over the place, but I'm still unfamiliar with the library internals.
added an early-exit to reduce which allowed the metal tests to pass
Rather than an early exit, I was thinking to check if there is a zero in the denom and set the thread group size to zero in that case.
That, plus the fact that these tests are passing on your machine, makes me think there's some deeper issue here.
This is very strange to me as well...
It seems surprising the answer would be adding checks for length == 0 all over the place, but I'm still unfamiliar with the library internals.
No that doesn't seem like a good idea. Most of our primitives can handle 0 sized inputs.
@ccostes There is definitely a divide by zero. The fact we don't get an exception makes me think you are using a different compiler as it is generally undefined behavior / compiler specific. I can get the divide by zero error if I add:
add_compile_options(-fsanitize=integer-divide-by-zero)
link_libraries("-fsanitize=integer-divide-by-zero")
@ccostes could you check my PR and see if it fixes your issue? I didn't see any other instances where we divide by integer 0.
#388
I think I found the problem: Rosetta! When building I didn't notice the following:
-- Building MLX for x86_64 processor on Darwin
CMake Warning at CMakeLists.txt:33 (message):
Building for x86_64 on macOS is not supported. If you are on an Apple
silicon system, make sure you are building for arm64.
I had iterm set to run with Rosetta, but even when unsetting that and verifying that uname -m returns arm64 (where it was x86_64 previously) cmake is still reporting that it is building for x86_64...
% uname -m
arm64
build % cmake ..
-- Building MLX for x86_64 processor on Darwin
CMake Warning at CMakeLists.txt:33 (message):
Building for x86_64 on macOS is not supported. If you are on an Apple
silicon system, make sure you are building for arm64.
EDIT: Turns out I was using the x86_64 version of homebrew this whole time, too! After re-installing homebrew, then cmake, and rebuilding mlx all tests are now passing!! 🎉
Sorry for the runaround, but thanks for all your help!
I'm having a similar issue when running tests from autograd_tests.cpp.
I follow your suggestions @ccostes, but I already have my terminal running without Rosseta
% uname -p
arm
And homebrew running over ARM build
% brew config
HOMEBREW_VERSION: 4.2.3
ORIGIN: https://github.com/Homebrew/brew
HEAD: b3751bca8c4a3c76107d5aa3b75b81db93cf4bb6
Last commit: 3 days ago
Core tap JSON: 11 Jan 03:57 UTC
Core cask tap JSON: 11 Jan 03:57 UTC
HOMEBREW_PREFIX: /opt/homebrew
HOMEBREW_CASK_OPTS: []
HOMEBREW_MAKE_JOBS: 10
Homebrew Ruby: 3.1.4 => /opt/homebrew/Library/Homebrew/vendor/portable-ruby/3.1.4/bin/ruby
CPU: 10-core 64-bit arm_firestorm_icestorm
Clang: 15.0.0 build 1500
Git: 2.39.3 => /Applications/Xcode.app/Contents/Developer/usr/bin/git
Curl: 8.4.0 => /usr/bin/curl
macOS: 14.2.1-arm64
CLT: 15.1.0.0.1.1700200546
Xcode: 15.2
Rosetta 2: false
When I run the build command
% cmake .. && make -j
-- The CXX compiler identification is AppleClang 15.0.0.15000100
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Building MLX for x86_64 processor on Darwin
CMake Warning at CMakeLists.txt:33 (message):
Building for x86_64 on macOS is not supported. If you are on an Apple
silicon system, make sure you are building for arm64.
.....
And the tests raise and exception mlx/tests/autograd_tests.cpp:724: FATAL ERROR: test case CRASHED: SIGFPE - Floating point error signal
Yea I don't know if the Rosetta thing in terminal actually matters, I'm having doubts about that now
But I spot your problem:
-- Building MLX for x86_64 processor on Darwin
CMake Warning at CMakeLists.txt:33 (message):
Building for x86_64 on macOS is not supported. If you are on an Apple
silicon system, make sure you are building for arm64.
You are building for x86 for some reason.
Could you do which cmake and see what you get?
Thanks in advance for the time to answer me 🙏🏽
% which cmake
/usr/local/bin/cmake
I also reinstall XCode Command Line Tools and XCode Editor as well but the error continue
That's the x86 brew cmake installation. Might just need to brew install cmake again.
On Thu, Jan 11, 2024 at 12:43 PM Jhenner Tigreros @.***> wrote:
Thanks in advance for the time to answer me 🙏🏽
% which cmake /usr/local/bin/cmake
I also reinstall XCode Command Line Tools and XCode Editor as well but the error continue
— Reply to this email directly, view it on GitHub https://github.com/ml-explore/mlx/issues/338#issuecomment-1887931715, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6EBAS24LV2VFGCIQPMQLYOBFF5AVCNFSM6AAAAABBJIOVOWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBXHEZTCNZRGU . You are receiving this because you were mentioned.Message ID: @.***>
I install cmake from brew and execute the cmake command and get the same output.
% file /opt/homebrew/bin/cmake
/opt/homebrew/bin/cmake: Mach-O 64-bit executable arm64
% /opt/homebrew/bin/cmake ..
-- Building MLX for x86_64 processor on Darwin
CMake Warning at CMakeLists.txt:33 (message):
Building for x86_64 on macOS is not supported. If you are on an Apple
silicon system, make sure you are building for arm64.
....
My hombrew installation is also arm64
% brew config HOMEBREW_VERSION: 4.2.3 ORIGIN: https://github.com/Homebrew/brew HEAD: b3751bca8c4a3c76107d5aa3b75b81db93cf4bb6 Last commit: 3 days ago Core tap JSON: 11 Jan 03:57 UTC Core cask tap JSON: 11 Jan 03:57 UTC HOMEBREW_PREFIX: /opt/homebrew HOMEBREW_CASK_OPTS: [] HOMEBREW_MAKE_JOBS: 10 Homebrew Ruby: 3.1.4 => /opt/homebrew/Library/Homebrew/vendor/portable-ruby/3.1.4/bin/ruby CPU: 10-core 64-bit arm_firestorm_icestorm Clang: 15.0.0 build 1500 Git: 2.39.3 => /Applications/Xcode.app/Contents/Developer/usr/bin/git Curl: 8.4.0 => /usr/bin/curl macOS: 14.2.1-arm64 CLT: 15.1.0.0.1.1700200546 Xcode: 15.2 Rosetta 2: false
Can you share the output of clang -v and g++ -v ?
Sure,
% clang -v
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Target: arm64-apple-darwin23.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
% g++ -v
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Target: arm64-apple-darwin23.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
Lol what is going on !!!
I have to read more about what triggers this flag CMAKE_HOST_SYSTEM_PROCESSOR as we clearly do not understand it well enough. Clearly it thinks you are building for x86 but all of your defaults look like they are for arm so I am super confused.
On macOS, the value of uname -m is used by default. However, since this might vary based on whether you're using x86 or ARM CMake, version 3.19.2+ will use the value of CMAKE_APPLE_SILICON_PROCESSOR (either CMake or environment variable) instead, if it is set. It also normalizes Power Macintosh to powerpc.
https://stackoverflow.com/questions/70475665/what-are-the-possible-values-of-cmake-system-processor
Can you try compiling with:
env CMAKE_APPLE_SILICON_PROCESSOR=arm64 /opt/homebrew/bin/cmake ..
make -j
I was reading a bit about this last night, but does not find any solution to the problem. I tried to reinstall all my dependencies but doesn't work, also I try to compile with the default bash terminal.
Can you try compiling with:
env CMAKE_APPLE_SILICON_PROCESSOR=arm64 /opt/homebrew/bin/cmake .. make -j
Gives me the same output related to building the code on x86 🤔 . If I force set env variable set(CMAKE_HOST_SYSTEM_PROCESSOR "arm64") in the CMakeLists.txt config file, it works
% cmake ..
-- Building MLX for arm64 processor on Darwin
-- Building METAL sources
-- Building with SDK for macOS version 14.2
-- Downloading json
-- Using the multi-header code from /Users/jhennertigreros/Documents/Personal/Code/mlx/build/_deps/json-src/include/
-- Accelerate found /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/System/Library/Frameworks/Accelerate.framework
CMake Deprecation Warning at build/_deps/doctest-src/CMakeLists.txt:1 (cmake_minimum_required):
Compatibility with CMake < 3.5 will be removed from a future version of
CMake.
Update the VERSION argument <min> value or use a ...<max> suffix to tell
CMake that the project does not need compatibility with older versions.
-- Configuring done (10.2s)
-- Generating done (0.1s)
-- Build files have been written to: /Users/jhennertigreros/Documents/Personal/Code/mlx/build
but the building process got some errors.
% make -j
/Users/xxxx/Documents/Personal/Code/mlx/mlx/backend/accelerate/softmax.cpp:78:3: error: unknown type name 'int16x8_t'; did you mean 'int16_t'?
int16x8_t epart = vcvtq_s16_f16(ipart);
^~~~~~~~~
int16_t
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/sys/_types/_int16_t.h:30:33: note: 'int16_t' declared here
typedef short int16_t;
^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
make[2]: *** [CMakeFiles/mlx.dir/mlx/backend/accelerate/softmax.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/mlx.dir/all] Error 2
make: *** [all] Error 2
Digging in my current cmake --system-informationI found that the variable has the correct value
% cmake --system-information
....
CMAKE_HOST_SYSTEM_NAME "Darwin"
CMAKE_HOST_SYSTEM_PROCESSOR "arm64"
CMAKE_HOST_SYSTEM_VERSION "23.2.0"
...
Could you try running with make VERBOSE=1 just to see the actual compile commands. It will spit out a bunch of stuff, but just one of them would be interesting to see, something like:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -DACCELERATE_NEW_LAPACK -DMETAL_PATH="/Users/awni/mlx/build/mlx/backend/metal/kernels//mlx.metallib" -D_METAL_ -I/Users/awni/mlx/build/_deps/metal_cpp-src -I/Users/awni/mlx/build/_deps/json-src/single_include/nlohmann -I/Users/awni/mlx/build/_deps/gguflib-src -I/Users/awni/mlx -O3 -DNDEBUG -std=gnu++17 -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk -fPIC -MD -MT CMakeFiles/mlx.dir/mlx/linalg.cpp.o -MF CMakeFiles/mlx.dir/mlx/linalg.cpp.o.d -o CMakeFiles/mlx.dir/mlx/linalg.cpp.o -c /Users/awni/mlx/mlx/linalg.cpp
I just restarted my whole system and the cmake command start working without forcing the env variable in CMakeList file, I read a lot of stuff online my best guest is that there was some random cache over the CMAKE_APPLE_SILICON_PROCESSOR variable that set that as blank value (my cmake --system-information display as blank value before restarting) and then cascade the default value to x86_64 for the dependent variables, I suspect that some of my reinstallation attempts override the variable.
Thanks for the time dropped here @awni
Wow, the age old solution "restart" actually worked here. I'm dumbfounded, but glad it's fixed.