mlx icon indicating copy to clipboard operation
mlx copied to clipboard

Unit test Floating Point Exception

Open ccostes opened this issue 1 year ago • 12 comments

Building from source and running on an M1 I am getting a floating point exception from the Negatie slice and ascending bounds portion of the test_array.TestArray.test_slice_negative_step unit test. All tests pass when commenting the following lines out:

# Negatie slice and ascending bounds
b_np = a_np[0:20:-3]
b_mx = a_mx[0:20:-3]
self.assertTrue(np.array_equal(b_np, b_mx))

Excerpted exception details:

Exception Type:        EXC_ARITHMETIC (SIGFPE)
Exception Codes:       0x0000000000000001, 0x0000000000000000
Termination Reason:    Namespace SIGNAL, Code 8 Floating point exception: 8


Thread 0::  Dispatch queue: com.apple.main-thread
0   ???                           	    0x7ff89ac8e9a8 ???
1   libsystem_kernel.dylib        	    0x7ff80b0430fe __psynch_cvwait + 10
2   libsystem_pthread.dylib       	    0x7ff80b07f758 _pthread_cond_wait + 1242
3   libc++.1.dylib                	    0x7ff80afbc1e2 std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
4   libc++.1.dylib                	    0x7ff80afbca51 std::__1::__assoc_sub_state::__sub_wait(std::__1::unique_lock<std::__1::mutex>&) + 45
5   libc++.1.dylib                	    0x7ff80afbcaba std::__1::__assoc_sub_state::wait() + 46
6   libmlx.dylib                  	       0x10afa5648 mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, bool) + 4520


Thread 11 Crashed:
0   AGXMetal13_3                  	    0x7ffa243eec5b AGX::ComputeService<AGX::G13::Encoders, AGX::G13::Classes, AGX::G13::ObjClasses, AGX::G13::EncoderComputeServiceClasses>::executeKernelWithThreadsPerGridImpl(eAGXDataBufferPools, MTLSize, MTLSize) + 363
1   AGXMetal13_3                  	    0x7ffa243e90d5 -[AGXG13GFamilyComputeContext dispatchThreads:threadsPerThreadgroup:] + 197
2   libmlx.dylib                  	       0x10b7876ae mlx::core::Arange::eval_gpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 1742
3   libmlx.dylib                  	       0x10b7833cf std::__1::__function::__func<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array&, std::__1::vector<std::__1::shared_future<void>, std::__1::allocator<std::__1::shared_future<void>>>, std::__1::shared_ptr<std::__1::promise<void>>, bool)::$_2>, void ()>::operator()() + 143
4   libmlx.dylib                  	       0x10afa2706 mlx::core::scheduler::StreamThread::thread_fn() + 518
5   libmlx.dylib                  	       0x10afa2953 void* std::__1::__thread_proxy[abi:v160006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), mlx::core::scheduler::StreamThread*>>(void*) + 67
6   libsystem_pthread.dylib       	    0x7ff80b07f1d3 _pthread_start + 125
7   libsystem_pthread.dylib       	    0x7ff80b07abd3 thread_start + 15

Thread 11 crashed with X86 Thread State (64-bit):
  rax: 0x000000007fffffff  rbx: 0x0000000000000000  rcx: 0x0000000000000010  rdx: 0x0000000000000000
  rdi: 0x0000000000000000  rsi: 0x0000000000000001  rbp: 0x00000003028c6c60  rsp: 0x00000003028c6bf0
   r8: 0x0000000000000002   r9: 0x0000600003da15f8  r10: 0x00007ffa4dc030d8  r11: 0x00007ff814d09672
  r12: 0x00007fc888350000  r13: 0x0000000000000000  r14: 0x00000003028c6c70  r15: 0x0000000000000000
  rip: <unavailable>       rfl: 0x0000000000000203
 tmp0: 0x0000000000000000 tmp1: 0x000000007fffffff tmp2: 0x0000000000000000

ccostes avatar Jan 01 '24 20:01 ccostes

That doesn't look good. Which commit are you building on?

Also can you give some more details on your environment? E.g. OS, python version, ...

awni avatar Jan 01 '24 20:01 awni

Sure, this is on the latest commit on main, python 3.11.6, macOS 13.5 on an M1 MacBook Air.

Let me know if there's any other info that might be useful. Happy to try running any test code to help narrow things down.

ccostes avatar Jan 01 '24 20:01 ccostes

I'm not able to repro on my M1 but my situation is a bit different :. I'm on 14.2 and M1 Max.

Does running the test standalone cause the issue as well?

e.g. just running with TestArray.test_slice_negative_step

awni avatar Jan 01 '24 22:01 awni

Digging deeper, I am also getting the same exception from a few of the the C++ tests, including autograd_tests, creations_tests, and most interestingly this simple test from metal_tests:

    a = array(std::initializer_list<bool>{});
    CHECK_EQ(all(a, Device::gpu).item<bool>(), true);

The first two made it seem like a logic issue with empty slices, but the common thread seems to be array::eval on an empty array. Something that basic should be easily reproducible which points to something specific to my machine, but I have no idea what it could be.

ccostes avatar Jan 01 '24 23:01 ccostes

Wow, that's interesting..

Can you try running some of those tests standalon? Like just the test that fails? (Comment out the rest or something). I'm really curious about:

  1. Is it reproducible consistenly
  2. Does it involve some state being corrupted from other tests or is it really just that test having issues..

awni avatar Jan 01 '24 23:01 awni

Debugging just the test metal reduce test case it is reliably hitting a divide-by-zero (not floating-point as I saw in the others) exception in reduce.cpp due to thread_group_size (as well as in_size and mod_in_size) being 0.

This may be a separate issue than the floating-point exception. image

ccostes avatar Jan 06 '24 04:01 ccostes

Cool, thanks for finding that! That looks like an easy fix now that you narrowed it down. Interested in sending a PR for it?

awni avatar Jan 06 '24 04:01 awni

I added an early-exit to reduce which allowed the metal tests to pass, but now I'm seeing more failures from ops_tests (in addition to autograd and creations mentioned above). That, plus the fact that these tests are passing on your machine, makes me think there's some deeper issue here.

The common thread is that all of these failures are from operations on empty arrays. It seems surprising the answer would be adding checks for length == 0 all over the place, but I'm still unfamiliar with the library internals.

ccostes avatar Jan 06 '24 04:01 ccostes

added an early-exit to reduce which allowed the metal tests to pass

Rather than an early exit, I was thinking to check if there is a zero in the denom and set the thread group size to zero in that case.

That, plus the fact that these tests are passing on your machine, makes me think there's some deeper issue here.

This is very strange to me as well...

It seems surprising the answer would be adding checks for length == 0 all over the place, but I'm still unfamiliar with the library internals.

No that doesn't seem like a good idea. Most of our primitives can handle 0 sized inputs.

awni avatar Jan 06 '24 14:01 awni

@ccostes There is definitely a divide by zero. The fact we don't get an exception makes me think you are using a different compiler as it is generally undefined behavior / compiler specific. I can get the divide by zero error if I add:

add_compile_options(-fsanitize=integer-divide-by-zero)
link_libraries("-fsanitize=integer-divide-by-zero")

awni avatar Jan 06 '24 14:01 awni

@ccostes could you check my PR and see if it fixes your issue? I didn't see any other instances where we divide by integer 0.

#388

awni avatar Jan 06 '24 14:01 awni

I think I found the problem: Rosetta! When building I didn't notice the following:

-- Building MLX for x86_64 processor on Darwin
CMake Warning at CMakeLists.txt:33 (message):
  Building for x86_64 on macOS is not supported.  If you are on an Apple
  silicon system, make sure you are building for arm64.

I had iterm set to run with Rosetta, but even when unsetting that and verifying that uname -m returns arm64 (where it was x86_64 previously) cmake is still reporting that it is building for x86_64...

% uname -m
arm64
build % cmake ..
-- Building MLX for x86_64 processor on Darwin
CMake Warning at CMakeLists.txt:33 (message):
  Building for x86_64 on macOS is not supported.  If you are on an Apple
  silicon system, make sure you are building for arm64.

EDIT: Turns out I was using the x86_64 version of homebrew this whole time, too! After re-installing homebrew, then cmake, and rebuilding mlx all tests are now passing!! 🎉

Sorry for the runaround, but thanks for all your help!

ccostes avatar Jan 06 '24 17:01 ccostes

I'm having a similar issue when running tests from autograd_tests.cpp.

I follow your suggestions @ccostes, but I already have my terminal running without Rosseta

% uname -p
arm

And homebrew running over ARM build

% brew config
HOMEBREW_VERSION: 4.2.3
ORIGIN: https://github.com/Homebrew/brew
HEAD: b3751bca8c4a3c76107d5aa3b75b81db93cf4bb6
Last commit: 3 days ago
Core tap JSON: 11 Jan 03:57 UTC
Core cask tap JSON: 11 Jan 03:57 UTC
HOMEBREW_PREFIX: /opt/homebrew
HOMEBREW_CASK_OPTS: []
HOMEBREW_MAKE_JOBS: 10
Homebrew Ruby: 3.1.4 => /opt/homebrew/Library/Homebrew/vendor/portable-ruby/3.1.4/bin/ruby
CPU: 10-core 64-bit arm_firestorm_icestorm
Clang: 15.0.0 build 1500
Git: 2.39.3 => /Applications/Xcode.app/Contents/Developer/usr/bin/git
Curl: 8.4.0 => /usr/bin/curl
macOS: 14.2.1-arm64
CLT: 15.1.0.0.1.1700200546
Xcode: 15.2
Rosetta 2: false

When I run the build command

% cmake .. && make -j
-- The CXX compiler identification is AppleClang 15.0.0.15000100
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Building MLX for x86_64 processor on Darwin
CMake Warning at CMakeLists.txt:33 (message):
  Building for x86_64 on macOS is not supported.  If you are on an Apple
  silicon system, make sure you are building for arm64.
.....

And the tests raise and exception mlx/tests/autograd_tests.cpp:724: FATAL ERROR: test case CRASHED: SIGFPE - Floating point error signal

JhennerTigreros avatar Jan 11 '24 16:01 JhennerTigreros

Yea I don't know if the Rosetta thing in terminal actually matters, I'm having doubts about that now

But I spot your problem:

-- Building MLX for x86_64 processor on Darwin
CMake Warning at CMakeLists.txt:33 (message):
  Building for x86_64 on macOS is not supported.  If you are on an Apple
  silicon system, make sure you are building for arm64.

You are building for x86 for some reason.

Could you do which cmake and see what you get?

awni avatar Jan 11 '24 16:01 awni

Thanks in advance for the time to answer me 🙏🏽

% which cmake
/usr/local/bin/cmake

I also reinstall XCode Command Line Tools and XCode Editor as well but the error continue

JhennerTigreros avatar Jan 11 '24 20:01 JhennerTigreros

That's the x86 brew cmake installation. Might just need to brew install cmake again.

On Thu, Jan 11, 2024 at 12:43 PM Jhenner Tigreros @.***> wrote:

Thanks in advance for the time to answer me 🙏🏽

% which cmake /usr/local/bin/cmake

I also reinstall XCode Command Line Tools and XCode Editor as well but the error continue

— Reply to this email directly, view it on GitHub https://github.com/ml-explore/mlx/issues/338#issuecomment-1887931715, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAN6EBAS24LV2VFGCIQPMQLYOBFF5AVCNFSM6AAAAABBJIOVOWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBXHEZTCNZRGU . You are receiving this because you were mentioned.Message ID: @.***>

ccostes avatar Jan 11 '24 20:01 ccostes

I install cmake from brew and execute the cmake command and get the same output.

% file /opt/homebrew/bin/cmake
/opt/homebrew/bin/cmake: Mach-O 64-bit executable arm64

% /opt/homebrew/bin/cmake ..
-- Building MLX for x86_64 processor on Darwin
CMake Warning at CMakeLists.txt:33 (message):
  Building for x86_64 on macOS is not supported.  If you are on an Apple
  silicon system, make sure you are building for arm64.
....

My hombrew installation is also arm64

% brew config
HOMEBREW_VERSION: 4.2.3
ORIGIN: https://github.com/Homebrew/brew
HEAD: b3751bca8c4a3c76107d5aa3b75b81db93cf4bb6
Last commit: 3 days ago
Core tap JSON: 11 Jan 03:57 UTC
Core cask tap JSON: 11 Jan 03:57 UTC
HOMEBREW_PREFIX: /opt/homebrew
HOMEBREW_CASK_OPTS: []
HOMEBREW_MAKE_JOBS: 10
Homebrew Ruby: 3.1.4 => /opt/homebrew/Library/Homebrew/vendor/portable-ruby/3.1.4/bin/ruby
CPU: 10-core 64-bit arm_firestorm_icestorm
Clang: 15.0.0 build 1500
Git: 2.39.3 => /Applications/Xcode.app/Contents/Developer/usr/bin/git
Curl: 8.4.0 => /usr/bin/curl
macOS: 14.2.1-arm64
CLT: 15.1.0.0.1.1700200546
Xcode: 15.2
Rosetta 2: false

JhennerTigreros avatar Jan 11 '24 20:01 JhennerTigreros

Can you share the output of clang -v and g++ -v ?

awni avatar Jan 11 '24 21:01 awni

Sure,

% clang -v
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Target: arm64-apple-darwin23.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
% g++ -v
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Target: arm64-apple-darwin23.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

JhennerTigreros avatar Jan 11 '24 21:01 JhennerTigreros

Lol what is going on !!!

awni avatar Jan 11 '24 21:01 awni

I have to read more about what triggers this flag CMAKE_HOST_SYSTEM_PROCESSOR as we clearly do not understand it well enough. Clearly it thinks you are building for x86 but all of your defaults look like they are for arm so I am super confused.

awni avatar Jan 11 '24 21:01 awni

On macOS, the value of uname -m is used by default. However, since this might vary based on whether you're using x86 or ARM CMake, version 3.19.2+ will use the value of CMAKE_APPLE_SILICON_PROCESSOR (either CMake or environment variable) instead, if it is set. It also normalizes Power Macintosh to powerpc.

https://stackoverflow.com/questions/70475665/what-are-the-possible-values-of-cmake-system-processor

awni avatar Jan 11 '24 21:01 awni

Can you try compiling with:

env CMAKE_APPLE_SILICON_PROCESSOR=arm64 /opt/homebrew/bin/cmake ..
make -j

awni avatar Jan 11 '24 21:01 awni

I was reading a bit about this last night, but does not find any solution to the problem. I tried to reinstall all my dependencies but doesn't work, also I try to compile with the default bash terminal.

Can you try compiling with:

env CMAKE_APPLE_SILICON_PROCESSOR=arm64 /opt/homebrew/bin/cmake ..
make -j

Gives me the same output related to building the code on x86 🤔 . If I force set env variable set(CMAKE_HOST_SYSTEM_PROCESSOR "arm64") in the CMakeLists.txt config file, it works

% cmake ..
-- Building MLX for arm64 processor on Darwin
-- Building METAL sources
-- Building with SDK for macOS version 14.2

-- Downloading json
-- Using the multi-header code from /Users/jhennertigreros/Documents/Personal/Code/mlx/build/_deps/json-src/include/
-- Accelerate found /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/System/Library/Frameworks/Accelerate.framework
CMake Deprecation Warning at build/_deps/doctest-src/CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.


-- Configuring done (10.2s)
-- Generating done (0.1s)
-- Build files have been written to: /Users/jhennertigreros/Documents/Personal/Code/mlx/build

but the building process got some errors.

% make -j
/Users/xxxx/Documents/Personal/Code/mlx/mlx/backend/accelerate/softmax.cpp:78:3: error: unknown type name 'int16x8_t'; did you mean 'int16_t'?
  int16x8_t epart = vcvtq_s16_f16(ipart);
  ^~~~~~~~~
  int16_t
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk/usr/include/sys/_types/_int16_t.h:30:33: note: 'int16_t' declared here
typedef short                   int16_t;
                                ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
make[2]: *** [CMakeFiles/mlx.dir/mlx/backend/accelerate/softmax.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/mlx.dir/all] Error 2
make: *** [all] Error 2

JhennerTigreros avatar Jan 11 '24 21:01 JhennerTigreros

Digging in my current cmake --system-informationI found that the variable has the correct value

% cmake --system-information
....
CMAKE_HOST_SYSTEM_NAME "Darwin"
CMAKE_HOST_SYSTEM_PROCESSOR "arm64"
CMAKE_HOST_SYSTEM_VERSION "23.2.0"
...

JhennerTigreros avatar Jan 12 '24 03:01 JhennerTigreros

Could you try running with make VERBOSE=1 just to see the actual compile commands. It will spit out a bunch of stuff, but just one of them would be interesting to see, something like:

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -DACCELERATE_NEW_LAPACK -DMETAL_PATH="/Users/awni/mlx/build/mlx/backend/metal/kernels//mlx.metallib" -D_METAL_ -I/Users/awni/mlx/build/_deps/metal_cpp-src -I/Users/awni/mlx/build/_deps/json-src/single_include/nlohmann -I/Users/awni/mlx/build/_deps/gguflib-src -I/Users/awni/mlx -O3 -DNDEBUG -std=gnu++17 -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk -fPIC -MD -MT CMakeFiles/mlx.dir/mlx/linalg.cpp.o -MF CMakeFiles/mlx.dir/mlx/linalg.cpp.o.d -o CMakeFiles/mlx.dir/mlx/linalg.cpp.o -c /Users/awni/mlx/mlx/linalg.cpp

awni avatar Jan 12 '24 04:01 awni

I just restarted my whole system and the cmake command start working without forcing the env variable in CMakeList file, I read a lot of stuff online my best guest is that there was some random cache over the CMAKE_APPLE_SILICON_PROCESSOR variable that set that as blank value (my cmake --system-information display as blank value before restarting) and then cascade the default value to x86_64 for the dependent variables, I suspect that some of my reinstallation attempts override the variable.

Thanks for the time dropped here @awni

JhennerTigreros avatar Jan 12 '24 04:01 JhennerTigreros

Wow, the age old solution "restart" actually worked here. I'm dumbfounded, but glad it's fixed.

awni avatar Jan 12 '24 04:01 awni