
[BUG] bus error

Open francescofarina opened this issue 1 year ago • 22 comments

Describe the bug When attempting to run the tests, for either C++ or Python, I get a generic bus error. It may well be a local problem on my machine, but it'd be great to rule that out, as I've tried re-installing from scratch and the problem persists.

To Reproduce C++: cd build && make test

Python: python -m unittest discover python/tests

Expected behavior A clear and concise description of what you expected to happen.

Desktop:

  • OS Version: MacOS 14.2.1
  • Version: 0.0.7 (currently at 6ea6b42)

Additional context

  • Xcode: 15.1
  • cmake: 3.28.1
  • Chip: M3 Pro

francescofarina avatar Jan 07 '24 17:01 francescofarina

🤔 I do not see a bus error but I have some differences in my setup:

  • Tried the commit you pointed (same)
  • OS 14.2.1 (same as you)
  • Chip M1 Max (possibly something odd with M3 pro..)
  • Xcode 15.1 (same as you)
  • cmake version 3.24.2 (seems unlikely to be an issue)

awni avatar Jan 07 '24 17:01 awni

I wiped the build and tried with cmake 3.28.1 and I still see no issue.

So either something else is off with your env or we have an issue with the M3 Pro.

awni avatar Jan 07 '24 17:01 awni

Ok, interesting.

Let's see if someone else with an M3 Pro can give it a try.

francescofarina avatar Jan 07 '24 17:01 francescofarina

Is it just the tests? Are you able to import mlx and do any ops?

What if you try from PyPI? Does that package work for you?

Honestly I would be very surprised if it had to do with the M3, as I know MLX is being used regularly there (unless we changed something recently, since release 0.0.7).

awni avatar Jan 07 '24 17:01 awni

Yes, I am. I'm trying to figure out whether there's some operation that's causing the problem with no luck yet.

francescofarina avatar Jan 07 '24 17:01 francescofarina

So the tril op seems to consistently cause a bus error. I tested it manually, and all the tests up to this one pass: https://github.com/ml-explore/mlx/blob/449b43762e3f970576f054e54066123c0f37246e/python/tests/test_ops.py#L331. It is the first test in test_ops.py to cause a bus error.

Simply running

import mlx.core as mx

mx.tril(mx.zeros([1]))

results in a bus error.

francescofarina avatar Jan 07 '24 17:01 francescofarina

Wow that is so strange..

I get this: ValueError: [tril] array must be at least 2-D

awni avatar Jan 07 '24 17:01 awni

Is it just that case? Can you do mx.tril(mx.ones((10, 10)))?

awni avatar Jan 07 '24 17:01 awni

Also what's your Pybind11 version? python -c "import pybind11; print(pybind11.__version__)"

awni avatar Jan 07 '24 17:01 awni

Is it just that case? Can you do mx.tril(mx.ones((10, 10)))?

That works actually! Looks like the problem may be with raising exceptions/throwing errors - the test I pointed out above is the first one to expect an exception.
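One way to tell a clean Python exception apart from a hard crash is to run each snippet in a child interpreter and look at how it exited. The following is only a minimal sketch, not part of MLX: the helper name `classify` is made up for illustration, and it uses nothing beyond the standard library. On POSIX, a negative `returncode` from `subprocess` means the child was killed by that signal, and `-X faulthandler` asks CPython to dump a traceback on fatal signals such as SIGBUS.

```python
import signal
import subprocess
import sys


def classify(code: str) -> str:
    """Run `code` in a fresh interpreter and describe how it exited.

    Hypothetical helper for illustration; not an MLX API.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # POSIX only: the child was terminated by signal number -returncode.
        return "killed by " + signal.Signals(-proc.returncode).name
    # An uncaught Python exception makes the interpreter exit with a positive code.
    return "python exception or non-zero exit"


if __name__ == "__main__":
    # The two calls discussed in this thread; they need mlx to be installed.
    for snippet in (
        "import mlx.core as mx; mx.tril(mx.ones((10, 10)))",
        "import mlx.core as mx; mx.tril(mx.zeros([1]))",
    ):
        print(snippet, "->", classify(snippet))
```

On a healthy setup the second snippet should be reported as an exception (the ValueError quoted above) rather than as killed by a signal.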

francescofarina avatar Jan 07 '24 17:01 francescofarina

python -c "import pybind11; print(pybind11.__version__)"

2.11.1

francescofarina avatar Jan 07 '24 17:01 francescofarina

But you see the same problem from C++ only right? So it seems unlikely to be a binding issue 🤔

Can you also see the output of:

  • uname -m (should be arm)
  • How did you install cmake? Via brew? What does which cmake print?

I have seen some funny issues when using Rosetta to do the x86 translation so want to be sure everything is arm / native.
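The same two checks can be done from Python, which also confirms the architecture of the interpreter itself rather than of the shell. This is a rough sketch under stated assumptions: `describe_env` is a hypothetical helper, and the path check relies on Homebrew's default prefixes (`/opt/homebrew` on Apple silicon, `/usr/local` on Intel), so a custom prefix would not be recognised.

```python
import platform
import shutil


def describe_env(machine, cmake_path):
    """Return human-readable notes about the architecture and the cmake location.

    Hypothetical helper for illustration; the prefix checks assume default
    Homebrew install locations.
    """
    notes = ["interpreter arch: " + machine]
    if machine != "arm64":
        # On Apple silicon, an x86_64 interpreter would be running under Rosetta.
        notes.append("not arm64: possibly an x86_64 Python under Rosetta")
    if cmake_path is None:
        notes.append("cmake not found on PATH")
    elif cmake_path.startswith("/opt/homebrew/"):
        notes.append("cmake from the Apple silicon Homebrew prefix")
    elif cmake_path.startswith("/usr/local/"):
        notes.append("cmake from /usr/local: possibly an Intel Homebrew install")
    else:
        notes.append("cmake at " + cmake_path)
    return notes


if __name__ == "__main__":
    # platform.machine() reports the architecture the interpreter runs as;
    # shutil.which mirrors the shell's `which cmake`.
    for note in describe_env(platform.machine(), shutil.which("cmake")):
        print(note)
```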

awni avatar Jan 07 '24 17:01 awni

But you see the same problem from C++ only right? So it seems unlikely to be a binding issue 🤔

Yes that's the weird part.

Can you also see the output of:

  • uname -m

arm64

  • How did you install cmake? Via brew which cmake?

via brew: /opt/homebrew/bin/cmake

francescofarina avatar Jan 07 '24 17:01 francescofarina

FWIW this is what I get when running make test

Test project <path_to_mlx>/build
    Start 1: tests
1/1 Test #1: tests ............................Bus error***Exception:   0.04 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   0.05 sec

The following tests FAILED:
	  1 - tests (Bus error)
Errors while running CTest
Output from these tests are in: <path_to_mlx>/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
make: *** [test] Error 8

francescofarina avatar Jan 07 '24 17:01 francescofarina

I get this: ValueError: [tril] array must be at least 2-D

I am also getting this error. I am working on M2.

gboduljak avatar Jan 07 '24 20:01 gboduljak

Thanks @gboduljak. I'm admittedly a little stumped by this one. Would be good to check on an M3 Pro to verify that has nothing to do with it.. (@jagrit06 might be able to help there).

awni avatar Jan 07 '24 20:01 awni

FYI I tried to uninstall and reinstall Xcode, cmake and gcc and then clone the clean repo, build and test

git clone git@github.com:ml-explore/mlx.git mlx && cd mlx
mkdir -p build && cd build
cmake .. && make -j
make test

Still getting the same bus error reported above.

francescofarina avatar Jan 07 '24 20:01 francescofarina

There is a simple test case in this thread. Maybe if you have a second you can play around with it and see if it also gives you a bus error. That would strongly suggest something is borked in your environment.

awni avatar Jan 08 '24 00:01 awni

There is a simple test case in this thread. Maybe if you have a second you can play around with it and see if it also gives you a bus error. That would strongly suggest something is borked in your environment.

That actually works fine for me.

For reproducibility: I tried compiling via

/usr/bin/clang++ -shared -stdlib=libc++ -std=c++17 -undefined dynamic_lookup $(python3 -m pybind11 --includes) test.cpp -o test$(python3-config --extension-suffix)

and

c++ -shared -std=c++17 -undefined dynamic_lookup $(python3 -m pybind11 --includes) test.cpp -o test$(python3-config --extension-suffix)

And I always get the correct output when running the python test.

Note: with mlx I get a bus error not a segmentation fault.

francescofarina avatar Jan 08 '24 11:01 francescofarina

@jagrit06 ran the tests on an M3 Max and cannot repro the bus error. There were some numerical issues, which are fixed in #401, but they are unrelated to the bus error :\

awni avatar Jan 08 '24 17:01 awni

Thanks @awni, then unless there's some unlikely difference between M3 Pro and Max I guess this is a local issue. I'll continue to explore what's the cause and post any solution in case someone else faces the same problem at some point.

francescofarina avatar Jan 08 '24 18:01 francescofarina

I am exceedingly curious..

awni avatar Jan 08 '24 18:01 awni

@francescofarina can we close this? Were you ever able to build / run the tests locally?

awni avatar Mar 06 '24 15:03 awni

Yes! I'm still not sure what the problem was but it works well now.

francescofarina avatar Mar 06 '24 15:03 francescofarina

Glad to hear it!

awni avatar Mar 06 '24 15:03 awni