Test passes from command line, but fails in Xcode

When I run the tests from the command line with make test, the report says that all tests have passed.

However, when I run the tests from within Xcode, the very first test fails because it attempts to allocate a zero-length buffer.

It seems surprising to me that the test's attempt to allocate a zero-length buffer would sometimes be acceptable and sometimes not be acceptable.

Dec 19 '23 14:12 dougdew64

Also, it seems that it's not possible to progress to the second test when the first test fails. Instead, every time I press the Continue button, Xcode simply reports the same exception again. The exception does not seem to propagate up to the try/catch block in doctest.h.

Dec 19 '23 14:12 dougdew64

The exception is being thrown by the debug MTL GPU device. I'd assume that the debug device performs stricter error checking than the release device. Is it possible that the release device is used when running make test?

I'll attempt to run tests in Xcode against a release build.

Dec 19 '23 15:12 dougdew64

If I filter the tests to run only the argmin and argmax edge case tests in arg_reduce_tests.cpp, then I get the same error that I reported for my own code in a different issue (https://github.com/ml-explore/mlx/issues/214).

Dec 19 '23 15:12 dougdew64

Debugging a bit more, it seems that the logic in primitives.cpp is requesting that MTL set a zero-length buffer:

Dec 20 '23 17:12 dougdew64

Although the MTL docs don't say so, I'm guessing that the debug MTL device performs extra validation and refuses to create a buffer when the request is for a zero-length buffer.

https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/1443159-setbytes

Dec 20 '23 17:12 dougdew64

My hunch that this is a difference in the verification performed by different MTL devices (debug versus release) seems to be wrong. That shape buffer is not an unused buffer that might have been receiving unwanted verification from the debug MTL device. Instead, that shape buffer is actually used in arg_reduce.metal, in the elem_to_loc function which is ultimately called by the kernel.

At this point, I'm out of ideas.

@awni I'm very much looking forward to reading your analysis of how to solve this problem so that I can get unblocked.

Dec 20 '23 18:12 dougdew64

@awni I just noticed that shape never actually gets used in elem_to_loc for argmax or argmin because ndim is always zero for those two functions, so the loop condition in elem_to_loc never evaluates to true.

So, it actually is possible that the failures which I reported are due to verification of buffer arguments being performed by a debug MTL device that is not being performed by the MTL device which is being used when running make test.
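
To illustrate the loop observation above, here is a minimal host-side sketch of an elem_to_loc-style helper. It is not the code from arg_reduce.metal: the name elem_to_loc_sketch, its signature, and its types are assumptions made for illustration only. It shows that when ndim is zero the loop body never runs, so the shape and strides pointers are never read.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for an elem_to_loc-style helper.
// The name, signature, and types are assumptions, not the MLX source.
inline std::size_t elem_to_loc_sketch(
    std::size_t elem,
    const int* shape,
    const std::size_t* strides,
    int ndim) {
  assert(ndim >= 0);
  std::size_t loc = 0;
  // Walk the dimensions from innermost to outermost.
  // When ndim == 0, the condition i >= 0 is false on the first check,
  // so shape and strides are never dereferenced.
  for (int i = ndim - 1; i >= 0; --i) {
    const std::size_t dim = static_cast<std::size_t>(shape[i]);
    loc += (elem % dim) * strides[i];
    elem /= dim;
  }
  return loc;
}
```

Under these assumptions, a call such as elem_to_loc_sketch(5, nullptr, nullptr, 0) returns 0 without touching either pointer. That is consistent with an empty shape buffer being harmless to the kernel itself while still being rejected by a validation layer that checks the arguments before the kernel runs.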

Dec 20 '23 19:12 dougdew64

I'm attempting a test with a release build in Xcode now to confirm...

Dec 20 '23 19:12 dougdew64

@awni my hunch about MTL verification was correct. It turns out that there's a setting in Xcode to enable or disable MTL validation. With that setting enabled, my code which calls MLX's argmax fails. With it disabled, my code succeeds.

I'm guessing that make test runs without MTL validation, so the tests all pass. It would be interesting to enable MTL validation when running those tests to see how many of them fail.

Dec 20 '23 19:12 dougdew64

@dougdew64 sorry, I somehow missed this thread 😓. There is an issue filed (#410) on this exact problem. I think the warnings in this case are about benign issues, but I still plan to get validation to pass for future Xcode users.

Jan 10 '24 05:01 awni

This is fixed.

Mar 06 '24 15:03 awni
