mlx
Test passes from command line, but fails in Xcode
When I run tests from the command line with make test, I receive a report that all tests have passed.
However, when I attempt to run tests from within Xcode, the very first test fails due to the test's attempt to allocate a zero-length buffer.
It seems surprising to me that the test's attempt to allocate a zero-length buffer would sometimes be acceptable and sometimes not be acceptable.
Also, it seems that it's not possible to progress to the second test when the first test fails. Instead, every time that I press the Continue button, Xcode simply reports the same exception again. The exception seems not to be thrown up to the try/catch block in doctest.h.
The exception is being thrown by the debug MTL gpu device. I'd assume that the debug device would perform stricter error checking than the release device. Is it possible that when running make test the release device is used?
I'll attempt to run tests in Xcode against a release build.
If I filter the tests to run only the argmin and argmax edge case tests which are in arg_reduce_tests.cpp, then I experience the same error which I had reported in my own code in a different issue (https://github.com/ml-explore/mlx/issues/214).
Debugging a bit more, it seems that the logic in primitives.cpp is requesting that MTL set a zero-length buffer:
Although the MTL docs don't say so, I'm guessing that the debug MTL device performs extra validation which decides to not create a buffer when the request is for a zero-length buffer.
https://developer.apple.com/documentation/metal/mtlcomputecommandencoder/1443159-setbytes
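To make the guess above concrete, here is a minimal CPU-only sketch of one way a caller could avoid ever passing a zero length to a setBytes-style call. FakeEncoder and set_vector_bytes are hypothetical names invented for illustration; they are not MLX or Metal APIs, and this is not necessarily how MLX handles the case.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Metal compute encoder. It only records the
// byte length of each set_bytes call so the guard can be exercised on the CPU.
struct FakeEncoder {
  std::vector<std::size_t> lengths;

  void set_bytes(const void* /*data*/, std::size_t length, int /*index*/) {
    lengths.push_back(length);
  }
};

// Binds `values` at `index`. When `values` is empty, a single
// zero-initialized placeholder element is bound instead, so the length
// passed to the encoder is never zero.
template <typename T>
void set_vector_bytes(FakeEncoder& enc, const std::vector<T>& values, int index) {
  if (values.empty()) {
    T placeholder{};
    enc.set_bytes(&placeholder, sizeof(T), index);
  } else {
    enc.set_bytes(values.data(), values.size() * sizeof(T), index);
  }
}
```

The placeholder is only safe if the kernel never reads the bound argument in the empty case.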
My hunch about this being a difference in the verification performed by different MTL devices (debug versus release) seems to be wrong. That shape buffer is not an unused buffer which might have been receiving unwanted verification by the debug MTL device. Instead, that shape buffer is actually being used in arg_reduce.metal, and in the elem_to_loc function which is ultimately called by the kernel.
At this point, I'm out of ideas.
@awni I'm very much looking forward to reading your analysis of how to solve this problem so that I can get unblocked.
@awni I just noticed that shape never actually gets used in elem_to_loc for argmax or argmin because ndim is always zero for those two functions, so the loop condition in elem_to_loc never evaluates to true.
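To illustrate the point, here is a CPU-side C++ sketch of an elem_to_loc-style function. The signature and loop body are my assumption about its general shape; the real function lives in MLX's Metal sources and may differ. With ndim equal to zero, the loop body never runs, so shape and strides are never dereferenced.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch (assumed shape, not copied from MLX): maps a flat element index to
// a memory offset by walking the dimensions from innermost to outermost.
inline std::size_t elem_to_loc(
    std::uint32_t elem,
    const int* shape,
    const std::size_t* strides,
    int ndim) {
  std::size_t loc = 0;
  // When ndim == 0, i starts at -1 and the condition fails immediately,
  // so neither shape nor strides is ever read.
  for (int i = ndim - 1; i >= 0; --i) {
    loc += (elem % shape[i]) * strides[i];
    elem /= shape[i];
  }
  return loc;
}
```

Calling elem_to_loc(5, nullptr, nullptr, 0) is therefore safe in this sketch and returns 0, which is consistent with the shape argument being bound but unused.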
So, it actually is possible that the failures which I reported are due to verification of buffer arguments being performed by a debug MTL device that is not being performed by the MTL device which is being used when running make test.
I'm attempting a test with a release build in Xcode now to confirm...
@awni my hunch about MTL verification was correct. It turns out that there's a setting in Xcode to enable / disable MTL validation. If I have that setting enabled, then my code which calls MLX's argmax fails. If I disable that setting, then my code succeeds.
I'm guessing that make test runs without MTL validation, so the tests all pass. It would be interesting to enable MTL validation when running those tests to see how many tests fail.
@dougdew64 sorry I somehow missed this thread 😓. There is an issue filed (#410) on this exact problem. I think the warnings in this case are about benign issues, but I still plan to get validation to pass for future Xcode users.
This is fixed.