2 tests failed for 0.0.6.dev20231223+f91f450
I just did a git pull and got 0.0.6.dev20231223+f91f450. Then I did:
env CMAKE_BUILD_PARALLEL_LEVEL="" pip install -e .
pip install ".[testing]" python -m unittest discover python/tests
And I got failures:
======================================================================
FAIL: test_log_cosh_loss (test_nn.TestNN)
Traceback (most recent call last):
  File ".../mlx/python/tests/test_nn.py", line 600, in test_log_cosh_loss
    self.assertEqual(loss, 0.433781)
AssertionError: array(0.433781, dtype=float32) != 0.433781

======================================================================
FAIL: test_scans (test_ops.TestOps)
Traceback (most recent call last):
  File ".../mlx/python/tests/test_ops.py", line 1307, in test_scans
    self.assertTrue(mx.array_equal(c1[:, :, :-1], c2[:, :, 1:]))
AssertionError: array(false, dtype=bool) is not true
Ran 193 tests in 30.447s
FAILED (failures=2)
Could you share your OS?
For test_nn.py line 600, this test appears to have 3 problems (see the sketch after this list):
- the loss returns an array of shape []; is this intended?
- calling .item(), as in pytorch, makes it return a float. If the first point is intended, then the test should call .item(), but it currently doesn't.
- a strict equality check on a float may not be robust; suggest using np.isclose or the mx equivalent.
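For illustration, a minimal sketch (not the actual test code, and assuming mx.allclose is available in this build) of what a more tolerant assertion could look like; the loss value here is just a stand-in for whatever the loss call returns:

import unittest
import mlx.core as mx

class TestLogCoshTolerance(unittest.TestCase):
    def test_log_cosh_loss_tolerant(self):
        # Stand-in for the value produced by the log-cosh loss under test.
        loss = mx.array(0.433781, dtype=mx.float32)

        # Option 1: extract a Python float and compare with a tolerance.
        self.assertAlmostEqual(loss.item(), 0.433781, places=5)

        # Option 2: compare as arrays with an explicit tolerance.
        self.assertTrue(mx.allclose(loss, mx.array(0.433781)))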
I am just starting out and hoping to get some real project running, so I will likely stay away from the bleeding edge for a bit.
But if the mlx team steps in, I would be happy to help fix this, as motivation to get up to speed.
Hmm, the same test failed for me also. Curious that the CI didn't catch it..
I don't see the scan failure. Is that one consistently failing for you?
@awni
OS: macOS 13.5 (22G74)
and yes, I know I should start upgrading.
> Hmm, the same test failed for me also. Curious that the CI didn't catch it..
> I don't see the scan failure. Is that one consistently failing for you?
I only ran the tests once before and they passed; that was before today's git pull, I believe last week. Back then I had run the linear regression and MLP (MNIST) tutorials and they seemed OK.
OK, I don't know the details, but mlx probably does some operator overloading; it looks like you can compare an array with shape [] against a plain float using "==". But since this stuff is float32, the test did something dicey by using an exact equality check. I did this instead and it worked:
np.allclose(loss, 0.433781)
so this is likely not a big issue; it's a precision / type-mismatch kind of failure.
But I'm not sure about the 2nd test that failed; I'll check it out later.
Yes, the second one is more concerning for me. I can't reproduce it, but I'm hoping it's a consistent failure for you; otherwise there may be some nondeterminism, which always makes things trickier.
> Yes, the second one is more concerning for me. I can't reproduce it, but I'm hoping it's a consistent failure for you; otherwise there may be some nondeterminism, which always makes things trickier.
I just reran:
python -m unittest discover python/tests
and indeed, this is not deterministic. This time only one test failed:
mlx/python/tests/test_nn.py", line 600, in test_log_cosh_loss self.assertEqual(loss, 0.433781) AssertionError: array(0.433781, dtype=float32) != 0.433781
Ran 193 tests in 23.375s
FAILED (failures=1)
OK, that bunch of related tests probably all suffer from this nondeterminism. I just ran it again and this time something else failed (different from before):
Traceback (most recent call last): File "mlx/python/tests/test_ops.py", line 1318, in test_scans self.assertTrue(mx.array_equal(c1, c2)) AssertionError: array(false, dtype=bool) is not true
To add more context if you want to debug: my hardware is an M2 Max with 96 GB.
I found that the probability of failure goes up if I am running some model inference code in a VSCode Jupyter notebook, such that the GPU is apparently fully engaged. I'm just speculating that this may happen when GPU load or memory is under stress?
If I don't have such a workload running, I actually am not able to reproduce the test_ops.py failure even once.
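Something like this rough sketch (shapes, sizes, and tolerances here are made up, not taken from test_scans) could be run alongside the heavy notebook workload to see how often the scan drifts outside tolerance:

# Rough repro sketch: repeatedly compare mlx's cumsum against numpy's while a
# heavy GPU workload runs elsewhere. Shapes and tolerances are arbitrary.
import mlx.core as mx
import numpy as np

failures = 0
runs = 100
for _ in range(runs):
    a_np = np.random.rand(4, 32, 32).astype(np.float32)
    c_np = np.cumsum(a_np, axis=-1)
    c_mx = mx.cumsum(mx.array(a_np), axis=-1)
    if not np.allclose(c_np, np.array(c_mx), atol=1e-4, rtol=1e-4):
        failures += 1

print(f"{failures}/{runs} runs outside tolerance")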
Interesting. Is it the same test that's failing for you? The slice test?
Namely:
Traceback (most recent call last):
File ".../mlx/python/tests/test_ops.py", line 1307, in test_scans
self.assertTrue(mx.array_equal(c1[:, :, :-1], c2[:, :, 1:]))
AssertionError: array(false, dtype=bool) is not true
> Interesting. Is it the same test that's failing for you? The slice test?
> Namely:
> Traceback (most recent call last):
>   File ".../mlx/python/tests/test_ops.py", line 1307, in test_scans
>     self.assertTrue(mx.array_equal(c1[:, :, :-1], c2[:, :, 1:]))
> AssertionError: array(false, dtype=bool) is not true
It's a different one.
======================================================================
FAIL: test_scans (test_ops.TestOps)
Traceback (most recent call last):
  File ".../mlx/python/tests/test_ops.py", line 1318, in test_scans
    self.assertTrue(mx.array_equal(c1, c2))
AssertionError: array(false, dtype=bool) is not true
CC @angeloskath, looks like we might have a non-deterministic failure in the scan.
@kechan do you see this when you run the test standalone or just when running the full python test suite?
Thanks for your help investigating it!
FWIW I am able to get the test on line 1302 to fail somewhat regularly. I inspected the differences and it looks like it's a numerical tolerance thing; it usually fails for indices deeper in the scan and the differences are small.
For example these two fail with atol=1e-4, rtol=1e-4:
np=-0.988324761390686
mlx=-0.9881017208099365
It never fails if I change the tolerances to 1e-3..
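For concreteness, a quick check (numbers copied from above) of how that pair sits against the two tolerance settings:

# How the reported pair compares under atol=rtol=1e-4 vs atol=rtol=1e-3
# (values copied from the comment above).
import numpy as np

ref = -0.988324761390686      # numpy value
val = -0.9881017208099365     # mlx value

print(abs(val - ref))                              # ~2.23e-4
print(np.isclose(val, ref, atol=1e-4, rtol=1e-4))  # False: 2.23e-4 > 1e-4 + 1e-4 * |ref| ~ 1.99e-4
print(np.isclose(val, ref, atol=1e-3, rtol=1e-3))  # True:  2.23e-4 < 1e-3 + 1e-3 * |ref| ~ 1.99e-3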
> It never fails if I change the tolerances to 1e-3..
I tried "python -m unittest test_ops.py" a few times without that heavy background workload, and its always ok now:
Ran 76 tests in 0.879s
OK
1e-3 looks like a very big tolerance. Since the tests pass 100% of the time when I don't have that inference workload running, it just doesn't smell right. Although I don't know the details of the tests, they seem to involve cumulative math ops, which will probably surface precision issues much more sensitively. Also, I am able to reproduce it very regularly (almost with certainty) if I have heavy model inference running on the MPS/GPU in a Jupyter notebook.
> Since the tests pass 100% of the time when I don't have that inference workload running, it just doesn't smell right.
This was not the case for me. It fails about 50% of the time with no heavy workload 🤔
> Although I don't know the details of the tests, they seem to involve cumulative math ops, which will probably surface precision issues much more sensitively.
Yes, it's a cumulative sum that's failing, which is an operation that's pretty sensitive to the order of accumulation (and that order will be quite different for the parallel version).
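As a tiny illustration (a made-up example, not the test itself) of how much the accumulation order/algorithm alone can move a float32 result:

# Compare a sequential float32 accumulation (cumsum) and numpy's pairwise
# float32 summation against a float64 reference.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

sequential = np.cumsum(x)[-1]            # left-to-right float32 accumulation
pairwise = x.sum()                       # numpy's pairwise float32 summation
reference = x.astype(np.float64).sum()   # higher-precision reference

print(sequential - reference, pairwise - reference)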
> 1e-3 looks like a very big tolerance.
Yeah, we should do some back-of-the-envelope analysis to determine what's reasonable here. It might be too big, I'm not sure. The way it fails looks more like an accumulation of rounding errors (or some other systematic issue): it's often just under the tolerance and often just over, but never far off.
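Something like this rough sketch (an assumed error model, not a rigorous analysis) gives a feel for how the tolerances compare to plausible float32 accumulation error over longer scans:

# Back-of-the-envelope sketch: for a float32 scan over n elements of
# magnitude ~1, sequential accumulation has a worst-case relative error on
# the order of n * eps and a random-walk estimate on the order of
# sqrt(n) * eps.
import numpy as np

eps = np.finfo(np.float32).eps   # ~1.19e-7
for n in (1_000, 10_000, 100_000):
    print(n, "worst-case ~", n * eps, "random-walk ~", np.sqrt(n) * eps)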