2 tests failed for 0.0.6.dev20231223+f91f450
I just did a git pull and got 0.0.6.dev20231223+f91f450. Then I did:
env CMAKE_BUILD_PARALLEL_LEVEL="" pip install -e .
pip install ".[testing]" python -m unittest discover python/tests
And I got failures:
======================================================================
FAIL: test_log_cosh_loss (test_nn.TestNN)
Traceback (most recent call last):
  File ".../mlx/python/tests/test_nn.py", line 600, in test_log_cosh_loss
    self.assertEqual(loss, 0.433781)
AssertionError: array(0.433781, dtype=float32) != 0.433781

======================================================================
FAIL: test_scans (test_ops.TestOps)
Traceback (most recent call last):
  File ".../mlx/python/tests/test_ops.py", line 1307, in test_scans
    self.assertTrue(mx.array_equal(c1[:, :, :-1], c2[:, :, 1:]))
AssertionError: array(false, dtype=bool) is not true
Ran 193 tests in 30.447s
FAILED (failures=2)
Could you share your OS?
For test_nn.py line 600, this test appears to have 3 problems (see the sketch after this list):
- the loss returns an array of shape []; is this intended?
- calling .item(), as in pytorch, makes it return a float. If the first point is intended, then the test should call .item(), but it currently doesn't.
- a strict equality check on a float may not be robust; suggest using np.isclose or the mx equivalent.
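For illustration, a minimal sketch (not the actual test code, and assuming mx.allclose is available in this build) of what a more tolerant assertion could look like; the loss value here is just a stand-in for whatever the loss call returns:

import unittest
import mlx.core as mx

class TestLogCoshTolerance(unittest.TestCase):
    def test_log_cosh_loss_tolerant(self):
        # Stand-in for the value produced by the log-cosh loss under test.
        loss = mx.array(0.433781, dtype=mx.float32)

        # Option 1: extract a Python float and compare with a tolerance.
        self.assertAlmostEqual(loss.item(), 0.433781, places=5)

        # Option 2: compare as arrays with an explicit tolerance.
        self.assertTrue(mx.allclose(loss, mx.array(0.433781)))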
I am just starting out and hoping to get some real project running, so I will likely stay away from the bleeding edge for a bit.
But if the mlx team steps in, I would be happy to help fix this, as motivation to get up to speed.
Hmm, the same test failed for me also. Curious that the CI didn't catch it..
I don't see the scan failure. Is that one consistently failing for you?
@awni
OS: macOS 13.5 (22G74)
and yes, I know I should start upgrading.
> Hmm, the same test failed for me also. Curious that the CI didn't catch it..
> I don't see the scan failure. Is that one consistently failing for you?
I only ran the tests once before and they passed; that was before today's git pull, I believe last week. Back then I had run the linear regression and MLP (MNIST) tutorials and they seemed OK.
OK, I don't know the details, but mlx probably does some operator overloading; it looks like you can compare an array with shape [] against a plain float using "==". But since this stuff is float32, the test did something dicey by using an exact equality check. I did this instead and it worked:
np.allclose(loss, 0.433781)
so this is likely not a big issue; it's a precision / type-mismatch kind of failure.
But I'm not sure about the 2nd test that failed; I'll check it out later.
Yes, the second one is more concerning for me. I can't reproduce it, but I'm hoping it's a consistent failure for you; otherwise there may be some nondeterminism, which always makes things trickier.
> Yes, the second one is more concerning for me. I can't reproduce it, but I'm hoping it's a consistent failure for you; otherwise there may be some nondeterminism, which always makes things trickier.
I just reran:
python -m unittest discover python/tests
and indeed, this is not deterministic. This time only one test failed:
mlx/python/tests/test_nn.py", line 600, in test_log_cosh_loss self.assertEqual(loss, 0.433781) AssertionError: array(0.433781, dtype=float32) != 0.433781
Ran 193 tests in 23.375s
FAILED (failures=1)
OK, that bunch of related tests probably all suffer from this nondeterminism. I just ran it again and this time something else failed (different from before):
Traceback (most recent call last): File "mlx/python/tests/test_ops.py", line 1318, in test_scans self.assertTrue(mx.array_equal(c1, c2)) AssertionError: array(false, dtype=bool) is not true
To add more context if you want to debug: my hardware is an M2 Max with 96 GB.
I found that the probability of failure goes up if I am running some model inference code in a VSCode Jupyter notebook, such that the GPU is apparently fully engaged. I'm just speculating that this may happen when GPU load or memory is under stress?
If I don't have such a workload running, I actually am not able to reproduce the test_ops.py failure even once.
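Something like this rough sketch (shapes, sizes, and tolerances here are made up, not taken from test_scans) could be run alongside the heavy notebook workload to see how often the scan drifts outside tolerance:

# Rough repro sketch: repeatedly compare mlx's cumsum against numpy's while a
# heavy GPU workload runs elsewhere. Shapes and tolerances are arbitrary.
import mlx.core as mx
import numpy as np

failures = 0
runs = 100
for _ in range(runs):
    a_np = np.random.rand(4, 32, 32).astype(np.float32)
    c_np = np.cumsum(a_np, axis=-1)
    c_mx = mx.cumsum(mx.array(a_np), axis=-1)
    if not np.allclose(c_np, np.array(c_mx), atol=1e-4, rtol=1e-4):
        failures += 1

print(f"{failures}/{runs} runs outside tolerance")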
Interesting. Is it the same test that's failing for you? The slice test?
Namely:
Traceback (most recent call last):
File ".../mlx/python/tests/test_ops.py", line 1307, in test_scans
self.assertTrue(mx.array_equal(c1[:, :, :-1], c2[:, :, 1:]))
AssertionError: array(false, dtype=bool) is not true
> Interesting. Is it the same test that's failing for you? The slice test?
> Namely:
> Traceback (most recent call last):
>   File ".../mlx/python/tests/test_ops.py", line 1307, in test_scans
>     self.assertTrue(mx.array_equal(c1[:, :, :-1], c2[:, :, 1:]))
> AssertionError: array(false, dtype=bool) is not true
It's a different one.
======================================================================
FAIL: test_scans (test_ops.TestOps)
Traceback (most recent call last):
  File ".../mlx/python/tests/test_ops.py", line 1318, in test_scans
    self.assertTrue(mx.array_equal(c1, c2))
AssertionError: array(false, dtype=bool) is not true
CC @angeloskath, looks like we might have a non-deterministic failure in the scan.
@kechan do you see this when you run the test standalone or just when running the full python test suite?
Thanks for your help investigating it!
FWIW I am able to get the test on line 1302 to fail somewhat regularly. I inspected the differences and it looks like it's a numerical tolerance thing; it usually fails for indices deeper in the scan and the differences are small.
For example these two fail with atol=1e-4, rtol=1e-4:
np=-0.988324761390686
mlx=-0.9881017208099365
It never fails if I change the tolerances to 1e-3..
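For concreteness, a quick check (numbers copied from above) of how that pair sits against the two tolerance settings:

# How the reported pair compares under atol=rtol=1e-4 vs atol=rtol=1e-3
# (values copied from the comment above).
import numpy as np

ref = -0.988324761390686      # numpy value
val = -0.9881017208099365     # mlx value

print(abs(val - ref))                              # ~2.23e-4
print(np.isclose(val, ref, atol=1e-4, rtol=1e-4))  # False: 2.23e-4 > 1e-4 + 1e-4 * |ref| ~ 1.99e-4
print(np.isclose(val, ref, atol=1e-3, rtol=1e-3))  # True:  2.23e-4 < 1e-3 + 1e-3 * |ref| ~ 1.99e-3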
> It never fails if I change the tolerances to 1e-3..
I tried "python -m unittest test_ops.py" a few times without that heavy background workload, and its always ok now:
Ran 76 tests in 0.879s
OK
1e-3 looks like a very big tolerance. Since the tests pass 100% of the time when I don't have that inference workload running, it just doesn't smell right. Although I don't know the details of the tests, they seem to involve cumulative math ops, which will probably surface precision issues much more sensitively. Also, I am able to reproduce it very regularly (almost with certainty) if I have heavy model inference running on the MPS/GPU in a Jupyter notebook.
> Since the tests pass 100% of the time when I don't have that inference workload running, it just doesn't smell right.
This was not the case for me. It fails about 50% of the time with no heavy workload 🤔
> Although I don't know the details of the tests, they seem to involve cumulative math ops, which will probably surface precision issues much more sensitively.
Yes, it's a cumulative sum that's failing, which is an operation that's pretty sensitive to the order of accumulation (and that order will be quite different for the parallel version).
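As a tiny illustration (a made-up example, not the test itself) of how much the accumulation order/algorithm alone can move a float32 result:

# Compare a sequential float32 accumulation (cumsum) and numpy's pairwise
# float32 summation against a float64 reference.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

sequential = np.cumsum(x)[-1]            # left-to-right float32 accumulation
pairwise = x.sum()                       # numpy's pairwise float32 summation
reference = x.astype(np.float64).sum()   # higher-precision reference

print(sequential - reference, pairwise - reference)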
> 1e-3 looks like a very big tolerance.
Yeah, we should do some back-of-the-envelope analysis to determine what's reasonable here. It might be too big, I'm not sure. The way it fails looks more like an accumulation of rounding errors (or some other systematic issue): it's often just under the tolerance and often just over, but never far off.
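Something like this rough sketch (an assumed error model, not a rigorous analysis) gives a feel for how the tolerances compare to plausible float32 accumulation error over longer scans:

# Back-of-the-envelope sketch: for a float32 scan over n elements of
# magnitude ~1, sequential accumulation has a worst-case relative error on
# the order of n * eps and a random-walk estimate on the order of
# sqrt(n) * eps.
import numpy as np

eps = np.finfo(np.float32).eps   # ~1.19e-7
for n in (1_000, 10_000, 100_000):
    print(n, "worst-case ~", n * eps, "random-walk ~", np.sqrt(n) * eps)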