Re-enable apex backward / phantom grad tests without wrecking CI
I'm disabling the apex tests because they use an excessive amount of memory. I'm not sure whether the cause is the tests themselves or a bug in the executor. I'd be happy to see them re-enabled, either with lower memory consumption or run without parallelism.
test cuda memory use test_apex_cross_entropy_backward[cuda-float16] 4 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-bfloat16] 5 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-float32] 6 memory 2.872036933898926
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float16] 7 memory 8.822380542755127
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-bfloat16] 8 memory 9.014111042022705
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float32] 9 memory 9.401249408721924
cc @borda @crcrpar
What do these numbers mean? What are the units? How much GPU memory is acceptable to use in a test?
The numbers are GPU memory use in GB. We can run tests needing 8GB, but we should not run them in the parallel setup. For comparison, in #394 I listed all tests (of 7000) that need > 0.6GB of GPU memory.
It would be cool if someone (else) could look into whether we need that much memory for softmax testing. If so, we can move the apex / triton cross-entropy tests to the "network tests" section of the GPU run (or do this via tagging).
triage review:
- these tests probably don't need to use so much memory, and the amount can be reduced
- if the memory is needed, we can mark them to be executed serially
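If the serial route is taken, one possible shape for it is a `conftest.py` hook that tags the two apex test families with a marker, so the parallel run can deselect them. This is only a sketch: the marker name `serial` and the prefix list are assumptions for illustration, not something the repository defines. `pytest_configure`, `pytest_collection_modifyitems`, `config.addinivalue_line` and `item.add_marker` are standard pytest hooks and methods.

```python
# conftest.py (sketch)

# Assumption: these prefixes match the test names from the issue description.
HIGH_MEMORY_PREFIXES = (
    "test_apex_cross_entropy_backward",
    "test_apex_cross_entropy_phantom_grad",
)

# Hypothetical marker name chosen for this example.
SERIAL_MARKER = "serial"


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers",
        f"{SERIAL_MARKER}: high GPU memory test, run without parallelism",
    )


def pytest_collection_modifyitems(config, items):
    # item.name includes the parametrization suffix, e.g.
    # "test_apex_cross_entropy_backward[cuda-float16]", so match on the prefix.
    for item in items:
        if item.name.startswith(HIGH_MEMORY_PREFIXES):
            item.add_marker(SERIAL_MARKER)
```

With something like this in place, the parallel job could run `pytest -n 4 -m "not serial"` (the `-n` option comes from pytest-xdist, and the worker count here is arbitrary), followed by a separate `pytest -m serial` step without `-n`.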