cloud11665
installing pycuda and compilation now take up most of the time :)
Installing deps takes so long because of pycuda, which is not installed in any other test. As for moving that into core tinygrad, should it live in ops_cuda or some...
Sub 8 minutes!
Does the CI config file count? That alone is about 50.
Simply casting the true and false branches of the ternary statement got rid of the issue. It also passes `test/test_dtype.py::TestHalfDtype::test_int8_matmul_upcast_half`. The only test left is the stupid int8 -> uint8 saturation test: test/test_dtype.py .....................................F........
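For anyone following along, here is a rough sketch of what "casting both branches" could look like on the codegen side. `render_where` is a hypothetical helper made up for illustration, not the actual tinygrad function, and the operand names are placeholders. The idea is only that both sides of the `?:` get wrapped in the same explicit cast, so the generated code does not depend on implicit conversions between mixed operand types.

```python
def render_where(cond: str, a: str, b: str, dtype: str) -> str:
    """Render a C-style ternary with both branches cast to `dtype`.

    Hypothetical helper for illustration only; the real code generator
    may structure this differently.
    """
    # Cast the true branch and the false branch to the same type so the
    # ternary never has to pick a common type for mixed operands.
    return f"({cond} ? ({dtype})({a}) : ({dtype})({b}))"


if __name__ == "__main__":
    # Placeholder operand names, chosen only for the example.
    print(render_where("x > 0", "val0", "0.0f", "half"))
```

The cast type is passed in as a plain string so the same helper could emit `half`, `float`, or anything else the backend needs.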
Oops, I was also casting to float4, and given its implementation quirks, that broke the OpenCL tests.
I'm getting 1100 mspt on a 3090 with `JIT=1 OPT=4 OPTLOCAL=2` on CUDA and 180 mspt on OpenCL. It's not looking too good for CUDA atm. I'll have to investigate it further,...
Should the half4 stuff be inlined into CUDAProgram then? Imo it's not worth it, as that'd make the output much noisier. On the other hand, we could solve...
Is this tiny enough?
Oh, but wasn't the ignoring of casts a known issue?