nimlgen
The following code `_ = ((x@y)+(a@b)).numpy()` can benefit from being fused into one kernel. The fusion was measured to gain about 10 GFLOPS (~195 -> ~205). There are several cases where reduceops...
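A minimal sketch of the computation pattern being fused, using numpy as a stand-in for the Tensor expression (the shapes and values here are illustrative assumptions, not from the original):

```python
import numpy as np

# Sketch of the pattern that benefits from fusion: two matmuls whose
# results are summed. Unfused, this launches two matmul kernels plus a
# separate elementwise-add kernel, materializing both intermediates.
N = 64
x, y = np.ones((N, N)), np.ones((N, N))
a, b = np.ones((N, N)), np.ones((N, N))

# Unfused form: (x @ y) and (a @ b) each allocate an intermediate
# buffer before the add runs over them.
out = (x @ y) + (a @ b)

# A fused kernel would accumulate both products into the output in one
# pass, skipping the intermediate buffers entirely.
print(out[0, 0])  # each entry is N + N = 128
```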
This is one more implementation of CUDA graphs, based on the 2 previous MRs. This is a first MR, so it only has basic functionality: creating CUDA graphs plus setting some dynamic...
The graph is now multi-device, though the perf impact is not huge yet (tested on hlb, 2 GPUs). We need to enqueue transfers as well to get the full speedup.
This seems to be faster for small sizes.
I think we can do something like this; it should be faster: no allocation, no copies.
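A sketch of the "no allocation, no copies" idea, assuming the optimization is to write results into a preallocated buffer rather than allocating a fresh one per call (buffer names and shapes are hypothetical):

```python
import numpy as np

# Idea: allocate the output buffer once and reuse it on every call,
# so the hot path performs no allocation and no copy.
N = 256
a = np.random.rand(N, N)
b = np.random.rand(N, N)
out = np.empty((N, N))  # allocated once up front

def matmul_into(a, b, out):
    # np.matmul with out= writes directly into the provided buffer:
    # no new array is allocated and nothing is copied afterwards.
    np.matmul(a, b, out=out)
    return out

res = matmul_into(a, b, out)
assert res is out  # same buffer returned, no copy was made
```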