Justin Turney
That's a good point about streaming. We should look into supporting multiple GPUs and using NVLink for communication between them.
The current implementation of your batched gemm (using a vector of tensors) can't utilize the batched versions of gemm found in BLAS implementations.
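To make the layout issue concrete, here is a rough sketch, not the library's actual code. Batched GEMM interfaces in BLAS-style libraries generally want either an array of pointers or one contiguous buffer with a fixed stride between matrices, and a vector of independently allocated tensors gives neither without a packing step. The names `gemm_strided_batched` and `pack_batch` are hypothetical, and the naive loop only stands in for a real batched call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a strided-batched GEMM call (not a real BLAS API).
// Computes C_b = A_b * B_b for each batch b, row-major, where batch b of each
// operand starts at a fixed stride from the previous one in a single buffer.
void gemm_strided_batched(std::size_t m, std::size_t n, std::size_t k,
                          const double *A, std::size_t strideA,
                          const double *B, std::size_t strideB,
                          double *C, std::size_t strideC,
                          std::size_t batch) {
    for (std::size_t b = 0; b < batch; ++b) {
        const double *a = A + b * strideA;   // m x k matrix of batch b
        const double *bm = B + b * strideB;  // k x n matrix of batch b
        double *c = C + b * strideC;         // m x n result of batch b
        for (std::size_t i = 0; i < m; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (std::size_t p = 0; p < k; ++p) {
                    sum += a[i * k + p] * bm[p * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }
}

// Hypothetical helper: copy a vector of separately allocated matrices
// (each holding `elems` values) into one contiguous buffer, so the batch
// can be described by a base pointer plus a stride.
std::vector<double> pack_batch(const std::vector<std::vector<double>> &mats,
                               std::size_t elems) {
    std::vector<double> packed;
    packed.reserve(mats.size() * elems);
    for (const auto &mat : mats) {
        assert(mat.size() == elems);  // every matrix must have the same shape
        packed.insert(packed.end(), mat.begin(), mat.end());
    }
    return packed;
}
```

The packing copy is the extra cost of the vector-of-tensors layout. A tensor type that stores the batch index as a leading dimension could skip it and pass its data pointer and stride directly.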
Is there much of a difference between the container batching and the block/tiled tensor batching?
Does `C = 2.0 A - B` also work? I mean mixing and matching the operations like that.
Yeah, I've run into this a lot. I never made the connection of it being `x`.