Justin Turney

35 comments by Justin Turney

That's a good point about streaming. We should look into supporting multiple GPUs and using NVLink for communication between them.

The current implementation of your batched gemm (using a vector of tensors) can't utilize the batched versions of gemm found in BLAS implementations.
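One possible reading of this point, sketched in plain Python rather than the library's own code: strided-batched GEMM interfaces (for example cuBLAS's `cublasDgemmStridedBatched`) expect every matrix in the batch to sit in one contiguous buffer at a fixed stride, whereas a vector of separately allocated tensors naturally leads to one GEMM call per element. The names `gemm`, `gemm_strided_batched`, and `pack` below are hypothetical helpers for illustration only, and the naive loops stand in for a real BLAS call.

```python
def gemm(a, b):
    """Naive C = A @ B for matrices stored as lists of rows (illustrative only)."""
    return [[sum(a[r][p] * b[p][c] for p in range(len(b)))
             for c in range(len(b[0]))]
            for r in range(len(a))]


def pack(mats):
    """Flatten a list of row-major matrices into one contiguous flat list."""
    return [x for mat in mats for row in mat for x in row]


def gemm_strided_batched(a, b, batch, m, n, k):
    """C[i] = A[i] @ B[i] for row-major matrices packed back to back.

    Hypothetical stand-in for a strided-batched BLAS routine: one call,
    one buffer per operand, fixed stride between consecutive matrices.
    """
    stride_a, stride_b, stride_c = m * k, k * n, m * n
    c = [0.0] * (batch * stride_c)
    for i in range(batch):
        for r in range(m):
            for col in range(n):
                acc = 0.0
                for p in range(k):
                    acc += a[i * stride_a + r * k + p] * b[i * stride_b + p * n + col]
                c[i * stride_c + r * n + col] = acc
    return c


# Layout 1: a "vector of tensors" -> one gemm call per pair.
a_list = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
b_list = [[[1.0, 0.0], [0.0, 1.0]], [[5.0, 6.0], [7.0, 8.0]]]
looped = [gemm(a, b) for a, b in zip(a_list, b_list)]

# Layout 2: contiguous buffers -> a single batched call.
batched = gemm_strided_batched(pack(a_list), pack(b_list), batch=2, m=2, n=2, k=2)
```

Both layouts give the same numbers; the difference is that only the second one hands the whole batch to a single routine.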

Is there much of a difference between the container batch and the block/tiles tensor batching?

Does C = 2.0 A - B also work? That is, can scaling and subtraction be mixed and matched in one expression?

Yeah, I've run into this a lot. I never made the connection that it was `x`.