Corey Lowman
Often, it is recommended to loop over items in a CUDA kernel like: ```c++ for (unsigned int i = tid; i < n; i += blockDim.x * gridDim.x) { ......
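The snippet above is the standard CUDA grid-stride loop. Since the issue body is truncated, here is a hedged, CPU-side Rust sketch of the indexing pattern it describes: each thread starts at its global id and steps by the total thread count, so any launch size covers all `n` elements exactly once. The names (`total_threads`, `hits`) are illustrative, not from the issue.

```rust
// CPU sketch of the CUDA grid-stride loop pattern. `total_threads`
// stands in for blockDim.x * gridDim.x; each simulated thread `tid`
// visits i = tid, tid + stride, tid + 2*stride, ...
fn main() {
    let n = 10usize;
    let total_threads = 4usize; // launch size need not divide n
    let mut hits = vec![0u32; n];
    for tid in 0..total_threads {
        let mut i = tid;
        while i < n {
            hits[i] += 1; // a real kernel would process element i here
            i += total_threads;
        }
    }
    // Every element is visited exactly once, regardless of launch size.
    assert!(hits.iter().all(|&h| h == 1));
    println!("all {} elements covered once", n);
}
```

The point of the pattern is that correctness does not depend on launching exactly `n` threads.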
With #767 now added, we can run integration tests against more complex image networks! This new test should mirror the existing resnet18 integration test, but with a MobileNet structure.
Related to #672. Tensors could be backed by allocations that have more space than necessary. BTreeMap already has a method to return keys within a certain range that would make...
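The method being referred to is presumably `BTreeMap::range`, which returns all entries within a key range. A minimal sketch, assuming a hypothetical size-keyed cache (the real dfdx allocation structure is not shown in the issue text):

```rust
use std::collections::BTreeMap;

fn main() {
    // Hypothetical cache keyed by allocation size in bytes.
    let mut allocations: BTreeMap<usize, &str> = BTreeMap::new();
    allocations.insert(256, "buf-a");
    allocations.insert(512, "buf-b");
    allocations.insert(1024, "buf-c");

    // Find allocations at least as large as the requested size:
    // an oversized allocation can still back a smaller tensor.
    let requested = 300;
    let candidates: Vec<usize> = allocations
        .range(requested..)
        .map(|(&size, _)| size)
        .collect();
    println!("{:?}", candidates); // [512, 1024]
}
```

Because `BTreeMap` keeps keys sorted, this range query is logarithmic to locate plus linear in the number of matches, with no full scan.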
Currently we run the lhs & rhs kernels on separate streams, but due to kernel occupancy, they can't actually run in parallel: ![image](https://user-images.githubusercontent.com/7787278/231512269-83ad03f5-66f7-493c-ac4a-37621e2e7d65.png) Investigate ways to make these run in...
I think it might be possible to run the inference forward pass without any allocations. This would require the following: 1. Need to be able to compute the max output size of...
Currently you can only concatenate two tensors, with support for the two tensors having different shapes. When all the tensors in the concat operation have the same shape, we should also...
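One reason the same-shape case is attractive: when every input has the same shape, concatenation along the leading axis reduces to copying equally sized chunks into one contiguous buffer. A hedged sketch with plain slices (`concat_same_shape` is a hypothetical name, not a dfdx API):

```rust
// Hypothetical sketch: concatenate any number of same-shape buffers
// along the leading axis by copying into one contiguous allocation.
fn concat_same_shape(parts: &[&[f32]]) -> Vec<f32> {
    let total: usize = parts.iter().map(|p| p.len()).sum();
    let mut out = Vec::with_capacity(total);
    for part in parts {
        out.extend_from_slice(part); // each part occupies an equal chunk
    }
    out
}

fn main() {
    let a = [1.0f32, 2.0];
    let b = [3.0f32, 4.0];
    let c = [5.0f32, 6.0];
    let out = concat_same_shape(&[&a[..], &b[..], &c[..]]);
    println!("{:?}", out); // [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
}
```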
If the tensor has a NoneTape, it can only accept one closure, but if the tensor has an OwnedTape, it should accept both the f closure and the derivative closure...
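Since the issue body is truncated, here is a hedged sketch of the idea as I read it: a gradient-tracking tape wants both the forward closure and its derivative, while a no-op tape only ever runs the forward closure. All names below (`Tape::apply`, the scalar signature) are illustrative, not the actual dfdx API.

```rust
// Hypothetical sketch: the tape type decides whether the derivative
// closure is needed at all.
trait Tape {
    fn apply<F, D>(x: f32, f: F, df: D) -> (f32, Option<f32>)
    where
        F: Fn(f32) -> f32,
        D: Fn(f32) -> f32;
}

struct NoneTape;
struct OwnedTape;

impl Tape for NoneTape {
    fn apply<F, D>(x: f32, f: F, _df: D) -> (f32, Option<f32>)
    where
        F: Fn(f32) -> f32,
        D: Fn(f32) -> f32,
    {
        (f(x), None) // derivative closure is never invoked
    }
}

impl Tape for OwnedTape {
    fn apply<F, D>(x: f32, f: F, df: D) -> (f32, Option<f32>)
    where
        F: Fn(f32) -> f32,
        D: Fn(f32) -> f32,
    {
        (f(x), Some(df(x))) // record the derivative for backprop
    }
}

fn main() {
    let (y, dy) = OwnedTape::apply(2.0, |x| x * x, |x| 2.0 * x);
    assert_eq!((y, dy), (4.0, Some(4.0)));
    let (y, dy) = NoneTape::apply(2.0, |x| x * x, |x| 2.0 * x);
    assert_eq!((y, dy), (4.0, None));
}
```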
Oftentimes when neural networks output labels, we want to get the top K labels. This will be useful for validation & deployment. UX should look something like: ```rust let...
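The proposed UX is truncated above, so here is a hedged free-standing sketch of the underlying operation: selecting the indices of the K largest values from a logits slice. `top_k` is a hypothetical helper, not the actual dfdx API.

```rust
// Hypothetical helper: return the indices of the k largest values,
// largest first. The real dfdx API shape is not specified in the issue.
fn top_k(values: &[f32], k: usize) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..values.len()).collect();
    // Sort indices by their value, descending; treat NaN as equal so
    // the sort never panics.
    indices.sort_by(|&a, &b| {
        values[b]
            .partial_cmp(&values[a])
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    indices.truncate(k);
    indices
}

fn main() {
    let logits = [0.1f32, 2.3, -0.4, 1.7, 0.9];
    println!("{:?}", top_k(&logits, 3)); // [1, 3, 4]
}
```

A production version would likely use a partial selection (e.g. `select_nth_unstable_by`) instead of a full sort, but the full sort keeps the sketch simple.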
As brought up in a Reddit thread, an OpenCL device would be useful to support folks with AMD GPUs. Here are roughly the tasks that need to happen: 1. [ ]...