Corey Lowman
Often, it is recommended to loop over items in a CUDA kernel like: ```c++ for (unsigned int i = tid; i < n; i += blockDim.x * gridDim.x) { ......
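The snippet above is the standard CUDA grid-stride loop. Since the issue body is truncated, here is a hedged, CPU-side Rust sketch of the indexing pattern it describes: each thread starts at its global id and steps by the total thread count, so any launch size covers all `n` elements exactly once. The names (`total_threads`, `hits`) are illustrative, not from the issue.

```rust
// CPU sketch of the CUDA grid-stride loop pattern. `total_threads`
// stands in for blockDim.x * gridDim.x; each simulated thread `tid`
// visits i = tid, tid + stride, tid + 2*stride, ...
fn main() {
    let n = 10usize;
    let total_threads = 4usize; // launch size need not divide n
    let mut hits = vec![0u32; n];
    for tid in 0..total_threads {
        let mut i = tid;
        while i < n {
            hits[i] += 1; // a real kernel would process element i here
            i += total_threads;
        }
    }
    // Every element is visited exactly once, regardless of launch size.
    assert!(hits.iter().all(|&h| h == 1));
    println!("all {} elements covered once", n);
}
```

The point of the pattern is that correctness does not depend on launching exactly `n` threads.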
With #767 now added, we can run integration tests against more complex image networks! This new test should mirror the existing resnet18 integration test, but with a MobileNet structure.
Related to #672. Tensors could be backed by allocations that have more space than necessary. BTreeMap already has a method to return keys within a certain range that would make...
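The method being referred to is presumably `BTreeMap::range`, which returns all entries within a key range. A minimal sketch, assuming a hypothetical size-keyed cache (the real dfdx allocation structure is not shown in the issue text):

```rust
use std::collections::BTreeMap;

fn main() {
    // Hypothetical cache keyed by allocation size in bytes.
    let mut allocations: BTreeMap<usize, &str> = BTreeMap::new();
    allocations.insert(256, "buf-a");
    allocations.insert(512, "buf-b");
    allocations.insert(1024, "buf-c");

    // Find allocations at least as large as the requested size:
    // an oversized allocation can still back a smaller tensor.
    let requested = 300;
    let candidates: Vec<usize> = allocations
        .range(requested..)
        .map(|(&size, _)| size)
        .collect();
    println!("{:?}", candidates); // [512, 1024]
}
```

Because `BTreeMap` keeps keys sorted, this range query is logarithmic to locate plus linear in the number of matches, with no full scan.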
Currently we run the lhs & rhs kernels on separate streams, but due to kernel occupancy, they can't actually run in parallel: ![image](https://user-images.githubusercontent.com/7787278/231512269-83ad03f5-66f7-493c-ac4a-37621e2e7d65.png) Investigate ways to make these run in...
I think it might be possible to run the inference forward pass without any allocations. This would require the following: 1. Need to be able to compute the max output size of...
Currently you can only concatenate two tensors, with support for the two tensors having different shapes. When all the tensors in the concat operation have the same shape, we should also...
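One reason the same-shape case is attractive: when every input has the same shape, concatenation along the leading axis reduces to copying equally sized chunks into one contiguous buffer. A hedged sketch with plain slices (`concat_same_shape` is a hypothetical name, not a dfdx API):

```rust
// Hypothetical sketch: concatenate any number of same-shape buffers
// along the leading axis by copying into one contiguous allocation.
fn concat_same_shape(parts: &[&[f32]]) -> Vec<f32> {
    let total: usize = parts.iter().map(|p| p.len()).sum();
    let mut out = Vec::with_capacity(total);
    for part in parts {
        out.extend_from_slice(part); // each part occupies an equal chunk
    }
    out
}

fn main() {
    let a = [1.0f32, 2.0];
    let b = [3.0f32, 4.0];
    let c = [5.0f32, 6.0];
    let out = concat_same_shape(&[&a[..], &b[..], &c[..]]);
    println!("{:?}", out); // [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
}
```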
If the tensor has a NoneTape, it can only accept one closure, but if the tensor has an OwnedTape, it should accept both the f closure and the derivative closure...
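Since the issue body is truncated, here is a hedged sketch of the idea as I read it: a gradient-tracking tape wants both the forward closure and its derivative, while a no-op tape only ever runs the forward closure. All names below (`Tape::apply`, the scalar signature) are illustrative, not the actual dfdx API.

```rust
// Hypothetical sketch: the tape type decides whether the derivative
// closure is needed at all.
trait Tape {
    fn apply<F, D>(x: f32, f: F, df: D) -> (f32, Option<f32>)
    where
        F: Fn(f32) -> f32,
        D: Fn(f32) -> f32;
}

struct NoneTape;
struct OwnedTape;

impl Tape for NoneTape {
    fn apply<F, D>(x: f32, f: F, _df: D) -> (f32, Option<f32>)
    where
        F: Fn(f32) -> f32,
        D: Fn(f32) -> f32,
    {
        (f(x), None) // derivative closure is never invoked
    }
}

impl Tape for OwnedTape {
    fn apply<F, D>(x: f32, f: F, df: D) -> (f32, Option<f32>)
    where
        F: Fn(f32) -> f32,
        D: Fn(f32) -> f32,
    {
        (f(x), Some(df(x))) // record the derivative for backprop
    }
}

fn main() {
    let (y, dy) = OwnedTape::apply(2.0, |x| x * x, |x| 2.0 * x);
    assert_eq!((y, dy), (4.0, Some(4.0)));
    let (y, dy) = NoneTape::apply(2.0, |x| x * x, |x| 2.0 * x);
    assert_eq!((y, dy), (4.0, None));
}
```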
Oftentimes when neural networks output labels, we want to get the top K labels. This will be useful for validation & deployment. UX should look something like: ```rust let...
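The proposed UX is truncated above, so here is a hedged free-standing sketch of the underlying operation: selecting the indices of the K largest values from a logits slice. `top_k` is a hypothetical helper, not the actual dfdx API.

```rust
// Hypothetical helper: return the indices of the k largest values,
// largest first. The real dfdx API shape is not specified in the issue.
fn top_k(values: &[f32], k: usize) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..values.len()).collect();
    // Sort indices by their value, descending; treat NaN as equal so
    // the sort never panics.
    indices.sort_by(|&a, &b| {
        values[b]
            .partial_cmp(&values[a])
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    indices.truncate(k);
    indices
}

fn main() {
    let logits = [0.1f32, 2.3, -0.4, 1.7, 0.9];
    println!("{:?}", top_k(&logits, 3)); // [1, 3, 4]
}
```

A production version would likely use a partial selection (e.g. `select_nth_unstable_by`) instead of a full sort, but the full sort keeps the sketch simple.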
As brought up in a Reddit thread, an OpenCL device would be useful to support folks with AMD GPUs. Here are roughly the tasks that need to happen: 1. [ ]...