Corey Lowman

Results 78 issues of Corey Lowman

Currently only support 1d arrays/vecs

new feature

PyTorch doesn't calculate the input gradient of Conv2d unless `requires_grad == True`, whereas dfdx currently always updates the input gradient. Skipping it is a giant speed boost: if you comment out the...

ideation

This is an optimization that further reduces the amount of data needed for certain cases of binary operations. For example, if you add two tensors that both have the 0th...

See the PyTorch docs: https://pytorch.org/docs/stable/generated/torch.nn.Softplus.html#torch.nn.Softplus

Will need to figure out what the derivative of this function is, but that's probably online somewhere. See https://github.com/coreylowman/dfdx/pull/397 for all the pieces you'll need to...
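For reference, the derivative in question has a closed form. PyTorch documents Softplus as `(1 / beta) * ln(1 + exp(beta * x))`, falling back to the identity when `beta * x > threshold`; differentiating gives `sigmoid(beta * x)`, and `1` in the linear region. The sketch below is plain scalar Rust, not dfdx code: the function names `softplus` and `softplus_grad` are hypothetical, and how this would map onto dfdx's kernel traits is not shown here.

```rust
/// Scalar Softplus following the formula in the PyTorch docs.
/// `beta` and `threshold` mirror PyTorch's parameters (defaults 1 and 20).
/// Hypothetical helper, not part of dfdx.
fn softplus(x: f64, beta: f64, threshold: f64) -> f64 {
    let z = beta * x;
    if z > threshold {
        // Linear region: avoids overflow in exp for large inputs.
        x
    } else {
        // ln(1 + exp(z)) / beta, using ln_1p for accuracy when exp(z) is small.
        z.exp().ln_1p() / beta
    }
}

/// Derivative of `softplus` with respect to `x`.
/// In the smooth region this is sigmoid(beta * x); in the linear region it is 1.
fn softplus_grad(x: f64, beta: f64, threshold: f64) -> f64 {
    let z = beta * x;
    if z > threshold {
        1.0
    } else {
        1.0 / (1.0 + (-z).exp())
    }
}
```

A quick sanity check is to compare `softplus_grad` against a central finite difference of `softplus`; at `x = 0` the gradient is `sigmoid(0) = 0.5` for any `beta`.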

See the PyTorch docs: https://pytorch.org/docs/stable/generated/torch.nn.Hardswish.html#torch.nn.Hardswish

Will need to figure out what the derivative of this function is, but that's probably online somewhere. See https://github.com/coreylowman/dfdx/pull/397 for all the pieces you'll need to...
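The derivative here can be read off the piecewise definition in the PyTorch docs: Hardswish is `0` for `x <= -3`, `x` for `x >= 3`, and `x * (x + 3) / 6` in between, so the gradient is `0`, `1`, and `(2x + 3) / 6` on those three pieces. The sketch below is plain scalar Rust with hypothetical names (`hardswish`, `hardswish_grad`), not dfdx code. The function is not differentiable at exactly `x = -3` and `x = 3`; assigning the outer pieces' slopes there is an assumption of this sketch.

```rust
/// Scalar Hardswish following the piecewise definition in the PyTorch docs.
/// Hypothetical helper, not part of dfdx.
fn hardswish(x: f64) -> f64 {
    if x <= -3.0 {
        0.0
    } else if x >= 3.0 {
        x
    } else {
        x * (x + 3.0) / 6.0
    }
}

/// Derivative of `hardswish` with respect to `x`.
/// At the kinks (x = -3 and x = 3) this uses the slope of the outer piece,
/// which is a convention chosen for this sketch.
fn hardswish_grad(x: f64) -> f64 {
    if x <= -3.0 {
        0.0
    } else if x >= 3.0 {
        1.0
    } else {
        // d/dx [x * (x + 3) / 6] = (2x + 3) / 6
        (2.0 * x + 3.0) / 6.0
    }
}
```

Note that the middle piece has a negative slope for `x < -1.5`, so the gradient is not monotone even though it is piecewise linear.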

See the thread in https://github.com/coreylowman/dfdx/pull/256#discussion_r1008886075

Currently AddInto only properly records gradients if all inputs have a tape on them. However, now that add will merge tapes, it would be nice if...

TODO:
- [ ] Generate bindings for cudnn
- [ ] Generate bindings for nccl

> Okay, after too much time investigating. It seems the Cuda Driver (not NCCL) is using a global MUTEX which makes multithread/multigpu quite useless. https://forums.developer.nvidia.com/t/cuda-wont-concurrently-run-kernels-on-multiple-devices-from-within-same-process/240388 https://forums.developer.nvidia.com/t/multithreaded-tensorrt-performance-drops-dramatically/184882/8 https://forums.developer.nvidia.com/t/cuda-introduces-heavy-locks/61357 > > Basically...

Using the `abi_ptx` feature (https://doc.rust-lang.org/unstable-book/language-features/abi-ptx.html), you can compile Rust functions to PTX on a nightly compiler. How can we use this with cudarc?