Tri Dao
Does the machine you compile on have a GPU? Does it have access to the CUDA compiler (`nvcc`)? Or are you just using the CPU version?
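To check, something along these lines should tell you (the `nvcc` lookup assumes it's on your PATH):

```python
import shutil
import subprocess

import torch

# Does PyTorch see a GPU on this machine?
print("CUDA available:", torch.cuda.is_available())

# Is the CUDA compiler (nvcc) on PATH?
nvcc = shutil.which("nvcc")
print("nvcc found at:", nvcc)
if nvcc is not None:
    print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)
```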
Thanks, let me try to reproduce on my end.
1. The CUDA implementation processes 5 or 7 "layers" in one kernel call for efficiency (hence the "max5" part).
2. I'm surprised that the Python overhead is that high, but...
Are you running this from an interactive session (e.g. ipython)? If so, relative imports (`from .blah import blah`) don't work. This is a general issue with Python imports (https://realpython.com/absolute-vs-relative-python-imports/)....
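One workaround is to put the repo root on `sys.path` and use an absolute import instead (the path below is a placeholder, and the exact module path may differ):

```python
# In an interactive session there is no parent package, so this fails:
#   from .butterfly import Butterfly
#   ImportError: attempted relative import with no known parent package

# Instead, make the repo root importable and use an absolute import:
import sys
sys.path.append("/path/to/repo")  # hypothetical checkout location

from butterfly import Butterfly  # absolute import of the `butterfly` directory
```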
The butterfly matrix isn't explicitly generated (as that would use O(n^2) space and take O(n^2) time to multiply by a vector). Instead, we provide a `Butterfly` class that stores O(n...
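As a rough illustration of the matrix-free multiply (not the repo's actual code; the twiddle-factor layout here is made up for the sketch):

```python
import math
import torch

def butterfly_multiply(twiddle, x):
    """Multiply by a butterfly matrix in O(n log n) without ever forming it.

    twiddle: (log2(n), n // 2, 2, 2) tensor of 2x2 blocks (hypothetical layout).
    x: (batch, n) input, with n a power of 2.
    """
    batch, n = x.shape
    for level in range(int(math.log2(n))):
        stride = 1 << level
        nblocks = n // (2 * stride)
        # Pair up entries at distance `stride` within each block of size 2 * stride.
        x = x.reshape(batch, nblocks, 2, stride)
        t = twiddle[level].reshape(nblocks, stride, 2, 2)
        # Apply the 2x2 block t[blk, s] to each pair (x[b, blk, 0, s], x[b, blk, 1, s]).
        x = torch.einsum('ksij,bkjs->bkis', t, x)
        x = x.reshape(batch, n)
    return x
```

Each of the log2(n) levels holds n/2 two-by-two blocks, which is where the 2n log n parameter count comes from.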
You're right, it's 2n log n parameters. Remember to pass in `tied_weight=False`, like `Butterfly(1024, 1024, tied_weight=False)`. The version with tied weights was used in an earlier paper, where we only...
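For example (assuming the `butterfly` package is importable; the exact count may differ slightly if the implementation includes extras such as bias terms):

```python
from butterfly import Butterfly

n = 1024
b = Butterfly(n, n, tied_weight=False)
# Expect on the order of 2 * n * log2(n) = 2 * 1024 * 10 = 20480 parameters.
print(sum(p.numel() for p in b.parameters()))
```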
If you're asking about how to compute the gradient w.r.t. the weights:
1. In the pure PyTorch implementation, we rely on PyTorch's autodifferentiation to compute gradients (see the sketch below).
2. In the...
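A minimal sketch of the first option, assuming `Butterfly` behaves as a standard `nn.Module`:

```python
import torch
from butterfly import Butterfly  # assumes the repo root is on sys.path

b = Butterfly(1024, 1024, tied_weight=False)
x = torch.randn(32, 1024)
loss = b(x).pow(2).mean()  # any differentiable loss
loss.backward()            # autograd fills p.grad for every weight
print({name: p.grad.shape for name, p in b.named_parameters()})
```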
It can be used with any differentiable loss function. In back-propagation/reverse-mode autodiff, for each layer, we are given the incoming error/grad from the layers above, and use that to compute...
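The pattern looks like this toy custom layer (illustrative only, not the butterfly kernel):

```python
import torch

class Scale(torch.autograd.Function):
    """Toy layer y = w * x, to show how the incoming grad is consumed."""

    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return w * x

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output is the incoming error/grad from the layers above.
        x, w = ctx.saved_tensors
        grad_x = grad_output * w          # grad w.r.t. the input, passed downward
        grad_w = (grad_output * x).sum()  # grad w.r.t. the weight
        return grad_x, grad_w

x = torch.randn(4, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
Scale.apply(x, w).sum().backward()  # any differentiable loss works here
```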
We only use Ray to run the experiments (parallel runs and hyperparameter optimization). The `Butterfly` class (in the `butterfly` directory) doesn't require Ray.
It might be an error on Ray's side. Can you check that you can run Ray's quick start example (https://github.com/ray-project/ray)? Which OS are you using? We've only tested on Linux...
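The quick start is something along these lines:

```python
import ray

ray.init()

@ray.remote
def f(x):
    return x * x

# Four tasks executed in parallel by local Ray workers.
print(ray.get([f.remote(i) for i in range(4)]))  # [0, 1, 4, 9]
```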