Tri Dao
Sure, I'll create a Docker image and/or Colab this weekend. I'm a bit swamped with deadlines until Friday.
Thank you @MatthieuTPHR, super excited to see ideas on fast & memory-efficient attention having an impact!
> I tried to compare [my code](https://gist.github.com/buttercutter/b3331ca1fd9e2f5871b0eded6b758f39) with [your code](https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba_simple.py) as well as [@johnma2006 's code](https://github.com/johnma2006/candle/blob/main/candle/models/mamba/model.py#L195) line-by-line, taking the three code files in perspective, but there seem to be no successful findings...
Yup, it's only implemented for CUDA for now. You can look at `selective_scan_ref` for the pure PyTorch implementation that should run on CPU (though probably quite slowly).
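Something along these lines should run on CPU (a rough sketch; the tensor shapes follow the reference implementation's docstring, so double-check the exact signature in `mamba_ssm/ops/selective_scan_interface.py`):

```python
import torch
from mamba_ssm.ops.selective_scan_interface import selective_scan_ref

# Assumed shapes: u, delta are (batch, d_inner, seqlen); A is (d_inner, d_state);
# B, C are (batch, d_state, seqlen); D is (d_inner,).
batch, d_inner, d_state, seqlen = 2, 64, 16, 128
u = torch.randn(batch, d_inner, seqlen)      # input sequence
delta = torch.rand(batch, d_inner, seqlen)   # timestep, kept positive via softplus below
A = -torch.rand(d_inner, d_state)            # negative real part for a stable state matrix
B = torch.randn(batch, d_state, seqlen)
C = torch.randn(batch, d_state, seqlen)
D = torch.randn(d_inner)

# Pure-PyTorch reference scan: runs on CPU, but much slower than the CUDA kernel.
y = selective_scan_ref(u, delta, A, B, C, D=D, delta_softplus=True)
print(y.shape)  # expected: (batch, d_inner, seqlen)
```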
You can use dropout, just like in Transformers. It's not implemented here but you can add it.
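For example, something like this sketch would put dropout on the block output, the same way you would after an attention or MLP layer in a Transformer (the `Mamba` constructor arguments here are kept at defaults and are illustrative, not the only place dropout could go):

```python
import torch
import torch.nn as nn
from mamba_ssm.modules.mamba_simple import Mamba

class MambaBlockWithDropout(nn.Module):
    """Mamba mixer followed by dropout, analogous to attention + dropout in a Transformer."""
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.mixer = Mamba(d_model=d_model)   # other Mamba args left at their defaults
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seqlen, d_model)
        return self.dropout(self.mixer(hidden_states))
```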
Are you sure it's from this repo? I did a search for "Cauchy" and found nothing.
I think the warning is from the s4 repo.
I think I've seen [it](https://github.com/HazyResearch/flash-attention/issues/21). I haven't figured out the cause, but I think it's some combination of gcc version and nvcc version.
I think I've fixed the error "internal compiler error: in maybe_undo_parenthesized_ref" with this [commit](https://github.com/HazyResearch/flash-attention/commit/8a2ece89f7bd5d3124a6cae5fd95db5e85f07ee6) in the flash-attention repo.
You'd probably want to write a MambaClassifierHeadModel that has a similar structure: a Mamba model backbone with a classifier head.
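A rough sketch of what that could look like (the class name, layer structure, and mean pooling here are illustrative assumptions, not the internals of `MambaLMHeadModel`; adjust norms, pooling, and hyperparameters to your task):

```python
import torch
import torch.nn as nn
from mamba_ssm.modules.mamba_simple import Mamba

class MambaClassifierHeadModel(nn.Module):
    """Hypothetical sketch: a stack of Mamba blocks as the backbone, plus a linear classifier head."""
    def __init__(self, vocab_size: int, d_model: int, n_layer: int, num_classes: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(Mamba(d_model=d_model) for _ in range(n_layer))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layer))
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seqlen)
        x = self.embedding(input_ids)                    # (batch, seqlen, d_model)
        for norm, mixer in zip(self.norms, self.layers):
            x = x + mixer(norm(x))                       # pre-norm residual Mamba block
        pooled = x.mean(dim=1)                           # mean-pool over the sequence
        return self.head(pooled)                         # (batch, num_classes)
```

Taking the last token's hidden state instead of mean pooling is another common choice for a causal backbone.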