Accelerated First Order Parallel Associative Scan
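
For orientation, a first-order scan computes the recurrence h_t = a_t * h_{t-1} + x_t. The recurrence can be evaluated in a logarithmic number of parallel steps because pairs (a, x) compose associatively. The following is a minimal NumPy sketch of that idea, not the repository's kernels: the function names `scan_sequential` and `scan_parallel` are hypothetical, and the initial state is assumed to be zero.

```python
import numpy as np

def scan_sequential(gates, tokens):
    """Reference loop for h_t = gates[t] * h_{t-1} + tokens[t], assuming h_{-1} = 0."""
    h = np.zeros(len(tokens), dtype=np.float64)
    state = 0.0
    for t in range(len(tokens)):
        state = gates[t] * state + tokens[t]
        h[t] = state
    return h

def scan_parallel(gates, tokens):
    """Hillis-Steele style inclusive scan over pairs (a, b).

    Composing an earlier pair (a_l, b_l) with a later pair (a_r, b_r)
    gives (a_l * a_r, a_r * b_l + b_r); this operator is associative.
    """
    a = np.asarray(gates, dtype=np.float64).copy()
    b = np.asarray(tokens, dtype=np.float64).copy()
    n = len(b)
    step = 1
    while step < n:
        # Shift by `step`, padding with the identity pair (1, 0).
        a_prev = np.concatenate([np.ones(step), a[:-step]])
        b_prev = np.concatenate([np.zeros(step), b[:-step]])
        b = a * b_prev + b   # later gate applied to the earlier partial result
        a = a_prev * a       # accumulated gate product over the window
        step *= 2
    return b
```

Each pass of the `while` loop is elementwise over the whole sequence, which is the part a GPU kernel would run in parallel.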

Results 9 accelerated-scan issues

Running the Triton implementation with torch 2.2 on inputs of type float16 and bfloat16 results in the following error: ``` File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None,...

@sustcsonglin has suggested that float accumulation might improve the stability of the implementation. The test I'm currently using to check this is: ``` python -m pytest tests -s -v -k...

@proger Awesome work! Always appreciate the wonderful contributions of OSS advancing the frontiers of research. I know you've done a number of experiments comparing various scan implementations in your other...

[Feng et al.](https://arxiv.org/abs/2410.01201) proposed a log-space implementation of parallel scan for improved numerical stability. It should be fairly easy to implement, but I'm a bit out of practice with my...
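
One way to read the log-space idea is as a cumulative sum of log gates followed by a cumulative log-sum-exp. The sketch below makes that concrete under loud assumptions: gates and inputs are strictly positive, the initial state is zero, and the function name `scan_log_space` is hypothetical. It is not the paper's code nor this repository's.

```python
import numpy as np

def scan_log_space(log_gates, log_tokens):
    """Return log(h_t) for h_t = a_t * h_{t-1} + b_t, assuming h_{-1} = 0.

    Assumes a_t > 0 and b_t > 0 so that their logarithms exist.
    Uses h_t = sum_{k <= t} exp(A_t - A_k) * b_k with A_t = cumsum(log a)_t.
    """
    a_star = np.cumsum(log_gates)  # A_t: running sum of log gates
    # Running log-sum-exp of (log b_k - A_k); logaddexp avoids overflow.
    log_h = a_star + np.logaddexp.accumulate(log_tokens - a_star)
    return log_h
```

Both `np.cumsum` and the running `logaddexp` are themselves scans, so the same parallel-scan machinery applies; the result stays in log space until the caller exponentiates it.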

Hi, I was wondering if there are any plans to make the CUDA code compatible with complex numbers? This would be particularly helpful given that Triton does not currently...

Introducing a kernel for training a [fast weight programmer](http://proceedings.mlr.press/v139/schlag21a/schlag21a.pdf) by backpropagating through the [delta rule](https://www-isl.stanford.edu/~widrow/papers/c1960adaptiveswitching.pdf) (online linear regression) with @ischlag. Improving on top of first order recurrence with scalar hidden...

Thank you for your excellent work! I was wondering if it’s possible to modify your code to handle a state-space model case where the gates (A matrix) have a more...

Warp kernel crashes for some input data in fp16 and bf16. E.g. ``` [B C T ] [2, 2, 32768] -- works [4, 2, 32768] -- doesn't [2, 4, 32768]...