CUDA-Learn-Notes
🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.

Has the author tested the `__threadfence()` in 0x09 softmax? It doesn't seem able to achieve grid-level synchronization between threads.
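
For context, `__threadfence()` only enforces memory-ordering visibility of a block's prior writes; it is not a barrier, so by itself it cannot synchronize blocks across the grid. Common workarounds are cooperative groups' `grid.sync()` or the "last block" pattern. Below is a minimal sketch of that pattern under assumed names (`block_done_count`, `last_block_pattern`, `partial_sums`); it is not the repo's actual kernel.

```cuda
#include <cuda_runtime.h>

// Assumed global counter used to detect the last block to finish (not repo code).
__device__ unsigned int block_done_count = 0;

__global__ void last_block_pattern(float* partial_sums, float* total) {
  // ... each block first writes its partial result to partial_sums[blockIdx.x] (omitted) ...
  __shared__ bool is_last_block;
  __threadfence();  // make this block's global writes visible to other blocks
  if (threadIdx.x == 0) {
    unsigned int done = atomicAdd(&block_done_count, 1u);
    is_last_block = (done == gridDim.x - 1);  // true only in the final block to arrive
  }
  __syncthreads();
  if (is_last_block && threadIdx.x == 0) {
    float sum = 0.0f;
    for (unsigned int i = 0; i < gridDim.x; ++i) sum += partial_sums[i];
    *total = sum;          // the last block is guaranteed to see all partials
    block_done_count = 0;  // reset for the next launch
  }
}
```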
Two forms show up:
1. `sum = warp_reduce_sum(sum);`
2. `if (warp == 0) sum = warp_reduce_sum(sum);`
In 0x03 warp/block reduce sum/max and in 0x09 softmax / softmax + vec4, the final sum uses the first form, while 0x04 block all reduce + vec4 uses the second form. My understanding is that the final sum should use the second form, since the per-warp partials all end up in the first warp. Thanks!
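
For reference, the two forms usually correspond to the two stages of one block-level reduction: every warp first does an intra-warp shuffle reduce (form 1), the per-warp partials are staged in shared memory, and then only warp 0 reduces those partials (form 2). A minimal sketch under assumed names (`warp_reduce_sum`, `block_reduce_sum`, `NUM_THREADS`); the repo's kernels may differ in detail.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Form 1: intra-warp reduction via shuffles; every lane ends with the warp's sum.
__device__ __forceinline__ float warp_reduce_sum(float val) {
  #pragma unroll
  for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
    val += __shfl_xor_sync(0xffffffff, val, offset);
  }
  return val;
}

// Two-stage block reduction: warp reduce, stash per-warp partials in shared
// memory, then let warp 0 alone reduce those partials (form 2 in the question).
template <int NUM_THREADS = 256>
__device__ float block_reduce_sum(float val) {
  constexpr int NUM_WARPS = (NUM_THREADS + WARP_SIZE - 1) / WARP_SIZE;
  __shared__ float warp_partials[NUM_WARPS];
  int warp = threadIdx.x / WARP_SIZE;
  int lane = threadIdx.x % WARP_SIZE;

  val = warp_reduce_sum(val);                 // form 1: all warps participate
  if (lane == 0) warp_partials[warp] = val;   // one partial per warp
  __syncthreads();

  // Only warp 0 needs to fold the NUM_WARPS partials into the final sum.
  val = (lane < NUM_WARPS) ? warp_partials[lane] : 0.0f;
  if (warp == 0) val = warp_reduce_sum(val);  // form 2: guarded by warp == 0
  return val;                                 // valid in warp 0
}
```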
Saw the mention of GELU in the issue, so I worked on it. There is no GELU implementation in torch for half precision (the reason is explained in readme.md), so...
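
For illustration, a half-precision GELU kernel typically uses the tanh approximation and does the intermediate math in fp32 before casting back to half. The sketch below is not the implementation referred to above; the kernel and helper names (`gelu_f16_kernel`, `gelu_half`) are assumptions.

```cuda
#include <cuda_fp16.h>
#include <math.h>

// GELU (tanh approximation): 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
// The math is done in fp32 and cast back to half to limit precision loss.
__device__ __forceinline__ half gelu_half(half xh) {
  float x = __half2float(xh);
  const float k0 = 0.7978845608028654f;  // sqrt(2 / pi)
  const float k1 = 0.044715f;
  float y = 0.5f * x * (1.0f + tanhf(k0 * (x + k1 * x * x * x)));
  return __float2half(y);
}

// Illustrative elementwise kernel; the actual kernel in this repo may differ.
__global__ void gelu_f16_kernel(const half* __restrict__ x, half* __restrict__ y, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) y[idx] = gelu_half(x[idx]);
}
```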
## TODO

- [ ] swish kernel
- [ ] gelu kernel
- [ ] RoPE kernel
- [x] pack elementwise_add
- [x] pack sigmoid
- [x] pack relu
- ...
Is the layer norm implementation in the readme actually a batch norm?
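
The distinction behind the question: layer norm computes the mean and variance over the K features of each individual row (one sample), while batch norm computes them over the N rows of the batch for each feature. A deliberately naive sketch of the per-row reduction follows, with assumed names (`layer_norm_naive`, `N`, `K`, `eps`); it is not the readme's implementation, which would use a block per row with shared-memory reductions.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Layer norm: mean/variance are taken over the K features of each row, not
// over the batch dimension as in batch norm. One thread handles one row here
// purely to make the reduction axis obvious.
__global__ void layer_norm_naive(const float* __restrict__ x,
                                 float* __restrict__ y,
                                 int N, int K, float eps) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= N) return;
  const float* xr = x + row * K;
  float* yr = y + row * K;

  float mean = 0.0f;
  for (int k = 0; k < K; ++k) mean += xr[k];
  mean /= K;

  float var = 0.0f;
  for (int k = 0; k < K; ++k) {
    float d = xr[k] - mean;
    var += d * d;
  }
  var /= K;

  float inv_std = rsqrtf(var + eps);
  for (int k = 0; k < K; ++k) yr[k] = (xr[k] - mean) * inv_std;
}
```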