
🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.

6 CUDA-Learn-Notes issues, sorted by recently updated

![cuda-learn-note](https://github.com/DefTruth/CUDA-Learn-Note/assets/31974251/882271fe-ab60-4b0e-9440-2e0fa3c0fb6f)


Have you tested the `__threadfence()` in 0x09 softmax? It doesn't seem able to achieve grid-level synchronization across thread blocks.
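For context, `__threadfence()` only orders memory writes so they become visible to other blocks; it is not a grid-wide barrier. Below is a minimal sketch of the usual single-kernel workaround, the atomic-counter / "last block finishes" pattern used in the CUDA threadFenceReduction sample; names such as `block_sums` and `block_count` are illustrative, not from this repo.

```cuda
#include <cuda_runtime.h>

// Counter of retired blocks; assumed reset to 0 between launches.
__device__ unsigned int block_count = 0;

// Launch sketch: reduce_sum_grid<<<grid, block, block * sizeof(float)>>>(...)
// with a power-of-two block size and block_sums sized to gridDim.x.
__global__ void reduce_sum_grid(const float *x, float *block_sums,
                                float *out, int N) {
  extern __shared__ float smem[];
  const int tid = threadIdx.x;
  const int gid = blockIdx.x * blockDim.x + tid;

  // intra-block tree reduction in shared memory
  smem[tid] = (gid < N) ? x[gid] : 0.0f;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) smem[tid] += smem[tid + s];
    __syncthreads();
  }

  __shared__ bool is_last_block;
  if (tid == 0) {
    block_sums[blockIdx.x] = smem[0];
    __threadfence();  // make this block's partial visible to other blocks
    unsigned int ticket = atomicInc(&block_count, gridDim.x);
    is_last_block = (ticket == gridDim.x - 1);
  }
  __syncthreads();

  // Only the last block to retire reduces the per-block partials.
  if (is_last_block) {
    float sum = 0.0f;
    for (int i = tid; i < gridDim.x; i += blockDim.x) sum += block_sums[i];
    smem[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
      if (tid < s) smem[tid] += smem[tid + s];
      __syncthreads();
    }
    if (tid == 0) { *out = smem[0]; block_count = 0; }
  }
}
```

The other option is cooperative groups (`cooperative_groups::this_grid().sync()` with `cudaLaunchCooperativeKernel`), which does give a true grid barrier but limits how many blocks can be launched.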

Two forms appear for the final reduction:

1. `sum = warp_reduce_sum(sum);`
2. `if (warp == 0) sum = warp_reduce_sum(sum);`

0x03 warp/block reduce sum/max and 0x09 softmax, softmax + vec4 use the first form for the final sum, while 0x04 block all reduce + vec4 uses the second form. My understanding is that the final sum should use the second form, since the per-warp partial sums all end up in the first warp. Thanks!
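For reference, a minimal sketch of the block reduce pattern under discussion, assuming a shuffle-based `warp_reduce_sum` like the one in the notes and a 256-thread block. In this sketch, warp 0 ends up with the correct total with either form; the `if (warp_id == 0)` guard (form 2) mainly skips the redundant final reduce in the other warps.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

template <int kWarpSize = WARP_SIZE>
__device__ __forceinline__ float warp_reduce_sum(float val) {
  #pragma unroll
  for (int mask = kWarpSize >> 1; mask >= 1; mask >>= 1) {
    val += __shfl_xor_sync(0xffffffff, val, mask);
  }
  return val;
}

// Block-level sum; only lane 0 of warp 0 is guaranteed to hold the result.
template <int NUM_THREADS = 256>
__device__ float block_reduce_sum(float val) {
  constexpr int NUM_WARPS = (NUM_THREADS + WARP_SIZE - 1) / WARP_SIZE;
  const int warp_id = threadIdx.x / WARP_SIZE;
  const int lane_id = threadIdx.x % WARP_SIZE;
  __shared__ float smem[NUM_WARPS];

  val = warp_reduce_sum<WARP_SIZE>(val);     // per-warp partial sums
  if (lane_id == 0) smem[warp_id] = val;
  __syncthreads();

  // The partials now sit in smem[0..NUM_WARPS), i.e. within warp 0's lanes.
  val = (lane_id < NUM_WARPS) ? smem[lane_id] : 0.0f;
  if (warp_id == 0) val = warp_reduce_sum<NUM_WARPS>(val);  // form 2
  return val;
}
```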

I saw the mention of GELU in the issue, so I worked on it. There is no half-precision GELU implementation in torch (the reason is explained in readme.md), so...
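For reference, a minimal sketch of what a standalone fp16 GELU elementwise kernel can look like, using the tanh approximation; this is illustrative only, not the code from this PR. The math is done in fp32 per element and converted back to half to sidestep fp16 precision issues in `tanh`.

```cuda
#include <cuda_fp16.h>

// Launch sketch: gelu_f16_kernel<<<(N + 255) / 256, 256>>>(x, y, N);
__global__ void gelu_f16_kernel(const half *x, half *y, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) {
    float v = __half2float(x[idx]);
    // GELU(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    const float c = 0.7978845608028654f;  // sqrt(2/pi)
    float out = 0.5f * v * (1.0f + tanhf(c * (v + 0.044715f * v * v * v)));
    y[idx] = __float2half(out);
  }
}
```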

## TODO

- [ ] swish kernel
- [ ] gelu kernel
- [ ] RoPE kernel
- [x] pack elementwise_add
- [x] pack sigmoid
- [x] pack relu
- ...

Is the layer norm implementation in the readme actually a batch norm implementation?
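For what it's worth, layer norm normalizes each row over its K features, while batch norm normalizes each feature over the N rows of the batch. A minimal per-row layer norm sketch is shown below; it is not the readme's code, and the `block_sum` helper, the scalar gamma/beta `g`/`b`, and the one-block-per-row launch are illustrative simplifications.

```cuda
#include <cuda_runtime.h>

// Simple shared-memory tree reduction over blockDim.x partial values;
// assumes blockDim.x is a power of two.
__device__ float block_sum(float val, float *smem) {
  const int tid = threadIdx.x;
  smem[tid] = val;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) smem[tid] += smem[tid + s];
    __syncthreads();
  }
  float total = smem[0];
  __syncthreads();  // so smem can be safely reused by a later call
  return total;
}

// Launch sketch: layer_norm_f32_kernel<<<N, 256, 256 * sizeof(float)>>>(x, y, g, b, N, K);
__global__ void layer_norm_f32_kernel(const float *x, float *y,
                                      float g, float b, int N, int K) {
  extern __shared__ float smem[];  // blockDim.x floats
  const int row = blockIdx.x;      // one block per row (token)
  const int tid = threadIdx.x;
  if (row >= N) return;

  const float *rx = x + row * K;
  float *ry = y + row * K;
  const float eps = 1e-5f;

  // mean over this row's K features (layer norm), not over the batch
  float sum = 0.0f;
  for (int i = tid; i < K; i += blockDim.x) sum += rx[i];
  float mean = block_sum(sum, smem) / (float)K;

  // variance over the same row
  float var = 0.0f;
  for (int i = tid; i < K; i += blockDim.x) {
    float d = rx[i] - mean;
    var += d * d;
  }
  float rstd = rsqrtf(block_sum(var, smem) / (float)K + eps);

  for (int i = tid; i < K; i += blockDim.x) {
    ry[i] = (rx[i] - mean) * rstd * g + b;
  }
}
```

A batch norm kernel would instead reduce down each column (one feature across all N rows), which is why the reduction axis is the quickest way to tell the two apart.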