CUDA-Learn-Notes
🎉 Modern CUDA Learn Notes with PyTorch: fp32, fp16, bf16, fp8/int8, flash_attn, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.

Has the author tested the `__threadfence()` in 0x09 softmax? It doesn't seem able to achieve grid-level synchronization between threads.
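
For context, `__threadfence()` only enforces memory-ordering visibility of a block's prior writes; it is not a barrier, so by itself it cannot synchronize blocks across the grid. Common workarounds are cooperative groups' `grid.sync()` or the "last block" pattern. Below is a minimal sketch of that pattern under assumed names (`block_done_count`, `last_block_pattern`, `partial_sums`); it is not the repo's actual kernel.

```cuda
#include <cuda_runtime.h>

// Assumed global counter used to detect the last block to finish (not repo code).
__device__ unsigned int block_done_count = 0;

__global__ void last_block_pattern(float* partial_sums, float* total) {
  // ... each block first writes its partial result to partial_sums[blockIdx.x] (omitted) ...
  __shared__ bool is_last_block;
  __threadfence();  // make this block's global writes visible to other blocks
  if (threadIdx.x == 0) {
    unsigned int done = atomicAdd(&block_done_count, 1u);
    is_last_block = (done == gridDim.x - 1);  // true only in the final block to arrive
  }
  __syncthreads();
  if (is_last_block && threadIdx.x == 0) {
    float sum = 0.0f;
    for (unsigned int i = 0; i < gridDim.x; ++i) sum += partial_sums[i];
    *total = sum;          // the last block is guaranteed to see all partials
    block_done_count = 0;  // reset for the next launch
  }
}
```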
Two forms show up:
1. `sum = warp_reduce_sum(sum);`
2. `if (warp == 0) sum = warp_reduce_sum(sum);`
In 0x03 warp/block reduce sum/max and in 0x09 softmax / softmax + vec4, the final sum uses the first form, while 0x04 block all reduce + vec4 uses the second form. My understanding is that the final sum should use the second form, since the per-warp partials all end up in the first warp. Thanks!
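
For reference, the two forms usually correspond to the two stages of one block-level reduction: every warp first does an intra-warp shuffle reduce (form 1), the per-warp partials are staged in shared memory, and then only warp 0 reduces those partials (form 2). A minimal sketch under assumed names (`warp_reduce_sum`, `block_reduce_sum`, `NUM_THREADS`); the repo's kernels may differ in detail.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Form 1: intra-warp reduction via shuffles; every lane ends with the warp's sum.
__device__ __forceinline__ float warp_reduce_sum(float val) {
  #pragma unroll
  for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
    val += __shfl_xor_sync(0xffffffff, val, offset);
  }
  return val;
}

// Two-stage block reduction: warp reduce, stash per-warp partials in shared
// memory, then let warp 0 alone reduce those partials (form 2 in the question).
template <int NUM_THREADS = 256>
__device__ float block_reduce_sum(float val) {
  constexpr int NUM_WARPS = (NUM_THREADS + WARP_SIZE - 1) / WARP_SIZE;
  __shared__ float warp_partials[NUM_WARPS];
  int warp = threadIdx.x / WARP_SIZE;
  int lane = threadIdx.x % WARP_SIZE;

  val = warp_reduce_sum(val);                 // form 1: all warps participate
  if (lane == 0) warp_partials[warp] = val;   // one partial per warp
  __syncthreads();

  // Only warp 0 needs to fold the NUM_WARPS partials into the final sum.
  val = (lane < NUM_WARPS) ? warp_partials[lane] : 0.0f;
  if (warp == 0) val = warp_reduce_sum(val);  // form 2: guarded by warp == 0
  return val;                                 // valid in warp 0
}
```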
Saw the mention of GELU in the issue, so I worked on it. There is no GELU implementation in torch for half precision (the reason is explained in readme.md), so...
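
For illustration, a half-precision GELU kernel typically uses the tanh approximation and does the intermediate math in fp32 before casting back to half. The sketch below is not the implementation referred to above; the kernel and helper names (`gelu_f16_kernel`, `gelu_half`) are assumptions.

```cuda
#include <cuda_fp16.h>
#include <math.h>

// GELU (tanh approximation): 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
// The math is done in fp32 and cast back to half to limit precision loss.
__device__ __forceinline__ half gelu_half(half xh) {
  float x = __half2float(xh);
  const float k0 = 0.7978845608028654f;  // sqrt(2 / pi)
  const float k1 = 0.044715f;
  float y = 0.5f * x * (1.0f + tanhf(k0 * (x + k1 * x * x * x)));
  return __float2half(y);
}

// Illustrative elementwise kernel; the actual kernel in this repo may differ.
__global__ void gelu_f16_kernel(const half* __restrict__ x, half* __restrict__ y, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) y[idx] = gelu_half(x[idx]);
}
```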
## TODO

- [ ] swish kernel
- [ ] gelu kernel
- [ ] RoPE kernel
- [x] pack elementwise_add
- [x] pack sigmoid
- [x] pack relu
- ...
Is the layer norm implementation in the readme actually a batch norm?
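
The distinction behind the question: layer norm computes the mean and variance over the K features of each individual row (one sample), while batch norm computes them over the N rows of the batch for each feature. A deliberately naive sketch of the per-row reduction follows, with assumed names (`layer_norm_naive`, `N`, `K`, `eps`); it is not the readme's implementation, which would use a block per row with shared-memory reductions.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Layer norm: mean/variance are taken over the K features of each row, not
// over the batch dimension as in batch norm. One thread handles one row here
// purely to make the reduction axis obvious.
__global__ void layer_norm_naive(const float* __restrict__ x,
                                 float* __restrict__ y,
                                 int N, int K, float eps) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= N) return;
  const float* xr = x + row * K;
  float* yr = y + row * K;

  float mean = 0.0f;
  for (int k = 0; k < K; ++k) mean += xr[k];
  mean /= K;

  float var = 0.0f;
  for (int k = 0; k < K; ++k) {
    float d = xr[k] - mean;
    var += d * d;
  }
  var /= K;

  float inv_std = rsqrtf(var + eps);
  for (int k = 0; k < K; ++k) yr[k] = (xr[k] - mean) * inv_std;
}
```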