Shao Tang
Minor fix
1. Move `vec_at` to `common.h` in `dev/cuda`. 2. Clean up the nested brackets in `classifier_fused.cu`.
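For reference, a minimal sketch of what such a `vec_at` helper could look like once it lives in `common.h` (the exact signature and qualifiers in the repo may differ):
```
// Hedged sketch: a helper for indexing into a float4 by runtime index.
// The actual definition in dev/cuda/common.h may differ in signature/qualifiers.
__device__ inline float& vec_at(float4& vec, int index) {
    return reinterpret_cast<float*>(&vec)[index];
}

__device__ inline float vec_at(const float4& vec, int index) {
    return reinterpret_cast<const float*>(&vec)[index];
}
```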
Because of the `void setup_main()` in common.h,
```
// setup cuBLAS and cuBLASLt
cublasCheck(cublasCreate(&cublas_handle));
cublasCheck(cublasLtCreate(&cublaslt_handle));
cudaCheck(cudaMalloc(&cublaslt_workspace, cublaslt_workspace_size));
```
`-lcublas -lcublasLt` was added to the compile comment for consistency...
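With that change, the compile comment at the top of a kernel file that includes common.h would look roughly like this (the file name and header text here are only an illustration):
```
/*
Fused classifier kernels (illustrative header comment).
Compile example:
nvcc -O3 --use_fast_math classifier_fused.cu -o classifier_fused -lcublas -lcublasLt
*/
```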
[dev/cuda] Include a matmul_backward_bias kernel based on the PMPP CoarsenedSumReduction kernel in 10.15
This kernel could serve as a demonstration of leveraging the PMPP materials in practice. The performance of the kernel depends on
```
const int coarse_factor
const int block_size_y
```
...
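For context, a minimal sketch of the PMPP-style coarsened sum reduction pattern (the names, the fixed `COARSE_FACTOR`, and the 1D block shape are illustrative; the kernel in this PR additionally uses a `block_size_y` dimension):
```
#define COARSE_FACTOR 4

// Each thread serially sums COARSE_FACTOR elements (thread coarsening),
// then the block combines the partial sums with a shared-memory tree reduction.
__global__ void coarsened_sum_reduction(const float* input, float* output, int n) {
    extern __shared__ float smem[];
    unsigned int tid = threadIdx.x;
    // each block owns a contiguous segment of blockDim.x * COARSE_FACTOR elements
    unsigned int segment = blockIdx.x * blockDim.x * COARSE_FACTOR;

    float sum = 0.0f;
    for (unsigned int tile = 0; tile < COARSE_FACTOR; ++tile) {
        unsigned int idx = segment + tile * blockDim.x + tid;
        if (idx < n) sum += input[idx];
    }
    smem[tid] = sum;

    // tree reduction; assumes blockDim.x is a power of 2
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        __syncthreads();
        if (tid < stride) smem[tid] += smem[tid + stride];
    }
    if (tid == 0) atomicAdd(output, smem[0]);
}
```
Launched as, e.g., `coarsened_sum_reduction<<<grid, block, block * sizeof(float)>>>(in, out, n)`. A larger coarse factor trades fewer blocks and less reduction traffic for more serial work per thread, which is why the benchmark numbers depend on it.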
The constexpr keyword allows the function to be evaluated at compile time if the arguments are known at compile time, which can eliminate the function call overhead.
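A small illustration of that point (the function and names here are hypothetical, not taken from the repo):
```
// Hypothetical example: a constexpr ceiling-division helper.
// When both arguments are compile-time constants, the call can be folded
// into a constant; with runtime arguments it behaves like a normal function.
constexpr int ceil_div_example(int a, int b) {
    return (a + b - 1) / b;
}

// evaluated at compile time: kGridSize becomes the literal 4
constexpr int kGridSize = ceil_div_example(1024, 256);
static_assert(kGridSize == 4, "computed at compile time");

int grid_for(int n, int block_size) {
    // evaluated at run time because the arguments are only known at run time
    return ceil_div_example(n, block_size);
}
```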
1. Based on the formula and the actual computation code, `B` and `eps` are *not* used in `_layer_norm_bwd_dx_fused` and are thus removed for clarity. 2. Some other minor...
Based on `@pytest.mark.parametrize("Z, H, N_CTX, HEAD_DIM", [(1, 2, 1024, 64)])` and `BATCH, N_HEADS, HEAD_DIM = 4, 32, 64`, `HEAD_DIM` is `64` in both the pytest and the benchmark, which triggers an assertion failure...
Due to the reduction at https://github.com/karpathy/llm.c/blob/master/dev/cuda/matmul_backward_bias.cu#L67, the block size needs to be a power of 2 for `kernel 1`. Otherwise the GPU result is wrong:
```
Using kernel...
```
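The reason is the halving-stride tree reduction; a minimal sketch of that pattern (illustrative, not the exact kernel 1 code) shows why non-power-of-2 block sizes silently drop partial sums:
```
// Sketch of the halving-stride shared-memory reduction pattern.
// The stride sequence blockDim.x/2, /4, ... only folds every partial sum
// when blockDim.x is a power of 2: with blockDim.x == 96 the sequence is
// 48, 24, 12, 6, 3, 1, and shared_sum[2] is never accumulated into the total.
__global__ void block_sum(const float* input, float* output) {
    __shared__ float shared_sum[1024];   // assumes blockDim.x <= 1024
    int tid = threadIdx.x;
    shared_sum[tid] = input[blockIdx.x * blockDim.x + tid];
    for (int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (tid < stride) {
            shared_sum[tid] += shared_sum[tid + stride];
        }
    }
    if (tid == 0) {
        output[blockIdx.x] = shared_sum[0];
    }
}
```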