Shao Tang
Minor fix
1. Move `vec_at` to `common.h` in `dev/cuda`. 2. Clean up the nested brackets in `classifier_fused.cu`.
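For reference, a minimal sketch of what such a `vec_at` helper could look like once it lives in `common.h` (the exact signature and qualifiers in the repo may differ):
```
// Hedged sketch: a helper for indexing into a float4 by runtime index.
// The actual definition in dev/cuda/common.h may differ in signature/qualifiers.
__device__ inline float& vec_at(float4& vec, int index) {
    return reinterpret_cast<float*>(&vec)[index];
}

__device__ inline float vec_at(const float4& vec, int index) {
    return reinterpret_cast<const float*>(&vec)[index];
}
```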
Because of the `void setup_main()` in common.h,
```
// setup cuBLAS and cuBLASLt
cublasCheck(cublasCreate(&cublas_handle));
cublasCheck(cublasLtCreate(&cublaslt_handle));
cudaCheck(cudaMalloc(&cublaslt_workspace, cublaslt_workspace_size));
```
`-lcublas -lcublasLt` was added to the compile comment for consistency...
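With that change, the compile comment at the top of a kernel file that includes common.h would look roughly like this (the file name and header text here are only an illustration):
```
/*
Fused classifier kernels (illustrative header comment).
Compile example:
nvcc -O3 --use_fast_math classifier_fused.cu -o classifier_fused -lcublas -lcublasLt
*/
```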
[dev/cuda] Include a matmul_backward_bias kernel based on the PMPP CoarsenedSumReduction kernel in 10.15
This kernel could serve as a demonstration of leveraging the PMPP materials in practice. The performance of the kernel depends on
```
const int coarse_factor
const int block_size_y
```
...
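For context, a minimal sketch of the PMPP-style coarsened sum reduction pattern (the names, the fixed `COARSE_FACTOR`, and the 1D block shape are illustrative; the kernel in this PR additionally uses a `block_size_y` dimension):
```
#define COARSE_FACTOR 4

// Each thread serially sums COARSE_FACTOR elements (thread coarsening),
// then the block combines the partial sums with a shared-memory tree reduction.
__global__ void coarsened_sum_reduction(const float* input, float* output, int n) {
    extern __shared__ float smem[];
    unsigned int tid = threadIdx.x;
    // each block owns a contiguous segment of blockDim.x * COARSE_FACTOR elements
    unsigned int segment = blockIdx.x * blockDim.x * COARSE_FACTOR;

    float sum = 0.0f;
    for (unsigned int tile = 0; tile < COARSE_FACTOR; ++tile) {
        unsigned int idx = segment + tile * blockDim.x + tid;
        if (idx < n) sum += input[idx];
    }
    smem[tid] = sum;

    // tree reduction; assumes blockDim.x is a power of 2
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        __syncthreads();
        if (tid < stride) smem[tid] += smem[tid + stride];
    }
    if (tid == 0) atomicAdd(output, smem[0]);
}
```
Launched as, e.g., `coarsened_sum_reduction<<<grid, block, block * sizeof(float)>>>(in, out, n)`. A larger coarse factor trades fewer blocks and less reduction traffic for more serial work per thread, which is why the benchmark numbers depend on it.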
The constexpr keyword allows the function to be evaluated at compile time if the arguments are known at compile time, which can eliminate the function call overhead.
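A small illustration of that point (the function and names here are hypothetical, not taken from the repo):
```
// Hypothetical example: a constexpr ceiling-division helper.
// When both arguments are compile-time constants, the call can be folded
// into a constant; with runtime arguments it behaves like a normal function.
constexpr int ceil_div_example(int a, int b) {
    return (a + b - 1) / b;
}

// evaluated at compile time: kGridSize becomes the literal 4
constexpr int kGridSize = ceil_div_example(1024, 256);
static_assert(kGridSize == 4, "computed at compile time");

int grid_for(int n, int block_size) {
    // evaluated at run time because the arguments are only known at run time
    return ceil_div_example(n, block_size);
}
```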
1. Based on the formula and the actual computation code, `B` and `eps` are *not* used in `_layer_norm_bwd_dx_fused` and are thus removed for clarity. 2. Some other minor...
Based on `@pytest.mark.parametrize("Z, H, N_CTX, HEAD_DIM", [(1, 2, 1024, 64)])` and `BATCH, N_HEADS, HEAD_DIM = 4, 32, 64`, `HEAD_DIM` is `64` in both the pytest and the benchmark, which triggers an assertion failure...
Due to the reduction at https://github.com/karpathy/llm.c/blob/master/dev/cuda/matmul_backward_bias.cu#L67, the block size needs to be a power of 2 for `kernel 1`. Otherwise the GPU result is wrong:
```
Using kernel...
```
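The reason is the halving-stride tree reduction; a minimal sketch of that pattern (illustrative, not the exact kernel 1 code) shows why non-power-of-2 block sizes silently drop partial sums:
```
// Sketch of the halving-stride shared-memory reduction pattern.
// The stride sequence blockDim.x/2, /4, ... only folds every partial sum
// when blockDim.x is a power of 2: with blockDim.x == 96 the sequence is
// 48, 24, 12, 6, 3, 1, and shared_sum[2] is never accumulated into the total.
__global__ void block_sum(const float* input, float* output) {
    __shared__ float shared_sum[1024];   // assumes blockDim.x <= 1024
    int tid = threadIdx.x;
    shared_sum[tid] = input[blockIdx.x * blockDim.x + tid];
    for (int stride = blockDim.x / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (tid < stride) {
            shared_sum[tid] += shared_sum[tid + stride];
        }
    }
    if (tid == 0) {
        output[blockIdx.x] = shared_sum[0];
    }
}
```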