Changes to `layernorm_forward` in `dev/cuda`
Remove cooperative groups
Following the instructions in #292, this removes the cooperative groups code from the existing layernorm forward kernels.
Benchmark
Performance before and after the changes:
| | Block Size | layernorm_forward3 | layernorm_forward4 | layernorm_forward5 |
|---|---|---|---|---|
| With cgs | 32 | 0.1373 ms, 366.49 GB/s | 0.1381 ms, 364.35 GB/s | 0.1732 ms, 290.67 GB/s |
| | 64 | 0.1473 ms, 341.73 GB/s | 0.1421 ms, 354.32 GB/s | 0.1348 ms, 373.39 GB/s |
| | 128 | 0.1776 ms, 283.32 GB/s | 0.1519 ms, 331.27 GB/s | 0.1293 ms, 389.28 GB/s |
| | 256 | 0.1678 ms, 300.02 GB/s | 0.1493 ms, 337.01 GB/s | 0.1459 ms, 344.90 GB/s |
| | 512 | 0.1704 ms, 295.34 GB/s | 0.1498 ms, 336.09 GB/s | 0.2136 ms, 235.61 GB/s |
| | 1024 | 0.1475 ms, 341.28 GB/s | 0.1441 ms, 349.33 GB/s | 0.4137 ms, 121.65 GB/s |
| Without cgs | 32 | 0.1382 ms, 364.23 GB/s | 0.1366 ms, 368.52 GB/s | 0.1715 ms, 293.44 GB/s |
| | 64 | 0.1466 ms, 343.30 GB/s | 0.1397 ms, 360.29 GB/s | 0.1356 ms, 371.27 GB/s |
| | 128 | 0.1759 ms, 286.09 GB/s | 0.1502 ms, 335.11 GB/s | 0.1326 ms, 379.48 GB/s |
| | 256 | 0.1675 ms, 300.56 GB/s | 0.1484 ms, 339.12 GB/s | 0.1411 ms, 356.80 GB/s |
| | 512 | 0.1674 ms, 300.70 GB/s | 0.1487 ms, 338.50 GB/s | 0.2315 ms, 217.41 GB/s |
| | 1024 | 0.1437 ms, 350.18 GB/s | 0.1445 ms, 348.27 GB/s | 0.4223 ms, 119.18 GB/s |
New implementation: layernorm_forward6
Similar to kernel4, but additionally uses shared memory to accelerate data access. Due to the smem limit of my GPU, the kernel cannot handle block sizes > 512 (with fixed C = 768 and warp size 32).
Benchmark
OS: Ubuntu 22.04
Device: NVIDIA GeForce RTX 3070 Laptop GPU
Format: (time, bandwidth)
| Block size | layernorm_forward4() | layernorm_forward5() | layernorm_forward6() |
|---|---|---|---|
| 32 | 0.1376 ms, 365.86 GB/s | 0.1735 ms, 290.03 GB/s | 0.1268 ms, 396.90 GB/s |
| 64 | 0.1418 ms, 355.04 GB/s | 0.1349 ms, 373.13 GB/s | 0.1271 ms, 395.86 GB/s |
| 128 | 0.1515 ms, 332.21 GB/s | 0.1304 ms, 385.93 GB/s | 0.1268 ms, 396.88 GB/s |
| 256 | 0.1479 ms, 340.38 GB/s | 0.1386 ms, 363.08 GB/s | 0.1267 ms, 397.23 GB/s |
| 512 | 0.1493 ms, 337.06 GB/s | 0.2123 ms, 237.10 GB/s | 0.1283 ms, 392.31 GB/s |
| 1024 | 0.1435 ms, 350.81 GB/s | 0.4182 ms, 120.35 GB/s | 0.1290 ms, 390.16 GB/s (still 512 here) |
Hi, this looks ok to me.
- the coding style in kernel6 is a bit off, e.g. i think you're using black? the cropping of the lines and such looks inconsistent with the rest of the code.
- am i reading this right that kernel6 is fairly fast? would it make sense to incorporate it into prod?
- cc @ngc92 or @ademeure to also possibly take a look
Hi, thanks for reviewing this PR.
> the coding style in kernel6 is a bit off, e.g. i think you're using black? the cropping of the lines and such looks inconsistent with the rest of the code.
I used clang-format, and my .clang-format is based on the Google code style. It's fine to re-format the code (by myself or others), but I don't have llm.c's format rule file.
> would it make sense to incorporate it into prod?
My machine doesn't have enough computational resources to do the training, so please run this test for me. If it cannot be used in production, rolling back the commits is acceptable.
> the coding style in kernel6 is a bit off, e.g. i think you're using black? the cropping of the lines and such looks inconsistent with the rest of the code.
I have tried to restore the code style of the former kernels. Does it look better now?
Hi, @karpathy.
I've tried running profile_gpt2.cu to profile the currently used layernorm_forward kernel (layernorm_forward_kernel3) in layernorm.cuh against my kernel (layernorm_forward_kernel4).
512 block size (current)
My kernel does not perform that well when the block size is 512. Since my GPU is a 3070 Laptop (CC 8.6) and its smem is limited, the bottleneck may be smem capacity.
Brief results:
256 block size
full-ncu-results
But both my kernel and the current one perform better with block size 256, so maybe the block size set in layernorm_forward() is not the best option.
Brief results:
Note that the name kernel6 is now taken, so I renamed the kernel to kernel7.
It combines the merits of kernel4 and kernel6: kernel7 achieves performance similar to kernel6 and kernel4 while using less shared memory than kernel6.
Benchmark
layernorm_forward_kernel6
block_size 32 | time 0.1288 ms | bandwidth 390.77 GB/s
block_size 64 | time 0.1282 ms | bandwidth 392.56 GB/s
block_size 128 | time 0.1284 ms | bandwidth 391.89 GB/s
block_size 256 | time 0.1275 ms | bandwidth 394.67 GB/s
block_size 512 | time 0.1282 ms | bandwidth 392.51 GB/s
block_size 1024 | time 0.3942 ms | bandwidth 127.68 GB/s
P.S. the poor performance at block size 1024 is due to the lack of smem on my GPU.
layernorm_forward_kernel7
block_size 32 | time 0.1285 ms | bandwidth 391.67 GB/s
block_size 64 | time 0.1271 ms | bandwidth 396.08 GB/s
block_size 128 | time 0.1271 ms | bandwidth 395.85 GB/s
block_size 256 | time 0.1271 ms | bandwidth 396.12 GB/s
block_size 512 | time 0.1281 ms | bandwidth 392.83 GB/s
block_size 1024 | time 0.1283 ms | bandwidth 392.16 GB/s
Hi @KarhouTam, we just merged a LayerNorm forward; I'm not 100% sure how this version is similar or different now.
Hi, @karpathy. Notice that kernel6 uses smem to store data from the input, weight, and bias:

```cuda
extern __shared__ char params[];
x128* s_weight = reinterpret_cast<x128*>(params);
x128* s_bias = reinterpret_cast<x128*>(params) + (C / x128::size);
x128* s_in = reinterpret_cast<x128*>(params) + ((2 + threadIdx.y) * C / x128::size);
```
In the layernorm forward pass, the weight and bias are only used at the end (to compute the final result), so I think it is unnecessary to store them in smem from the beginning.
smem size of kernel6: (2 + block_y) * C * sizeof(float), where block_y = block_size / WARP_SIZE
smem size of kernel7: block_y * C * sizeof(float)
So in summary, kernel7 reduces smem usage by 2 * C * sizeof(float).
Also, kernel6 doesn't use the formula var = mean2 - mean1 * mean1, so kernel6 runs 3 for-loops while kernel7 runs only 2.
But I can't figure out why kernel7 runs fewer loops yet only matches kernel6's performance. Maybe the overhead is not as big as I thought.