llm.c
LLM training in simple, raw C/CUDA
This is a faster version of the cool new kernel from #117 (still /dev/cuda/ only). The biggest difference is that it is optimised for doing one row per 1024-wide block rather...
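As a rough illustration of the "one row per 1024-wide block" layout, here is a serial plain-C emulation (not the actual CUDA kernel): each of the 1024 "threads" first accumulates a strided slice of the row, then the partials are combined in a binary tree, mirroring the `__syncthreads()`-separated phases a block-per-row reduction kernel would use. The function name and the sum operation are illustrative assumptions.

```c
#include <stddef.h>

#define BLOCK 1024  // one "block" of 1024 threads handles one row

// Serial emulation of a one-row-per-block parallel sum reduction.
float reduce_row(const float *row, int C) {
    float partial[BLOCK];
    // phase 1: each "thread" t accumulates a strided slice of the row
    for (int t = 0; t < BLOCK; t++) {
        float acc = 0.0f;
        for (int i = t; i < C; i += BLOCK) acc += row[i];
        partial[t] = acc;
    }
    // phase 2: binary tree reduction of the 1024 partials
    for (int s = BLOCK / 2; s > 0; s /= 2)
        for (int t = 0; t < s; t++)
            partial[t] += partial[t + s];
    return partial[0];
}
```

On a GPU the two phases run in parallel within the block, with a barrier between each tree step.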
This doesn't help us as is, but going forward, it's a first step towards padding the vocab dimension to a sane value that actually allows for fast implementations. I haven't...
`const` keyword additions and one new file, `unistd.h`, in the `platform/windows` directory.
Currently we only ever call the `gpt2_forward` function with a single, fixed setting of `B,T`, for both training and inference, e.g.:

```c
gpt2_forward(&model, gen_tokens, NULL, B, T);
```

However, in principle...
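To make the point concrete, here is a toy stand-in (not the real llm.c API) showing that a forward pass parameterized by `(B, T)` can in principle be called with different shapes for training versus inference, and that passing `NULL` targets skips the loss. The function name and the "loss" (a simple mismatch rate, to keep the sketch self-contained) are illustrative assumptions.

```c
#include <stddef.h>

// Toy forward pass: nothing forces B and T to be fixed across calls.
// targets == NULL means inference mode: no loss is computed.
float toy_forward(const int *tokens, const int *targets, int B, int T) {
    if (targets == NULL) return 0.0f;          // inference: skip loss
    float loss = 0.0f;
    for (int i = 0; i < B * T; i++)            // "loss" = mismatch rate
        loss += (tokens[i] != targets[i]) ? 1.0f : 0.0f;
    return loss / (float)(B * T);
}
```

The same function can be called with, say, `B=4, T=1024` for a training step and `B=1, T=64` for generation.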
I'm working on the C version of the code in preparation for #40. With **no** code modifications to llm.c I observe the following:
- `test_gpt2` works successfully and the loss...
Hi! Sometimes it's tricky or daunting to set up the hardware and environment for a script like this on a self-hosted cloud GPU. I was trying out llm.c on a...
I didn't see the backpropagation code in the train_gpt2.cu file; how does it compute gradients?

```c
// do a training step
clock_gettime(CLOCK_MONOTONIC, &start);
dataloader_next_batch(&train_loader);
gpt2_forward(&model, train_loader.inputs, train_loader.targets, B,...
```
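For intuition about what the backward kernels compute, here is a minimal hand-derived example (a toy, not the actual llm.c kernels): a one-parameter "layer" `y = w*x` with squared-error loss, where the gradient `dloss/dw = 2(y - t) * x` follows from the chain rule. The backward pass in train_gpt2.cu does the same thing at scale, with one hand-written gradient kernel per layer.

```c
// Toy forward: y = w*x, loss = (y - t)^2.
float toy_fwd(float w, float x, float t, float *y) {
    *y = w * x;
    float d = *y - t;
    return d * d;
}

// Toy backward: chain rule gives dloss/dw = 2*(y - t) * x.
float toy_bwd(float w, float x, float t) {
    float y = w * x;
    return 2.0f * (y - t) * x;
}
```

A finite-difference check, `(loss(w+h) - loss(w-h)) / (2h)`, is the standard way to verify such hand-written gradients.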
A larger `thread_reuse_factor` reduces the number of threads launched while increasing the per-thread load. Depending on the value of `B * T * OC` and the GPU card, it is...
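A serial plain-C emulation of the `thread_reuse_factor` idea (the kernel name and the element-wise operation here are illustrative assumptions): instead of launching one "thread" per output element, launch `n / reuse` threads and have each one process `reuse` consecutive elements, trading launch overhead for per-thread work.

```c
#include <stddef.h>

// Serial emulation: fewer "threads", each doing `reuse` elements.
void scale_with_reuse(float *out, const float *in, int n, int reuse) {
    int threads = (n + reuse - 1) / reuse;   // fewer threads launched
    for (int t = 0; t < threads; t++)        // each "thread"...
        for (int k = 0; k < reuse; k++) {    // ...handles `reuse` items
            int i = t * reuse + k;
            if (i < n) out[i] = 2.0f * in[i];
        }
}
```

The best value of `reuse` depends on `B * T * OC` and the card, which is why the snippet suggests tuning it.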
Why do I encounter this problem? My CPU is a 13900KF and my memory is 32 GB.
How much GPU RAM do I need? I tried training on my GTX 1650 with 4GB of RAM. Batch size is already 4, meaning that's going to be difficult to...
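A back-of-the-envelope estimate of why 4GB is tight (rough assumptions: fp32 training, the 124M-parameter GPT-2, and AdamW): the optimizer state alone keeps four float copies per parameter (weights, gradients, and the two Adam moments), before any activations, which themselves scale with `B * T`.

```c
// Rough VRAM estimate for fp32 training state with AdamW:
// 4 float copies per parameter (weights, grads, Adam m, Adam v).
double train_state_gib(long long n_params) {
    const int copies = 4;
    const int bytes_per_float = 4;
    return (double)n_params * copies * bytes_per_float
           / (1024.0 * 1024.0 * 1024.0);
}
```

For ~124M parameters this already comes to roughly 1.85 GiB, leaving little of a 4GB card for activations at batch size 4.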