Why do I encounter a "Bad magic model file" error when running train_gpt2fp32cu?

Open GGGrin8 opened this issue 3 months ago • 0 comments

When I followed the steps outlined in the README:

quick start (1 GPU, fp32 only)

If you won't be training on multiple nodes, aren't interested in mixed precision, and are interested in learning CUDA, the fp32 (legacy) files might be of interest to you. These are files that were "checkpointed" early in the history of llm.c and frozen in time. They are simpler, more portable, and possibly easier to understand. Run the 1 GPU, fp32 code like this:

chmod u+x ./dev/download_starter_pack.sh
./dev/download_starter_pack.sh
make train_gpt2fp32cu
./train_gpt2fp32cu

I encountered the following error:

klzhu@DESKTOP-39BBGJS:~/llm.c-master$ make train_gpt2fp32cu

→ cuDNN is manually disabled by default, run make with USE_CUDNN=1 to try to enable
✓ OpenMP found
✗ NCCL is not found, disabling multi-GPU support
---> On Linux you can try install NCCL with sudo apt install libnccl2 libnccl-dev
✗ MPI not found
✓ nvcc found, including GPU/CUDA support

/usr/local/cuda/bin/nvcc --threads=0 -t=0 --use_fast_math -std=c++17 -O3 train_gpt2_fp32.cu -lcublas -lcublasLt -lnvidia-ml -o train_gpt2fp32cu
train_gpt2_fp32.cu(62): warning #550-D: variable "cublas_compute_type" was set but never used
  static cublasComputeType_t cublas_compute_type;
                             ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
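
For anyone trying to reproduce this, it may help to look at the header of the model file that the starter pack downloads, since a "Bad magic" message usually means the first integer in the file is not the value the loader expects (an incomplete or failed download is one possible cause). Below is a minimal sketch, not taken from the repository. The helper names read_header_int and check_magic are made up for illustration. The file name gpt2_124M.bin and the expected magic value 20240326 are assumptions based on my reading of train_gpt2_fp32.cu, so please confirm both against your local copy of the source.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a checkpoint header; not part of llm.c.

# read_header_int FILE INDEX
# Prints the INDEX-th (0-based) 32-bit signed integer of FILE, decoded in
# host byte order, which is how a plain fread() into an int array sees it.
read_header_int() {
    od -A n -t d4 -j "$(( $2 * 4 ))" -N 4 "$1" | tr -d '[:space:]'
}

# check_magic FILE EXPECTED
# Returns 0 if the first int32 of FILE equals EXPECTED, 1 on a mismatch,
# and 2 if FILE is missing or empty.
check_magic() {
    if [ ! -s "$1" ]; then
        echo "missing or empty: $1" >&2
        return 2
    fi
    actual=$(read_header_int "$1" 0)
    if [ "$actual" = "$2" ]; then
        echo "magic ok: $actual"
    else
        echo "magic mismatch: got $actual, expected $2" >&2
        return 1
    fi
}
```

After sourcing the snippet, the check would be run as check_magic gpt2_124M.bin 20240326 (both arguments being the assumed values mentioned above). A return code of 2 points to a missing or empty file, while a mismatch suggests the file on disk is not the checkpoint format this binary expects.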

Sep 09 '25 15:09 GGGrin8