llm.c
Error: make: *** [Makefile:203: train_gpt2cu] Error 255
Environment:
- System: Ubuntu 22.04.2 LTS
- CUDA Version: cuda_12.1.r12.1/compiler.32688072_0
- nvcc: 12.1
I encounter an error when I execute:
make train_gpt2cu
Warning and error messages:
---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ OpenMPI found, OK to train with multiple GPUs
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/lustre/apps/apps/cuda/cuda-12.1/bin/nvcc -O3 -t=0 --use_fast_math -DMULTI_GPU -DENABLE_BF16 train_gpt2.cu -lcublas -lcublasLt -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/usr/lib/x86_64-linux-gnu/openmpi/include -lmpi -lnccl -o train_gpt2cu
train_gpt2.cu(284): warning #20012-D: __device__ annotation is ignored on a function("Packed128") that is explicitly defaulted on its first declaration
__attribute__((device)) Packed128() = default;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
train_gpt2.cu(201): error: identifier "__ushort_as_bfloat16" is undefined
__nv_bfloat162 add_val = (ptr_val & 0x3) ? __halves2bfloat162(__ushort_as_bfloat16(0), val)
^
train_gpt2.cu(201): error: identifier "__halves2bfloat162" is undefined
__nv_bfloat162 add_val = (ptr_val & 0x3) ? __halves2bfloat162(__ushort_as_bfloat16(0), val)
^
train_gpt2.cu(203): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (__nv_bfloat162 *, __nv_bfloat162)
atomicAdd(ptr_bf16, add_val);
^
train_gpt2.cu(242): error: no operator "+=" matches these operands
operand types are: floatX += float
val += __shfl_xor_sync(0xFFFFFFFF, val, offset);
^
train_gpt2.cu(284): warning #20012-D: __device__ annotation is ignored on a function("Packed128") that is explicitly defaulted on its first declaration
__attribute__((device)) Packed128() = default;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
train_gpt2.cu(608): error: no instance of overloaded function "__stcs" matches the argument list
argument types are: (floatX *, floatX)
__stcs(mean + idx, (floatX)m);
^
train_gpt2.cu(620): error: no instance of overloaded function "__stcs" matches the argument list
argument types are: (floatX *, floatX)
__stcs(rstd + idx, (floatX)s);
^
train_gpt2.cu(629): error: no instance of overloaded function "__ldcs" matches the argument list
argument types are: (const floatX *)
float n = s * ((float)__ldcs(x+c) - m);
^
train_gpt2.cu(630): error: no instance of overloaded function "__stcs" matches the argument list
argument types are: (floatX *, floatX)
__stcs(o+c, (floatX)(n * (float)weight[c] + (float)bias[c]));
^
train_gpt2.cu(650): error: no instance of overloaded function "__ldcs" matches the argument list
argument types are: (const floatX *)
q[idx] = __ldcs(&inp[inp_idx]);
^
train_gpt2.cu(651): error: no instance of overloaded function "__ldcs" matches the argument list
argument types are: (const floatX *)
k[idx] = __ldcs(&inp[inp_idx + NH * d]);
^
train_gpt2.cu(652): error: no instance of overloaded function "__ldcs" matches the argument list
argument types are: (const floatX *)
v[idx] = __ldcs(&inp[inp_idx + 2 * (NH * d)]);
^
train_gpt2.cu(688): error: no instance of overloaded function "__ldcs" matches the argument list
argument types are: (floatX *)
out[other_idx] = __ldcs(&inp[idx]);
^
train_gpt2.cu(769): error: no instance of overloaded function "__ldcs" matches the argument list
argument types are: (const floatX *)
float ev = expf(inv_temperature * ((float)__ldcs(x + i) - global_maxval));
^
train_gpt2.cu(770): error: no instance of overloaded function "__stcs" matches the argument list
argument types are: (floatX *, floatX)
__stcs(out + idx * T + i, (floatX)(ev * norm));
^
train_gpt2.cu(924): error: no instance of overloaded function "__ldcs" matches the argument list
argument types are: (const floatX *)
float dout_i = (float)__ldcs(&dout_bt[i]);
^
train_gpt2.cu(925): error: no instance of overloaded function "__ldcs" matches the argument list
argument types are: (const floatX *)
float norm_bti = ((float)__ldcs(&inp_bt[i]) - mean_bt) * rstd_bt;
^
train_gpt2.cu(996): error: no instance of overloaded function "__ldcs" matches the argument list
argument types are: (const floatX *)
float acc = (float)__ldcs(att_bth + t3) * ((float)__ldcs(datt_bth + t3) - local_sum);
^
train_gpt2.cu(996): error: no instance of overloaded function "__ldcs" matches the argument list
argument types are: (const floatX *)
float acc = (float)__ldcs(att_bth + t3) * ((float)__ldcs(datt_bth + t3) - local_sum);
^
train_gpt2.cu(997): error: no instance of overloaded function "__stcs" matches the argument list
argument types are: (floatX *, floatX)
__stcs(dpreatt_bth + t3, (floatX)(scale * acc));
^
train_gpt2.cu(1135): error: no operator "+=" matches these operands
operand types are: floatX += floatX
if (i < n) { dst[i] += (floatX)src[i]; }
^
train_gpt2.cu(80): warning #177-D: variable "ncclFloatN" was declared but never referenced
const ncclDataType_t ncclFloatN = ncclFloat;
^
20 errors detected in the compilation of "train_gpt2.cu".
make: *** [Makefile:203: train_gpt2cu] Error 255
This question might seem basic since I'm a beginner in CUDA and C. I would appreciate it if anyone could offer some solutions or suggestions.
You need to either disable BF16 (the -DENABLE_BF16 flag) or instruct your compiler to target a more recent GPU (Ampere) that actually has hardware support for bf16.
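For example (a sketch; the second option assumes your GPU really is Ampere-class or newer, since -arch=sm_80 only changes what nvcc targets, not what the hardware supports):

# option 1: build without bf16 by picking a different precision (see the PRECISION notes below)
PRECISION=FP32 make train_gpt2cu

# option 2: target compute capability 8.0 explicitly, e.g. by adding -arch=sm_80
# to the nvcc command from the log above
/lustre/apps/apps/cuda/cuda-12.1/bin/nvcc -O3 -t=0 --use_fast_math -DMULTI_GPU -DENABLE_BF16 -arch=sm_80 train_gpt2.cu -lcublas -lcublasLt -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/usr/lib/x86_64-linux-gnu/openmpi/include -lmpi -lnccl -o train_gpt2cu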
Similar error:
---------------------------------------------
→ cuDNN is manually disabled by default, run make with `USE_CUDNN=1` to try to enable
✓ OpenMP found
✓ OpenMPI found, OK to train with multiple GPUs
✓ nvcc found, including GPU/CUDA support
---------------------------------------------
/usr/local/cuda/bin/nvcc -O3 -t=0 --use_fast_math -DMULTI_GPU -DENABLE_FP16 train_gpt2.cu -lcublas -lcublasLt -L/usr/lib/x86_64-linux-gnu/openmpi/lib/ -I/usr/lib/x86_64-linux-gnu/openmpi/include -lmpi -lnccl -o train_gpt2cu
train_gpt2.cu(215): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (half2 *, half2)
train_gpt2.cu(242): error: no operator "+=" matches these operands
operand types are: floatX += __half
train_gpt2.cu(284): warning #20012-D: __device__ annotation is ignored on a function("Packed128") that is explicitly defaulted on its first declaration
train_gpt2.cu(1135): error: no operator "+=" matches these operands
operand types are: floatX += floatX
train_gpt2.cu(80): warning #177-D: variable "ncclFloatN" was declared but never referenced
3 errors detected in the compilation of "train_gpt2.cu".
make: *** [Makefile:203: train_gpt2cu] Error 255
Try upgrading your CUDA version to 12.4.1?
- Upgrade nvcc to 12.4.
- Check the compute capability of your GPU card (a quick way to print it is sketched below). In the CUDA toolkit header include/cuda_bf16.h (or .hpp) you might see:
#if defined(__CUDACC__) && (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 800) || defined(_NVHPC_CUDA))
This basically means these functions are not available for compute capability < 8.0.
Note that the header source depends on the CUDA toolkit version: things that cannot be compiled under 12.1 may compile under 12.4 (this was the case for me).
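If you're not sure what compute capability your card reports, here is a minimal standalone check (just the standard CUDA runtime API, nothing llm.c-specific):

// check_cc.cu -- print each visible GPU's compute capability
// build and run: nvcc check_cc.cu -o check_cc && ./check_cc
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // the cuda_bf16.h guard above requires __CUDA_ARCH__ >= 800,
        // i.e. compute capability 8.0 (Ampere) or newer
        printf("GPU %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}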
By default, PRECISION=BF16:

make
# It is the same as:
PRECISION=BF16 make

Compiling with one of the other precision options can also solve this issue:

PRECISION=FP16 make
# or
PRECISION=FP32 make
Related code in the Makefile:
# Precision settings, default to bf16 but ability to override
PRECISION ?= BF16
VALID_PRECISIONS := FP32 FP16 BF16
ifeq ($(filter $(PRECISION),$(VALID_PRECISIONS)),)
$(error Invalid precision $(PRECISION), valid precisions are $(VALID_PRECISIONS))
endif
ifeq ($(PRECISION), FP32)
PFLAGS = -DENABLE_FP32
else ifeq ($(PRECISION), FP16)
PFLAGS = -DENABLE_FP16
else
PFLAGS = -DENABLE_BF16
endif
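For context, these -DENABLE_* flags select the element type floatX used throughout train_gpt2.cu (which is why the errors above all mention floatX). A rough sketch of that mapping, illustrative rather than the exact llm.c source:

// sketch: how the precision flag plausibly maps to floatX (not verbatim llm.c code)
#if defined(ENABLE_FP32)
typedef float floatX;
#elif defined(ENABLE_FP16)
#include <cuda_fp16.h>   // half types and intrinsics
typedef half floatX;
#else // default: ENABLE_BF16
#include <cuda_bf16.h>   // bf16 types; many intrinsics guarded on __CUDA_ARCH__ >= 800
typedef __nv_bfloat16 floatX;
#endif

With PRECISION=FP32, floatX is plain float and the half/bf16 intrinsics drop out of the build entirely, which is why it compiles on pre-Ampere cards such as the V100 (sm_70).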
Upgrading solved my issue when I saw this error on a V100 GPU (an AWS P3 instance): updating to CUDA 12.5 fixed the make error.
Got it working with CUDA 12.4.