Bug in tensor core programming
I encountered a strange bug while programming tensor cores with the WMMA API on an A800. I tried to print the size of an element in a fragment. Normally sizeof(fp16) is 2, and the following code also outputs 2.
wmma::load_matrix_sync(a_frag, a , lda);
wmma::load_matrix_sync(b_frag, b , ldb);
wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
printf("%d\n", sizeof(a_frag.x[0]));
printf("%d\n", sizeof(a_frag.x[1]));
However, when I changed the code to the following form, the print statement output 2 and 0. Even when I swapped the order of a_frag.x[0] and a_frag.x[1], the output was still 2 and 0. Does anyone know why?
wmma::load_matrix_sync(a_frag, a , lda);
wmma::load_matrix_sync(b_frag, b , ldb);
wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
printf("%d %d\n", sizeof(a_frag.x[0]), sizeof(a_frag.x[1]));
My code runs in the following environment. The OS is Ubuntu 20.04.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800 80GB PCIe          Off | 00000000:03:00.0 Off |                    0 |
| N/A   43C    P0              68W / 300W |   2768MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
My compile command is:
nvcc tensor_core.cu -std=c++11 -lcublas -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -res-usage -lcudart -lineinfo -Xcompiler -fopenmp
Hello @blueWatermelonFri,
The format specifier %d expects an argument of type int, but sizeof yields a size_t. The correct version of the code is:
printf("%d %d\n", (int)sizeof(a_frag.x[0]), (int)sizeof(a_frag.x[1]));
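I guess the reason for the output of 2 and 0 may be that size_t takes 8 bytes, so the first sizeof value is passed as 0x0000000000000002. The first %d then consumes the low 4 bytes (0x00000002) and the second %d consumes the high 4 bytes (0x00000000), so the second sizeof value is never printed at all. That would also explain why swapping the two arguments changes nothing: both values are 2, so the first one always splits into 2 and 0.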
Additionally, you may have received a warning like:
warning #181-D: argument is incompatible with corresponding format string conversion (expected type "int" but argument has type "unsigned long long")
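To make the byte-splitting guess concrete, here is a minimal host-side sketch in plain C++ (it should also compile with nvcc). It does not reproduce how the device-side printf actually packs its arguments. It only shows that the raw bytes of a 64-bit value 2, read as two consecutive 32-bit words on a little-endian machine, come out as 2 and 0. The names Words, split_words and show are hypothetical helpers made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Two consecutive 32-bit slots, i.e. what two %d specifiers would consume.
struct Words {
    std::uint32_t first;
    std::uint32_t second;
};

// Copy the raw bytes of one 64-bit value into two 32-bit words.
// Assumption: little-endian host (e.g. x86-64), so the low half comes first.
Words split_words(std::uint64_t value) {
    Words w;
    static_assert(sizeof(Words) == sizeof(std::uint64_t), "expected 2 x 4 bytes");
    std::memcpy(&w, &value, sizeof(value));
    return w;
}

// Hypothetical helper: print the two words the way "%d %d" would show them.
void show(std::uint64_t value) {
    Words w = split_words(value);
    std::printf("%d %d\n", static_cast<int>(w.first), static_cast<int>(w.second));
}
```

Calling show(sizeof(short)) from main passes the 8-byte value 2 and prints the low and high 32-bit halves separately. In the real kernel the fix remains the cast to int shown above, so that each argument matches its %d.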