
Misc. bug: Sporadic MUL_MAT Failures in test-backend-ops for Nvidia backend

Open ShanoToni opened this issue 2 days ago • 1 comment

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA A100-PCIE-40GB)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz)
version: 4667 (d2fe216f)
built with gcc (GCC) 12.2.0 for x86_64-pc-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Test code

Command line

`./bin/test-backend-ops`

Problem description & steps to reproduce

A test failure was encountered while running MUL_MAT through test-backend-ops.

  • The failing mul_mat configuration was identified as MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]); a test case reproducing it was created here.
  • Failures appeared random; consecutive runs of test-backend-ops did not reproduce the error. Modifying test-backend-ops.cpp to add the mul_mat test case 1000 times (see the snippet below) reproduced the failure consistently, with at least a few of the 1000 cases failing.
    // Example of adding the failing mul_mat case 1000 times
    for (int i = 0; i < 1000; i++) {
        test_cases.emplace_back(new test_mul_mat(GGML_TYPE_Q5_1, GGML_TYPE_F32, 16, 1, 256, {1, 1}, {1, 1}));
    }
  • The test fails because the NMSE exceeds the maximum error threshold (a sketch of this check is given after the list below).
  • Example error output:
  MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.000508874 > 0.000500000     
    0  0.948417  1.035245, diff = -0.086828
    1 -2.924956 -2.844111, diff = -0.080845
    2 -1.777758 -1.695090, diff = -0.082667
    3  0.450649  0.537106, diff = -0.086457
    4 -4.114096 -4.030904, diff = -0.083191
    5 -0.682358 -0.596930, diff = -0.085428
    6 -8.252451 -8.167437, diff = -0.085014
    7 -0.692235 -0.606851, diff = -0.085384
    8 -5.382234 -5.304606, diff = -0.077628
    9  3.467584  3.552903, diff = -0.085320
   10 -7.941753 -7.861615, diff = -0.080138
   11  3.101702  3.186424, diff = -0.084722
   12  0.954475  1.037351, diff = -0.082876
   13  2.353770  2.437956, diff = -0.084186
   14 -1.223359 -1.139174, diff = -0.084185
   15  0.853322  0.939753, diff = -0.086431
  • The Nvidia backend appears to convert src1 to Q8_1 and then run mul_mat with Q5_1 and Q8_1 inputs. Could this conversion be the source of the precision issue? (See the quantization sketch after this list.)

  • The largest NMSE encountered across 20000 runs was 0.001409.

  • Is this degree of precision loss expected? The maximum error for the mul_mat tests is set to 5e-4. Should this be modified?
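
For context on the failing check: NMSE here is the squared error between the two backends' outputs, normalized by the energy of the reference output. The sketch below is a minimal standalone illustration of that computation and of the 5e-4 MUL_MAT threshold mentioned above; it is not the exact code from test-backend-ops.cpp, and the sample values are just the first four rows of the failing output.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Normalized mean squared error: sum((a - b)^2) / sum(a^2), with `a` taken as
// the reference output. This mirrors the kind of check test-backend-ops
// performs, but is a simplified sketch rather than the actual implementation.
static double nmse(const float * a, const float * b, size_t n) {
    double err = 0.0;
    double ref = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double d = double(a[i]) - double(b[i]);
        err += d * d;
        ref += double(a[i]) * double(a[i]);
    }
    return err / ref;
}

int main() {
    // First four rows of the failing output above (left and right columns).
    std::vector<float> a = { 0.948417f, -2.924956f, -1.777758f, 0.450649f };
    std::vector<float> b = { 1.035245f, -2.844111f, -1.695090f, 0.537106f };

    const double max_nmse_err = 5e-4; // MUL_MAT error threshold mentioned above
    const double e = nmse(a.data(), b.data(), a.size());
    std::printf("NMSE = %.9f -> %s\n", e, e > max_nmse_err ? "FAIL" : "OK");
    return 0;
}
```

Note that with only m=16, n=1 output values, a per-element offset of roughly 0.08 (as seen in the diffs above) is enough to push the NMSE past such a tight threshold.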
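
To illustrate why quantizing src1 to 8 bits can matter, here is a standalone sketch of a scale-plus-int8 block quantization similar in spirit to Q8_1 (32 values per block, one scale), measuring the relative round-trip error on random data with k=256 as in the failing case. The struct and function names are made up for this example, and the code is not ggml's actual quantize_row_q8_1.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Rough sketch of an 8-bit block quantization similar in spirit to Q8_1:
// 32 values per block, one float scale, int8 quants. Illustrative only.
struct BlockQ8Sketch {
    float  d;        // scale
    int8_t qs[32];   // quantized values
};

static BlockQ8Sketch quantize_block(const float * x) {
    BlockQ8Sketch b{};
    float amax = 0.0f;
    for (int i = 0; i < 32; i++) amax = std::max(amax, std::fabs(x[i]));
    b.d = amax / 127.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < 32; i++) b.qs[i] = (int8_t) std::lround(x[i] * id);
    return b;
}

int main() {
    // k = 256 as in the failing test case -> 8 blocks of 32 values.
    std::mt19937 rng(42);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> src1(256);
    for (float & v : src1) v = dist(rng);

    // Relative round-trip error introduced by the 8-bit quantization of src1.
    double err = 0.0, ref = 0.0;
    for (size_t blk = 0; blk < src1.size() / 32; blk++) {
        BlockQ8Sketch b = quantize_block(src1.data() + blk * 32);
        for (int i = 0; i < 32; i++) {
            const float x  = src1[blk * 32 + i];
            const float xq = b.d * b.qs[i];
            err += double(x - xq) * double(x - xq);
            ref += double(x) * double(x);
        }
    }
    std::printf("relative quantization error of src1: %.6f\n", err / ref);
    return 0;
}
```

Both the Q5_1 weights and the Q8_1-converted src1 contribute rounding error of this kind, which is one plausible (but unconfirmed) explanation for the occasional NMSE values slightly above 5e-4 asked about above.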

First Bad Commit

Due to the sporadic nature of the test failure, the originating commit has not been identified. Commit d2fe216f is the first commit on which the failure was encountered; the latest commit tested on which the error was reproduced is 4806498b.

Relevant log output

(Identical to the example error output above.)

ShanoToni · Feb 20 '25 13:02