ncnn produces incorrect results when running inference with Vulkan

error log

context (build/runtime environment)

Windows 11

how to reproduce

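With CPU inference the results are correct, but with GPU inference the returned results are wrong. I dumped every blob at runtime and found that the outputs already differ right after the first convolution layer.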

more

The main code is as follows:


    bool useGpu = true;
    bool useDebugParam = true;

    SetConsoleOutputCP(CP_UTF8);
    LOG_I("开始 face detection test...");

    // File path configuration
    std::string param_path;
    if (useDebugParam) {
        param_path = R"(D:\tmp\ncnn_pytorch\face_detector.ncnn_debug.param)";
    } else {
        param_path = R"(D:\tmp\ncnn_pytorch\face_detector.ncnn.param)";
    }
    std::string bin_path = R"(D:\tmp\ncnn_pytorch\face_detector.ncnn.bin)";

    std::string original_img_path = R"(D:\tmp\image\o\face_image_1080_1920.png)";
    std::string padded_image_save_path = R"(D:\tmp\image\face_detector_ncnn_padded.png)"; // change to the path you need

    std::string output_img_path = R"(D:\tmp\image\face_detector_ncnn.png)";
    std::string original_with_detection_output_img_path = R"(D:\tmp\image\face_detector_ncnn_with_original.png)";

    // Load the image
    cv::Mat originalImg = cv::imread(original_img_path, cv::IMREAD_UNCHANGED);
    if (originalImg.empty()) {
        LOG_E("图片未找到: %s", original_img_path.c_str());
        return -1;
    }

    // Convert channels: if the image has 4 channels, convert BGRA to RGB; otherwise convert BGR to RGB
    if (originalImg.channels() == 4) {
        LOG_D("COLOR_BGRA2RGB");
        cv::cvtColor(originalImg, originalImg, cv::COLOR_BGRA2RGB);
    } else {
        LOG_D("COLOR_BGR2RGB");
        cv::cvtColor(originalImg, originalImg, cv::COLOR_BGR2RGB);
    }

    // 1. Letterbox the image to get the padded image: 128x128, RGB
    PaddingParams padding_params{};
    cv::Mat padded = letterbox_padding(originalImg, cv::Size(128, 128), padding_params);

    ncnn::Mat mat_in;
    cv::Mat padded_float;

    if (useDebugParam) {
        mat_in = ncnn::Mat::from_pixels(padded.data, ncnn::Mat::PIXEL_RGB, padded.cols, padded.rows);
        const float norm_vals[3] = {1 / 255.f, 1 / 255.f, 1 / 255.f};
        mat_in.substract_mean_normalize(0, norm_vals);
        mat_in.dims = 4;
    }else {
        padded.convertTo(padded_float, CV_32FC3, 1.0 / 255.0);
        mat_in = ncnn::Mat(3, 128, 128, 1, padded_float.data);
    }
    print_ncnn_mat_shape(mat_in, "mat_in");

    ncnn::Net net;
    if (useGpu) {
        int gpu_count = ncnn::get_gpu_count();
        LOG_D("gpu_count:%d", gpu_count);
        if (gpu_count <= 0) {
            LOG_E("gpu_count<=0");
            return -1;
        }

        LOG_D("use_vulkan_compute");
        net.opt.use_vulkan_compute = true;

        // set specified vulkan device before loading param and model
        // net.set_vulkan_device(0); // use device-0

        net.opt.use_fp16_packed = false;
        net.opt.use_fp16_storage = false;
        net.opt.use_fp16_arithmetic = false;
        net.opt.use_int8_storage = false;
        net.opt.use_int8_arithmetic = false;
    }

    LOG_I("load_param: %s", param_path.c_str());
    if (net.load_param(param_path.c_str()) != 0) {
        LOG_E("加载 param 文件失败");
        return -1;
    }
    LOG_I("load_model: %s", bin_path.c_str());
    if (net.load_model(bin_path.c_str()) != 0) {
        LOG_E("加载 bin 文件失败");
        return -1;
    }

    ncnn::Extractor ex = net.create_extractor();
    // Set the input blob name to "in0"
    LOG_D("ex.input");
    ex.input("in0", mat_in);

    // Run inference and extract the outputs "out0" and "out1"
    LOG_D("ex.extract");
    ncnn::Mat regressors, scores;
    ex.extract("out0", regressors);
    ex.extract("out1", scores);
    print_ncnn_mat_shape(regressors, "regressors");
    print_ncnn_mat_shape(scores, "scores");

    int num_regressors = regressors.w * regressors.h * regressors.c; // 896*16
    int num_scores = scores.w * scores.h * scores.c; // 896

    std::vector<float> reg_vec((float *) regressors.data, (float *) regressors.data + num_regressors);
    std::vector<float> score_vec((float *) scores.data, (float *) scores.data + num_scores);

    // Apply clip(-100, 100) to score_vec, then compute the sigmoid
    for (auto &s: score_vec) {
        if (s < -100.0f) s = -100.0f;
        if (s > 100.0f) s = 100.0f;
        s = 1.0f / (1.0f + std::exp(-s));
    }
    // Find the index of the maximum score
    int max_index = std::distance(score_vec.begin(), std::max_element(score_vec.begin(), score_vec.end()));
    float max_score = score_vec[max_index];
    LOG_I("最大分数: %.4f, 索引: %d", max_score, max_index);

The useGpu flag switches between CPU and GPU inference (bool useGpu = true;). The useDebugParam flag selects whether the manually adjusted param file is used (bool useDebugParam = true;).

The model is an ncnn model converted from ONNX with pnnx. The input part of the param file that pnnx produced is:

Input                    in0                      0 1 in0
Permute                  permute_56               1 1 in0 1 0=4

After a manual adjustment, a tensor with the conventional shape can be passed in:

Input                    in0                      0 1 in0
Permute                  permute_56               1 1 in0 1 0=6

The only difference is the Permute type parameter (0=4 changed to 0=6).
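
As background for why the input layout matters here: cv::Mat stores pixels interleaved (HWC), while ncnn::Mat stores each channel as its own plane (CHW). The sketch below shows that reordering on a flat float buffer in plain C++. The helper name hwc_to_chw is hypothetical, and this is an illustration of the layout difference only, not the code of ncnn's Permute layer or a statement of what 0=4 or 0=6 select.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: reorder an interleaved H x W x C buffer
// (as stored by cv::Mat) into planar C x H x W order.
std::vector<float> hwc_to_chw(const std::vector<float>& hwc,
                              std::size_t h, std::size_t w, std::size_t c) {
    assert(hwc.size() == h * w * c);
    std::vector<float> chw(hwc.size());
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < w; ++x) {
            for (std::size_t k = 0; k < c; ++k) {
                // Source: channel k of pixel (y, x); destination: plane k.
                chw[(k * h + y) * w + x] = hwc[(y * w + x) * c + k];
            }
        }
    }
    return chw;
}
```

For a 2x2 RGB image, the twelve interleaved values come out as three consecutive planes of four values each.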

Current behavior: when useGpu = false, the output is correct whether useDebugParam is true or false. When useGpu = true, output is produced for both values of useDebugParam, but the values are wrong.

The complete project is in the attachment ncnn-test.zip.

Some of the dumped blobs are attached in blob.zip. The first two blobs are identical between CPU and GPU; the differences start at the third blob.
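
A comparison like the one described above can be automated. The following is a minimal sketch in plain C++, assuming each blob has already been read into a std::vector<float>; the names DiffReport and compare_blobs and the tolerance value are hypothetical and not part of ncnn.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Result of comparing a CPU blob dump against a GPU blob dump.
struct DiffReport {
    std::ptrdiff_t first_mismatch; // index of first differing element, -1 if none
    float max_abs_diff;            // largest absolute difference seen (NaNs are skipped)
};

// Hypothetical helper: elements count as different when |cpu - gpu| > tol.
// A NaN on either side is also treated as a mismatch.
DiffReport compare_blobs(const std::vector<float>& cpu,
                         const std::vector<float>& gpu, float tol) {
    assert(cpu.size() == gpu.size());
    DiffReport report{-1, 0.0f};
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        const float d = std::fabs(cpu[i] - gpu[i]);
        if (d > report.max_abs_diff) report.max_abs_diff = d;
        if (!(d <= tol) && report.first_mismatch < 0) {
            report.first_mismatch = static_cast<std::ptrdiff_t>(i);
        }
    }
    return report;
}
```

Running this per blob in network order shows which layer's output diverges first and at which element the divergence starts.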

Feb 17 '25 16:02 XingRay

https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-produce-wrong-result#disable-fp16 Try testing with fp16 disabled.

Feb 18 '25 03:02 nihui

https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-produce-wrong-result#disable-fp16 Try testing with fp16 disabled.

I have already tried both enabling and disabling the options below:

    if (gpu_count > 0) {
        LOG_D("use_vulkan_compute");
        net.opt.use_vulkan_compute = true;

        // set specified vulkan device before loading param and model
        net.set_vulkan_device(0); // use device-0

        net.opt.use_fp16_packed = false;
        net.opt.use_fp16_storage = false;
        net.opt.use_fp16_arithmetic = false;
        net.opt.use_int8_storage = false;
        net.opt.use_int8_arithmetic = false;
    }

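The result is the same. I found that in blob "3" slightly less than the first half of the values match, and the differences start around the middle. I compared the dumps with a diff tool:

On the right you can see that the first part is identical:

Blob "3" is the output of this operator in the figure: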

Feb 18 '25 05:02 XingRay

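I ported the program from Windows to Android, and the behavior matches the Windows run:

CPU inference produces correct results. GPU inference returns results, but the values are wrong.

The log from GPU inference is: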

    00:09:41.055 D COLOR_BGRA2RGB
    00:09:41.072 D mat_in shape: c=3, d=1, h=128, w=128, dims=4
    00:09:41.073 I QUALCOMM build : fdd61e0, I20154638fb Build Date : 10/07/20 Shader Compiler Version : EV031.27.05.01 Local Branch : Remote Branch : refs/tags/AU_LINUX_ANDROID_LA.UM.8.3.R1.10.00.00.520.058 Remote Branch : NONE Reconstruct Branch : NOTHING
    00:09:41.073 I Build Config : S P 8.0.11 AArch64
    00:09:41.074 W [0 Adreno (TM) 630] queueC=0[3] queueG=0[3] queueT=0[3]
    00:09:41.074 W [0 Adreno (TM) 630] bugsbn1=1 bugbilz=0 bugcopc=0 bugihfa=1
    00:09:41.074 W [0 Adreno (TM) 630] fp16-p/s/u/a=1/0/0/0 int8-p/s/u/a=1/0/0/0
    00:09:41.074 W [0 Adreno (TM) 630] subgroup=64 basic/vote/ballot/shuffle=1/1/0/0
    00:09:41.074 W [0 Adreno (TM) 630] fp16-8x8x16/16x8x8/16x8x16/16x16x16=0/0/0/0
    00:09:41.074 D gpu_count:1
    00:09:41.074 D use_vulkan_compute
    00:09:41.074 I load_param: /storage/emulated/0/test/face_detection/face_detector.ncnn_debug.param
    00:09:41.079 I load_model: /storage/emulated/0/test/face_detection/face_detector.ncnn.bin
    00:09:44.132 D ex.input
    00:09:44.132 D ex.extract
    00:09:44.286 D regressors shape: c=1, d=1, h=896, w=16, dims=2
    00:09:44.286 D scores shape: c=1, d=1, h=896, w=1, dims=2
    00:09:44.287 I 最大分数: 0.3218, 索引: 691
    00:09:44.290 I 检测结果保存至: /storage/emulated/0/test/output/face_detector_ncnn.png
    00:09:44.465 I 原始图像检测结果保存至: /storage/emulated/0/test/output/face_detector_ncnn_with_original.png

The code that initializes the net is:

    ncnn::Net net;
    if (useGpu) {
        int gpu_count = ncnn::get_gpu_count();
        LOG_D("gpu_count:%d", gpu_count);
        if (gpu_count <= 0) {
            LOG_E("gpu_count<=0");
            return;
        }

        LOG_D("use_vulkan_compute");
        net.opt.use_vulkan_compute = true;

        // set specified vulkan device before loading param and model
        // net.set_vulkan_device(0); // use device-0

        net.opt.use_fp16_packed = false;
        net.opt.use_fp16_storage = false;
        net.opt.use_fp16_arithmetic = false;
        net.opt.use_int8_storage = false;
        net.opt.use_int8_arithmetic = false;
    }

Feb 19 '25 16:02 XingRay

I am experiencing the same issue. CPU-based inference is correct, but Vulkan produces invalid output on all platforms: Mac, AMD, and NVIDIA. I believe one of the Vulkan ops implemented in ncnn has a bug. From my rough translation of the thread above, it appears to be the 2D convolution.

Mar 14 '25 18:03 koush

@XingRay

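The problem is in how the input data is constructed. Your code constructs a 4D Mat, but it should be a 3D Mat; with that change it works.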

        // mat_in = ncnn::Mat(3, 128, 128, 1, padded_float.data);
        mat_in = ncnn::Mat(3, 128, 128, padded_float.data);
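
To see why a 3D Mat with w=3, h=128, c=128 can sit directly on top of the interleaved float image, the sketch below compares the two offset formulas in plain C++. It assumes tightly packed planes with no padding between them, it does not call ncnn, and the function names hwc_offset, whc_offset, and views_agree are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Offset of channel k of pixel (row, col) in an interleaved H x W x C image.
constexpr std::size_t hwc_offset(std::size_t row, std::size_t col, std::size_t k,
                                 std::size_t width, std::size_t channels) {
    return (row * width + col) * channels + k;
}

// Offset of element (x, y, z) in a tightly packed 3D array of extent
// w x h x c, where x varies fastest and z slowest.
constexpr std::size_t whc_offset(std::size_t x, std::size_t y, std::size_t z,
                                 std::size_t w, std::size_t h) {
    return (z * h + y) * w + x;
}

// Returns true when viewing a rows x cols x 3 interleaved image as a 3D
// array with (w=3, h=cols, c=rows) addresses every element at the same offset.
bool views_agree(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    const std::size_t channels = 3;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            for (std::size_t k = 0; k < channels; ++k) {
                if (hwc_offset(r, c, k, cols, channels) !=
                    whc_offset(k, c, r, channels, cols)) {
                    return false;
                }
            }
        }
    }
    return true;
}
```

Under these assumptions the 3D and 4D constructions address the same memory, so the suggested change affects the number of dimensions the Mat reports rather than the bytes it wraps.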

Apr 11 '25 08:04 nihui

I am experiencing the same issue. CPU-based inference is correct, but Vulkan produces invalid output on all platforms: Mac, AMD, and NVIDIA. I believe one of the Vulkan ops implemented in ncnn has a bug. From my rough translation of the thread above, it appears to be the 2D convolution.

https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-produce-wrong-result

If you still have problems, please raise an issue and attach your model file and input.

Apr 11 '25 08:04 nihui

I am experiencing the same issue. CPU-based inference is correct, but Vulkan produces invalid output on all platforms: Mac, AMD, and NVIDIA. I believe one of the Vulkan ops implemented in ncnn has a bug. From my rough translation of the thread above, it appears to be the 2D convolution.

https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-produce-wrong-result

If you still have problems, please raise an issue and attach your model file and input.

Here https://github.com/Tencent/ncnn/issues/5990

Apr 13 '25 15:04 koush

@XingRay

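The problem is in how the input data is constructed. Your code constructs a 4D Mat, but it should be a 3D Mat; with that change it works.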
    // mat_in = ncnn::Mat(3, 128, 128, 1, padded_float.data);
    mat_in = ncnn::Mat(3, 128, 128, padded_float.data);

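I tried all three ways of constructing the Mat shown below, and the outcome is the same each time: CPU mode produces correct output, while the results are wrong with Vulkan enabled. All three constructions also give the same wrong results.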

//    ncnn::Mat in_mat(3, 128, 128, 1, padded_float.data);
    ncnn::Mat in_mat(3, 128, 128, padded_float.data);
//    ncnn::Mat in_mat = ncnn::Mat::from_pixels(padded_float.data, ncnn::Mat::PixelType::PIXEL_RGB, 128, 128);

The source code, model, test data, and so on are in the attachment ncnn-test01.zip.

Jun 14 '25 18:06 XingRay