FastDeploy: FP32, FP16, and INT8 precision

Does PP_OCR_v3 in FastDeploy support FP32, FP16, and INT8 inference deployment? Also, how do I enable multi-batch inference?

Nov 23 '22 10:11 namemzy

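1. FP32 is definitely supported. FP16 requires the TRT backend; enable it with option.enable_trt_fp16().
2. In our tests, INT8 quantization of the PPOCR series gave no speedup and hurt accuracy, so the one-click model compression tool does not support it. You can still quantize the model with paddleslim and deploy the INT8 model directly; deployment is no different from an FP32 model.
3. Multi-batch inference is already applied automatically in the cls and rec models, with no switch needed. If you also want multi-batch inference for the det model, change the input of ppocrv3.predict() to a list of images; that enables multi-batch inference for the det model as well.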

Nov 23 '22 11:11 yunyaoXYY

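OK, thanks! The first two points are fine. On the third: ppocrv3.predict() does not support multi-batch inference, but ppocr_v3.batch_predict() can be used. However, multi-batch inference gave me no speedup. My test environment and results:

Ubuntu 20.04 Server
CPU: Gold 6226R
GPU: RTX 3090
CUDA: 11.2
Paddle: paddlepaddle-gpu 2.3.2.post112
FastDeploy: fastdeploy-gpu-python 0.8.0
Python: 3.8.15

All times in the table are converted to per-image latency, in ms.

[Screenshot of the results table: 微信截图_20221124143409]

Here is my code; I'm not sure whether I'm using it correctly.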

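import time

import cv2
import fastdeploy as fd

# args and build_option() are defined elsewhere in the poster's script (not shown here).
runtime_option = build_option(args)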
runtime_option.enable_trt_fp16()

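# When using TRT, set the dynamic shape for each of the three models' runtimes and create the models.
# Note: set the classifier's dynamic input and create the classifier only after the detection model has been created; the same applies to the recognizer.
# If you want to change the detection model's input shape, we recommend making its width and height multiples of 32.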
det_option = runtime_option
det_option.set_trt_input_shape("x", [1, 3, 64, 64], [16, 3, 640, 640],
                               [16, 3, 960, 960])
# The TRT engine file can be cached to local disk
det_option.set_trt_cache_file(args.det_model  + "/det_trt_cache_fp16.trt")
det_model = fd.vision.ocr.DBDetector(
    det_model_file, det_params_file, runtime_option=det_option)

cls_option = runtime_option
cls_option.set_trt_input_shape("x", [1, 3, 48, 10], [16, 3, 48, 320],
                               [64, 3, 48, 1024])
# The TRT engine file can be cached to local disk
cls_option.set_trt_cache_file(args.cls_model  + "/cls_trt_cache_fp16.trt")
cls_model = fd.vision.ocr.Classifier(
    cls_model_file, cls_params_file, runtime_option=cls_option)

rec_option = runtime_option
rec_option.set_trt_input_shape("x", [1, 3, 48, 10], [16, 3, 48, 320],
                               [64, 3, 48, 2304])
# The TRT engine file can be cached to local disk
rec_option.set_trt_cache_file(args.rec_model  + "/rec_trt_cache_fp16.trt")
rec_model = fd.vision.ocr.Recognizer(
    rec_model_file, rec_params_file, rec_label_file, runtime_option=rec_option)

# Create PP-OCR by chaining the three models; cls_model is optional and can be set to None if not needed
ppocr_v3 = fd.vision.ocr.PPOCRv3(
    det_model=det_model, cls_model=cls_model, rec_model=rec_model)

# Prepare the input image
im = cv2.imread(args.image)

# Run prediction and print the result
t1 = time.time()
result = ppocr_v3.batch_predict([im, im, im, im])
t2 = time.time()
print("cost : {} ms".format((t2 - t1) * 1000))

for batch in [1, 2, 4, 8, 16]:
    batch_images = [im for i in range(batch)]
    print(len(batch_images))
    total_cnt = 100
    t1 = time.time()
    for i in range(total_cnt):
        result = ppocr_v3.batch_predict(batch_images)
    t2 = time.time()
    print("batch_size {} cost : {} ms".format(batch, (t2 - t1) * 1000 / (total_cnt * batch)))

Nov 24 '22 06:11 namemzy
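
The timing loop in the post above can be factored into a small helper that runs a few untimed warm-up calls for each batch size before measuring. This is only a sketch of one common benchmarking pattern; it is not claimed to explain the numbers reported in the thread. `per_image_latency_ms` and `fake_predict` are hypothetical names, and `fake_predict` stands in for `ppocr_v3.batch_predict` so the snippet runs without a model.

```python
import time


def per_image_latency_ms(predict_fn, images, warmup=5, repeats=100):
    """Return mean wall-clock time per image in ms, excluding warm-up calls."""
    for _ in range(warmup):
        predict_fn(images)  # untimed, so one-off setup costs are not measured
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(images)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / (repeats * len(images))


def fake_predict(images):
    """Hypothetical stand-in for ppocr_v3.batch_predict; does trivial work."""
    return [len(img) for img in images]


for batch in [1, 2, 4, 8, 16]:
    batch_images = ["dummy"] * batch
    ms = per_image_latency_ms(fake_predict, batch_images, warmup=2, repeats=10)
    print("batch_size {} cost : {:.4f} ms per image".format(batch, ms))
```

To benchmark the real pipeline, one would pass `ppocr_v3.batch_predict` and a list of images loaded with `cv2.imread` in place of the stand-ins.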

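Hi, PPOCR currently supports batching (BS) in the cls and rec models, which is already much faster than the old PPOCR. I got one thing wrong earlier: ppocr_v3.batch_predict and ppocr_v3.Predict are currently functionally the same (whether you pass one image or several at a time). The Det detection model uses a fixed batch size of 1; the classification and recognition stages that follow do support batching.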

Nov 24 '22 07:11 yunyaoXYY
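
The behaviour described in the reply above can be illustrated with a toy count of forward passes. This is a simplified model, not FastDeploy's implementation: it assumes detection runs once per image (batch size 1) and that the text boxes cropped from each image are recognized in chunks of `rec_batch_size`. The function name and the box counts are hypothetical.

```python
import math


def count_model_calls(boxes_per_image, rec_batch_size):
    """Count forward passes in a simplified det -> rec pipeline.

    boxes_per_image: number of detected text boxes for each input image.
    rec_batch_size: how many cropped boxes the recognizer takes per call.
    """
    # Assumption: the detector always runs with batch size 1, once per image.
    det_calls = len(boxes_per_image)
    # Assumption: boxes from one image are recognized in fixed-size chunks.
    rec_calls = sum(math.ceil(n / rec_batch_size) for n in boxes_per_image)
    return det_calls, rec_calls


boxes = [10, 10, 10, 10]  # four images, ten text boxes each (made-up numbers)
print(count_model_calls(boxes, rec_batch_size=1))  # (4, 40)
print(count_model_calls(boxes, rec_batch_size=6))  # (4, 8)
```

Under these assumptions, batching reduces the number of recognizer calls, while the number of detector calls grows with the number of images no matter how many are passed in at once.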

Understood, thanks for the explanation!

Nov 24 '22 07:11 namemzy

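Hello, I tested multi-batch inference on the rec model alone and saw a substantial speedup (see the table below). But when I ran the full pp_ocr_v3 pipeline and compared the inference speed of the Python and C++ versions, the C++ version was actually slower (see the table below). What could be causing this? Test code: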

std::vector<int> batchs = {1, 2, 4, 8, 16};
  for (auto batch : batchs)
  {
    // Build batch_images
    std::vector<cv::Mat> batch_images;
    for (int image_id = 0; image_id < batch; image_id++)
    {
      batch_images.push_back(im);
    }

    int total_cnt = 100;
    std::vector<fastdeploy::vision::OCRResult> results;
    auto start = std::chrono::system_clock::now();
    for (int i = 0; i < total_cnt; i++){
      ppocr_v3.BatchPredict(batch_images, &results);
    }
    auto end = std::chrono::system_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    double time_ms = double(duration.count()) *
                1000 * std::chrono::microseconds::period::num / std::chrono::microseconds::period::den;
    std::cout << "Cost " << time_ms / (total_cnt * batch) << " ms" << std::endl;
    for (int id = 0; id < batch; id++)
    {
        // std::cout << results[id].Str() << std::endl;
    }
  }

Runtime environment:

Ubuntu 20.04 Server
CPU: Gold 6226R
GPU: RTX 3090
CUDA: 11.2
Paddle: paddlepaddle-gpu 2.3.2.post112
FastDeploy: fastdeploy-gpu-python 0.8.0
Python: 3.8.15

All times in the table are converted to per-image latency, in ms.

Nov 30 '22 04:11 namemzy

I switched to FP16, but inference on the same image takes almost the same time as with FP32. Why is that?

Jan 26 '24 08:01 txy00001
