
Is the Serving 0.6 issue of GPU memory not being released (issue #767) fixed in the new version?

Open Juruobudong opened this issue 2 years ago • 15 comments

With the image registry.baidubce.com/paddlepaddle/serving:0.9.0-cuda10.2-cudnn8-devel, Serving 0.6 does not release GPU memory when idle after running inference many times.

Juruobudong avatar Aug 13 '22 08:08 Juruobudong


Which CUDA version of the Docker image are you using, and which GPU model? @Juruobudong

xiulianzw avatar Aug 16 '22 02:08 xiulianzw

Which CUDA version of the Docker image are you using, and which GPU model? @Juruobudong

Docker version 20.10.5, and the GPU is a Tesla T4.

Juruobudong avatar Aug 17 '22 12:08 Juruobudong

Which CUDA version of the Docker image are you using, and which GPU model? @Juruobudong

Hello, has this problem been resolved? Do we need to upgrade to 0.7 or above?

Juruobudong avatar Aug 22 '22 09:08 Juruobudong

From my tests, pdserving_0.9.0 with CUDA 10.1 does not seem to have the GPU memory overflow problem, while on CUDA 11.2 memory release is indeed broken. I recommend using the CUDA 10.1 version.

xiulianzw avatar Aug 22 '22 09:08 xiulianzw

OK, thanks. We are on CUDA 10.2; I will give it a try.

Juruobudong avatar Aug 22 '22 09:08 Juruobudong

I deploy with Docker, using registry.baidubce.com/paddlepaddle/serving:0.9.0-cuda10.1-cudnn7-runtime. The memory not being released looks like a PaddlePaddle problem, so pay attention to your PaddlePaddle version.

xiulianzw avatar Aug 22 '22 09:08 xiulianzw

I deploy with Docker, using registry.baidubce.com/paddlepaddle/serving:0.9.0-cuda10.1-cudnn7-runtime. The memory not being released looks like a PaddlePaddle problem, so pay attention to your PaddlePaddle version. @xiulianzw

paddle-serving-app 0.7.0
paddle-serving-client 0.7.0
paddle-serving-server-gpu 0.7.0.post102
paddlepaddle-gpu 2.2.1.post101

OK. Having checked, could the problem be that our server's paddlepaddle-gpu build (post101) does not match the serving build (post102)?
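(One way to compare the two builds locally, as a minimal sketch: it only prints the installed Paddle version and the CUDA/cuDNN it was built against, which still has to be compared by eye against the postXXX suffix of paddle-serving-server-gpu.)

import paddle

# Show the installed PaddlePaddle version and the CUDA toolkit it was built against,
# so it can be compared with the post101 / post102 suffixes mentioned above.
print("paddlepaddle-gpu :", paddle.__version__)      # e.g. 2.2.1
print("built for CUDA   :", paddle.version.cuda())   # e.g. 10.1
print("built for cuDNN  :", paddle.version.cudnn())

# Basic sanity check that the GPU install actually works in this container.
paddle.utils.run_check()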

Juruobudong avatar Aug 23 '22 08:08 Juruobudong

On the first question: after every client inference request, the server's GPU memory is never released; it stays allocated even when the service is idle.

Second question: if I send an 8000*6000 image, GPU memory is exceeded and the error below is raised; after that happens once, images of a size that normally works no longer produce results either.

/PaddleModelDeploy/core/ops/det/__init__.py:67: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
  data = np.fromstring(data, np.uint8)
W0823 15:02:26.383855 95 rnn_op.cu.cc:404] If the memory space of the Input WeightList is not continuous, less efficient calculation will be called. Please call flatten_parameters() to make the input memory continuous.
/usr/local/lib/python3.6/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib/python3.6/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
W0824 01:16:52.890378 97 operator.cc:242] conv2d raises an exception paddle::memory::allocation::BadAlloc, ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 2.985718GB memory on GPU 0, 14.471985GB memory has been allocated and available memory is only 1.309753GB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.

(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79)
W0824 01:17:29.489935 97 operator.cc:242] conv2d raises an exception paddle::memory::allocation::BadAlloc, ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 2.985718GB memory on GPU 0, 14.471985GB memory has been allocated and available memory is only 1.309753GB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.

(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79)
W0824 01:17:54.199105 97 operator.cc:242] conv2d raises an exception paddle::memory::allocation::BadAlloc, ResourceExhaustedError:
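(As an aside, the DeprecationWarning at the top of that log comes from np.fromstring; this is a minimal sketch of the replacement the warning itself suggests. decode_image and the cv2 decode step are illustrative, not the project's actual preprocessing code.)

import cv2
import numpy as np

def decode_image(data: bytes) -> np.ndarray:
    # np.frombuffer is the drop-in replacement for the deprecated
    # binary-mode np.fromstring(data, np.uint8).
    buf = np.frombuffer(data, dtype=np.uint8)
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)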

Juruobudong avatar Aug 24 '22 04:08 Juruobudong

Enable GPU memory optimization, and add a resize step to the input image preprocessing; that should resolve the problems above.
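(A minimal sketch of such a resize step, assuming OpenCV is used on the client side before the image is sent; limit_longest_side and the 1600-pixel cap are illustrative choices, not values prescribed by Serving.)

import cv2
import numpy as np

def limit_longest_side(img: np.ndarray, max_side: int = 1600) -> np.ndarray:
    """Downscale the image so its longest side does not exceed max_side."""
    h, w = img.shape[:2]
    scale = max_side / float(max(h, w))
    if scale >= 1.0:
        # Already small enough, keep the original resolution.
        return img
    new_w, new_h = int(w * scale), int(h * scale)
    # An 8000*6000 scan would be reduced to 1600*1200 before it reaches the det/rec models.
    return cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_AREA)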

xiulianzw avatar Aug 24 '22 07:08 xiulianzw

Enable GPU memory optimization, and add a resize step to the input image preprocessing; that should resolve the problems above.

OK, many thanks. Where is GPU memory optimization enabled: in the op initialization, as a PaddleServing configuration option, or through PaddlePaddle's FLAGS environment variables?
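(For reference on the FLAGS route mentioned in the question: PaddlePaddle's allocator-related FLAGS can be exported as environment variables before the Serving process starts. Whether they help with idle GPU memory depends on the Paddle version, so treat this as a hedged sketch; the config.yml switch described in the next reply is the route the maintainer recommends.)

import os

# These allocator FLAGS exist in PaddlePaddle and are read from the environment at startup;
# their exact effect on idle GPU memory depends on the Paddle version.
os.environ["FLAGS_fraction_of_gpu_memory_to_use"] = "0.3"  # cap the initial GPU memory pool
os.environ["FLAGS_eager_delete_tensor_gb"] = "0.0"         # enable eager garbage collection

# Import paddle only after the FLAGS are set so they are picked up.
import paddle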

Juruobudong avatar Aug 24 '22 08:08 Juruobudong

In the config.yml file, set mem_optim: True; add it under both the text detection and the text recognition ops.

xiulianzw avatar Aug 24 '22 09:08 xiulianzw

In the config.yml file, set mem_optim: True; add it under both the text detection and the text recognition ops.

I added it to config.yml as shown below, but it does not seem to take effect: with no requests running, GPU memory usage is still very high, 11163MiB / 16160MiB.

config.yml

# rpc port. rpc_port and http_port must not both be empty; when rpc_port is empty and http_port is not,
# rpc_port is automatically set to http_port + 1
rpc_port: 8000
# http port. rpc_port and http_port must not both be empty; when rpc_port is usable and http_port is empty,
# http_port is not generated automatically
http_port: 8800
# worker_num, maximum concurrency. When build_dag_each_worker=True, the framework creates worker_num
# processes, each building its own grpcServer and DAG
## When build_dag_each_worker=False, the framework sets max_workers=worker_num on the main thread's grpc thread pool
worker_num: 10
# build_dag_each_worker: False, the framework builds one DAG inside the process; True, it builds several independent DAGs per process
build_dag_each_worker: False

dag:
    # op resource type: True for the thread model, False for the process model
    is_thread_op: True
    # retry count
    retry: 10
    # profiling: True generates Timeline performance data (with some overhead); False disables it
    use_profile: False
    tracer:
        interval_s: 10
op:
    cla:
        # concurrency: thread concurrency when is_thread_op=True, otherwise process concurrency
        concurrency: 1
        # when the op config has no server_endpoints, the local service config is read from local_service_conf
        local_service_conf:
            # client type: brpc, grpc or local_predictor. local_predictor does not start a Serving service and predicts in-process
            client_type: local_predictor
            # det model path
            model_config: ./models/idcard/ppcls_model_serving
            # fetch list, using the alias_name of fetch_var in client_config
            fetch_list: ["save_infer_model/scale_0.tmp_1"]
            devices_type: 1
            # device IDs: CPU prediction when devices is "" or omitted; "0" or "0,1,2" selects the GPU cards to use
            devices: "0"
            ir_optim: True
            # GPU/CPU memory optimization switch, default False
            mem_optim: True
    det:
        # concurrency: thread concurrency when is_thread_op=True, otherwise process concurrency
        concurrency: 1
        # when the op config has no server_endpoints, the local service config is read from local_service_conf
        local_service_conf:
            # client type: brpc, grpc or local_predictor. local_predictor does not start a Serving service and predicts in-process
            client_type: local_predictor
            # det model path
            model_config: ./models/general/ppocr_det_server_2.0_serving
            # fetch list, using the alias_name of fetch_var in client_config
            fetch_list: ["save_infer_model/scale_0.tmp_1"]
            # device type: decided by devices (CPU/GPU) when omitted; 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 1
            # device IDs: CPU prediction when devices is "" or omitted; "0" or "0,1,2" selects the GPU cards to use
            devices: "0"
            # precision: lowering prediction precision can speed up prediction
            # GPU supports: "fp32" (default), "fp16", "int8";
            # CPU supports: "fp32" (default), "fp16", "bf16" (mkldnn); "int8" is not supported
            precision: "fp32"

            # ir_optim switch, default False
            ir_optim: True
            # GPU/CPU memory optimization switch, default False
            mem_optim: True
    rec:
        # concurrency: thread concurrency when is_thread_op=True, otherwise process concurrency
        concurrency: 1
        # timeout in ms
        timeout: -1
        # Serving retry count, no retries by default
        retry: 1
        # when the op config has no server_endpoints, the local service config is read from local_service_conf
        local_service_conf:
            # client type: brpc, grpc or local_predictor. local_predictor does not start a Serving service and predicts in-process
            client_type: local_predictor
            # rec model path
            model_config: ./models/general/ppocr_rec_server_2.0_serving
            # fetch list, using the alias_name of fetch_var in client_config
            fetch_list: ["save_infer_model/scale_0.tmp_1"]
            # device type: decided by devices (CPU/GPU) when omitted; 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 1
            # device IDs: CPU prediction when devices is "" or omitted; "0" or "0,1,2" selects the GPU cards to use
            devices: "0"
            # precision: lowering prediction precision can speed up prediction
            # GPU supports: "fp32" (default), "fp16", "int8";
            # CPU supports: "fp32" (default), "fp16", "bf16" (mkldnn); "int8" is not supported
            precision: "fp32"

            # ir_optim switch, default False
            ir_optim: True
            # GPU/CPU memory optimization switch, default False
            mem_optim: True
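(To watch whether idle memory actually drops once mem_optim is in effect, here is a small monitoring sketch, assuming the pynvml package is installed on the host. It simply reproduces the 11163MiB / 16160MiB style reading over time while the service sits idle.)

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, the card selected in config.yml

# Print used/total GPU memory once a minute while no requests are running.
for _ in range(10):
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"{info.used // 1024**2}MiB / {info.total // 1024**2}MiB")
    time.sleep(60)

pynvml.nvmlShutdown()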

Juruobudong avatar Aug 25 '22 01:08 Juruobudong

From my testing, memory optimization only kicks in once the memory blocks are already full. Your GPU memory usage does look rather high, though; have you set a maximum size for the longest side of the image in preprocessing?

xiulianzw avatar Aug 25 '22 10:08 xiulianzw

So is the fix to run both models on the same card, or to call some API to release GPU memory manually?

We do detection and recognition of expense bills, and the images are usually around 2k*3k, so GPU memory usage is fairly high. By the way, would resizing hurt accuracy?

Juruobudong avatar Aug 25 '22 11:08 Juruobudong