PaddleCustomDevice
[intel_gpu] mem leak when running RN50
We ran RN50 with GLOG_v=10 and found that Paddle keeps allocating memory without ever deallocating it, which eventually leads to out-of-memory.
RN50: https://github.com/PaddlePaddle/PaddleClas/tree/f820473d1d4d5174e57a5a6b08a42f672eb13390
cmd: python ./PaddleClas/tools/train.py -c ./PaddleClas/ppcls/configs/ImageNet/ResNet/ResNet50.yaml
Is this host memory or device memory? Paddle reuses memory internally.
> Is this host memory or device memory? Paddle reuses memory internally.

Device memory. From what I can see, Paddle never calls Deallocate.
Deallocation does not happen during training. You can try export FLAGS_allocator_strategy=naive_best_fit, and make max_chunk_size in the plugin's runtime.cc always return 0.
According to https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/backends/custom/custom_device.cc#L486, the default MaxChunkSize is 0, and we do not implement DeviceMaxChunkSize in the intel_gpu runtime, so I think it is already 0.
The default max_chunk_size is not 0; you can set GLOG_v=10 and look for this log line:

VLOG(10) << Type() << " max alloc size " << (max_alloc_size >> 20) << "M";
The log at https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/backends/custom/custom_device.cc#L492 sits inside an if branch and never gets printed, so I need to build a modified version myself. WIP
Further investigation suggests the OOM happens during the eval phase: after setting eval_during_train: False in https://github.com/PaddlePaddle/PaddleClas/blob/f820473d1d4d5174e57a5a6b08a42f672eb13390/ppcls/configs/ImageNet/ResNet/ResNet50.yaml#L8, the OOM no longer appears.
Hello, has this issue been resolved? Thanks!
@qili93 Intel GPU support for Paddle is paused due to policy and market changes; I guess we can't be sure before the next-gen Falcon Shores GPU.