PaddleCustomDevice

[intel_gpu] Memory leak when running RN50

Open KimBioInfoStudio opened this issue 2 years ago • 7 comments

We ran RN50 with GLOG_v=10 and found that Paddle keeps allocating memory without ever deallocating it, which eventually leads to an out-of-memory error.

RN50: https://github.com/PaddlePaddle/PaddleClas/tree/f820473d1d4d5174e57a5a6b08a42f672eb13390
cmd: python ./PaddleClas/tools/train.py -c ./PaddleClas/ppcls/configs/ImageNet/ResNet/ResNet50.yaml

KimBioInfoStudio, Jul 07 '23 08:07

Is this host memory or device memory? Paddle reuses memory internally.

ronny1996, Jul 10 '23 02:07

> Is this host memory or device memory? Paddle reuses memory internally.

Device memory. From what I can see on my side, Paddle never calls Deallocate.

KimBioInfoStudio, Jul 10 '23 02:07

Memory is not deallocated during training. You can try export FLAGS_allocator_strategy=naive_best_fit and make max_chunk_size in the plugin's runtime.cc always return 0.
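A minimal sketch of the environment half of this suggestion, assuming the training command from the original report. The helper names build_training_env and launch are hypothetical, and the max_chunk_size change still has to be made separately in the plugin's runtime.cc; this only sets the flag before the process starts.

```python
import os
import subprocess

# Training command taken from the original report.
TRAIN_CMD = [
    "python",
    "./PaddleClas/tools/train.py",
    "-c",
    "./PaddleClas/ppcls/configs/ImageNet/ResNet/ResNet50.yaml",
]


def build_training_env(base_env=None):
    """Return a copy of the environment with the suggested flags set.

    build_training_env is a hypothetical helper name, not part of Paddle.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Allocator strategy suggested in this thread.
    env["FLAGS_allocator_strategy"] = "naive_best_fit"
    # Verbose logging, as used in the original report.
    env["GLOG_v"] = "10"
    return env


def launch():
    """Run the training command with the modified environment (hypothetical helper)."""
    return subprocess.run(TRAIN_CMD, env=build_training_env(), check=True)
```

Calling launch() from the directory that contains PaddleClas would start training with both variables set. The flag has to be in the environment before Paddle is imported, which is why it is passed to the child process instead of being set inside the training script.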

ronny1996, Jul 10 '23 05:07

> Memory is not deallocated during training. You can try export FLAGS_allocator_strategy=naive_best_fit and make max_chunk_size in the plugin's runtime.cc always return 0.

According to https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/backends/custom/custom_device.cc#L486 the default MaxChunkSize is 0, and we have not implemented DeviceMaxChunkSize in the intel_gpu runtime, so I think it is already 0.

KimBioInfoStudio, Jul 10 '23 06:07

The default max_chunk_size is not 0. You can run with GLOG_v=10 and check the output of VLOG(10) << Type() << " max alloc size " << (max_alloc_size >> 20) << "M";
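With GLOG_v=10 the output is very long, so a small filter can help locate this message. The sketch below assumes only the message tail implied by the VLOG statement quoted above (device type, then " max alloc size ", then a number followed by "M"); the sample line in the usage note is made up for illustration, and the function names are hypothetical.

```python
import re

# Matches the tail written by:
#   VLOG(10) << Type() << " max alloc size " << (max_alloc_size >> 20) << "M";
MAX_ALLOC_RE = re.compile(r"(\S+) max alloc size (\d+)M")


def find_max_alloc_sizes(log_text):
    """Return (device_type, size_in_mib) pairs found in the log text."""
    return [(m.group(1), int(m.group(2))) for m in MAX_ALLOC_RE.finditer(log_text)]


def bytes_to_mib(num_bytes):
    """Mirror the `>> 20` in the log statement: bytes to whole MiB."""
    return num_bytes >> 20
```

For example, find_max_alloc_sizes("intel_gpu max alloc size 1024M") yields one pair, ("intel_gpu", 1024), and bytes_to_mib(1 << 30) gives 1024. A non-zero value in the real log would confirm that the default chunk size is not 0.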

ronny1996, Jul 10 '23 07:07

> The default max_chunk_size is not 0. You can run with GLOG_v=10 and check the output of VLOG(10) << Type() << " max alloc size " << (max_alloc_size >> 20) << "M";

https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/backends/custom/custom_device.cc#L492 That log statement sits inside the if block and was not printed, so I need to build a modified version myself. WIP.

KimBioInfoStudio, Jul 12 '23 03:07

Further investigation suggests that the device memory blowup may happen in the eval stage. With eval_during_train: False (https://github.com/PaddlePaddle/PaddleClas/blob/f820473d1d4d5174e57a5a6b08a42f672eb13390/ppcls/configs/ImageNet/ResNet/ResNet50.yaml#L8), no related OOM is seen.
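One way to reproduce this comparison without editing the YAML file is a command-line override. The sketch below assumes that PaddleClas's train.py accepts -o key=value overrides and that eval_during_train lives under the Global section of ResNet50.yaml; both should be checked against the pinned commit. with_overrides is a hypothetical helper name.

```python
# Training command taken from the original report.
BASE_CMD = [
    "python",
    "./PaddleClas/tools/train.py",
    "-c",
    "./PaddleClas/ppcls/configs/ImageNet/ResNet/ResNet50.yaml",
]


def with_overrides(cmd, overrides):
    """Return a new command list with one `-o key=value` pair per override.

    Assumption: the training script accepts `-o` overrides in this form.
    """
    out = list(cmd)
    for key, value in overrides.items():
        out += ["-o", f"{key}={value}"]
    return out


# Command for a run with evaluation during training switched off.
NO_EVAL_CMD = with_overrides(BASE_CMD, {"Global.eval_during_train": False})
```

Running NO_EVAL_CMD and BASE_CMD back to back, with everything else unchanged, would show whether the OOM only appears when the eval stage is enabled.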

KimBioInfoStudio, Aug 08 '23 02:08

Hello, has this issue been resolved? Thank you!

qili93, May 22 '24 02:05

@qili93 Intel GPU support for Paddle is paused due to policy and market changes. I guess we can't be sure until the next-gen Falcon Shores GPU.

KimBioInfoStudio, May 23 '24 02:05