Catch OSError from rmtree to avoid memory leak
We are seeing errors like this:
==== detail start, at 20211220.032154.893 ====
Traceback (most recent call last):
  File "/data/projects/fate/eggroll/python/eggroll/core/utils.py", line 146, in wrapper
    return func(*args, **kw)
  File "/data/projects/fate/eggroll/python/eggroll/roll_pair/egg_pair.py", line 247, in run_task
    shutil.rmtree(path)
  File "/opt/app-root/lib64/python3.6/shutil.py", line 486, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/opt/app-root/lib64/python3.6/shutil.py", line 424, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/opt/app-root/lib64/python3.6/shutil.py", line 428, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/opt/app-root/lib64/python3.6/shutil.py", line 426, in _rmtree_safe_fd
    os.rmdir(name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: '0'
This leads to a memory leak in our container: each cleanup job in eggpair cannot finish, and those processes just keep retrying. So I catch the OSError and log it instead of failing on the error.
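For reviewers, here is a minimal sketch of the idea, not the exact diff in this PR. The helper name `remove_dir_best_effort` and the logger setup are hypothetical and only for illustration; the real change is made around the `shutil.rmtree(path)` call in `run_task` in egg_pair.py.

```python
import logging
import shutil

# Hypothetical logger for this sketch; eggroll has its own logging setup.
logger = logging.getLogger(__name__)


def remove_dir_best_effort(path):
    """Try to delete the directory tree at `path`.

    Returns True if the tree was removed, False if rmtree raised OSError.
    The error is logged rather than propagated, so the caller can finish.
    """
    try:
        shutil.rmtree(path)
        return True
    except OSError as e:
        # Covers e.g. "[Errno 39] Directory not empty" as in the traceback above.
        logger.warning("failed to remove %s: %s", path, e)
        return False
```

With this shape the cleanup task returns normally even when the directory cannot be fully removed, so it is not retried forever. The trade-off is that leftover files may stay on disk, which is why the error is still logged.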
Fixes ISSUE#xxx
Changes:
PS: In a k8s environment, our nodemanager container can grow as large as 2.5G. This also happens in a docker-compose environment, but it is not as bad as on k8s.
The path to be deleted is: /data/projects/fate/eggroll/data/IN_MEMORY/2021122020272213924715_sir_0_0_guest_20001
ping?
ping
@guojiex Thank you for your Pull Request and the detailed error description!
I apologize for the delayed response. Your fix indeed makes sense. From the stack trace you provided, we can see that shutil.rmtree may fail due to the directory not being empty in certain scenarios. In such cases, rather than letting the task fail outright, it's more reasonable to catch the OSError and log it without failing on the error.
Your approach can help prevent memory leaks in our container, and it's vital for ensuring the stability of our system.
Before we can merge your PR, I must ask you to fix the DCO (Developer Certificate of Origin). This is required by the community contribution guidelines for Linux Foundation projects. You can fix the DCO by running the following commands:
git rebase HEAD~5 --signoff
git push --force-with-lease origin patch-1
Once you have completed these steps, please notify us in this PR, and we will review and merge it as soon as possible.
Thanks again for your contribution!