WMCore
Switch to psutil for resource utilization monitoring at runtime
Fixes #11667
Status
not-tested
Description
The issue https://github.com/dmwm/WMCore/issues/11667 can be solved in three different ways. This is the third of the three suggested fixes.
- The current one completely avoids running any shell commands at runtime. Instead, it relies on the fact that the `psutil` library is already distributed with `cmssw`, and uses it directly to fetch the step's resource statistics inside the `PerformanceMonitor.py` module itself.
This approach requires more dramatic changes to the code, but it also achieves:
- code improvement
- avoiding dangerous and cumbersome shell command line execution at runtime
- better resource logging: we no longer need the standard Linux `%cpu` and `%mem` fields, which are difficult to interpret at first glance (they are resource utilization ratios, e.g. the cputime/realtime ratio, expressed as a percentage). We now log the much more informative `system cputime` and `virtual memory` for the process.
This would require more detailed testing though.
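As a rough illustration of the approach described above, the sketch below reads CPU time and memory from a `psutil.Process`-like object instead of parsing shell output. It is not the code from this PR: the names `StepUsage`, `sample_usage` and `sample_current_process` are hypothetical, and only the `psutil` calls `Process()`, `cpu_times()` and `memory_info()` are assumed.

```python
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
StepUsage = namedtuple(
    "StepUsage",
    ["pid", "system_cputime", "user_cputime", "rss_bytes", "vms_bytes"],
)


def sample_usage(proc):
    """Read CPU time and memory from one psutil.Process-like object."""
    cpu = proc.cpu_times()    # named tuple with .user and .system, in seconds
    mem = proc.memory_info()  # named tuple with .rss and .vms, in bytes
    return StepUsage(proc.pid, cpu.system, cpu.user, mem.rss, mem.vms)


def sample_current_process():
    """Sample the calling process.

    psutil is imported here rather than at module level so that the rest
    of the sketch can be exercised without it (an assumption of this
    example, not a statement about how the PR imports it).
    """
    import psutil
    return sample_usage(psutil.Process())
```

Passing the process object in as an argument keeps the sampling logic independent of how the monitored step's PID is discovered.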
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
No
Jenkins results:
- Python3 Unit tests: succeeded
- 1 tests no longer failing
- 1 changes in unstable tests
- Python3 Pylint check: failed
- 7 warnings and errors that must be fixed
- 2 warnings
- 15 comments to review
- Pylint py3k check: succeeded
- Pycodestyle check: succeeded
- 2 comments to review
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14374/artifact/artifacts/PullRequestReport.html
Jenkins results:
- Python3 Unit tests: succeeded
- 1 tests no longer failing
- Python3 Pylint check: failed
- 5 warnings and errors that must be fixed
- 2 warnings
- 15 comments to review
- Pylint py3k check: succeeded
- Pycodestyle check: succeeded
- 2 comments to review
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14375/artifact/artifacts/PullRequestReport.html
Hi @smuzaffar,
In the context of the current PR, since we need to import the psutil module at runtime, we need some more information about its presence in CMSSW. Can you tell us which is the earliest CMSSW release providing it, i.e. can we rely on it if a workflow using a 4-5 year old release tries to load it?
Jenkins results:
- Python3 Unit tests: succeeded
- 1 tests no longer failing
- 1 changes in unstable tests
- Python3 Pylint check: failed
- 5 warnings and errors that must be fixed
- 2 warnings
- 15 comments to review
- Pylint py3k check: succeeded
- Pycodestyle check: succeeded
- 2 comments to review
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14376/artifact/artifacts/PullRequestReport.html
@todor-ivanov, I see that it is available in the CMSSW_9_0_0 release, which is 6 years old (Mar 2017). So it should be available in all releases from CMSSW_9_0_0 onwards, but its version might differ between releases.
Thanks @smuzaffar, where could I check the psutil versions included in the different CMSSW releases? I am testing the few methods from psutil we are using, to check whether they are backwards compatible. I started with a really old version like 3.0.0, but it seems they are not. So I need to double check how far back in psutil versions we may go and map that to the relevant CMSSW release.
You can create a CMSSW dev area for CMSSW_9_0_0, set the environment, and then import psutil to check the version. This should be the minimum psutil version you need to support. You can do the same for the latest CMSSW_13_2_0_pre2 release and check the psutil version there.
Note that if you take psutil from CMSSW, things can break: future releases will move to newer versions of psutil, which might not be compatible with your code.
Jenkins results:
- Python3 Unit tests: succeeded
- 1 tests no longer failing
- 3 changes in unstable tests
- Python3 Pylint check: failed
- 5 warnings and errors that must be fixed
- 2 warnings
- 15 comments to review
- Pylint py3k check: succeeded
- Pycodestyle check: succeeded
- 2 comments to review
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14379/artifact/artifacts/PullRequestReport.html
Thanks @smuzaffar!
I just created a CMSSW_9_0_0 dev area and checked the version: [1]. So the earliest psutil version available is 5.0.1, which is good for us. I also made a thorough history check of all the methods from psutil we are using in this module (log file included as an attachment to the current comment: psutil.vtest.log). All the methods we need are available from version 5.0.0 and have not changed their signatures since then. So I'd say that as long as we do not have workflows in the system going below CMSSW_9_0_0, we are safe.
As for the future, you are correct. We will have to think of a way to have this validated if the version changes. One thing is for sure: it won't go unnoticed in production, because jobs from workflows which had the misfortune to include a new and backwards-incompatible version of psutil will be killed en masse by the watchdog. And even though it may seem unlikely, it is still a third-party library which none of us maintains, so the possibility is indeed nonzero.
@amaltaro I'd be glad to hear your opinion on this.
[1]
$ source /cvmfs/cms.cern.ch/cmsset_default.sh
$ export SCRAM_ARCH=slc7_amd64_gcc630
$ cmsrel CMSSW_9_0_0
$ cd CMSSW_9_0_0/src/
$ cmsenv
$ python -c "import psutil; print(str(psutil.version_info))"
(5, 0, 1)
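One possible way to validate the psutil version at runtime, as discussed above, is a simple guard against the minimum version found in the history check (5.0.0). This is only a sketch: the names `MIN_PSUTIL_VERSION`, `psutil_is_supported` and `check_installed_psutil` are hypothetical, and a lower bound alone cannot detect a future incompatible release.

```python
# Minimum version taken from the discussion above; treat it as an assumption.
MIN_PSUTIL_VERSION = (5, 0, 0)


def psutil_is_supported(version_info, minimum=MIN_PSUTIL_VERSION):
    """Return True if a psutil version tuple meets the minimum."""
    return tuple(version_info) >= tuple(minimum)


def check_installed_psutil():
    """Check the psutil found in the current environment, if any."""
    try:
        import psutil
    except ImportError:
        # No psutil at all, e.g. a release older than CMSSW_9_0_0.
        return False
    return psutil_is_supported(psutil.version_info)
```

A monitor could call `check_installed_psutil()` once at startup and log a clear message instead of failing later on a missing method.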
Jenkins results:
- Python3 Unit tests: failed
- 1 new failures
- 1 tests no longer failing
- 3 changes in unstable tests
- Python3 Pylint check: failed
- 5 warnings and errors that must be fixed
- 2 warnings
- 16 comments to review
- Pylint py3k check: succeeded
- Pycodestyle check: succeeded
- 3 comments to review
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14380/artifact/artifacts/PullRequestReport.html
Please consider the following: if the job is killed because it exceeds the memory (or time) limit, dump as much info as possible, in particular
/proc/
Can one of the admins verify this patch?