WMAgent: New renew-proxy method left with 0 days window for proxy validity
Impact of the bug
WMAgent
Describe the bug
While working on the validation of the new deployment methods (https://github.com/dmwm/WMCore/issues/11945), people started reporting a quite high frequency of proxy-expiration errors from central services for their agents. People also report some minor issues with the manage renew-proxy
command, like a Permission denied when renewing an expired proxy:
(WMAgent.venv3-2.3.4rc1) cmst1@vocms0290:WMAgent.venv3-2.3.4rc1 $ manage renew-proxy
_renew_proxy: Checking Certificate lifetime:
_renew_proxy: Certificate end date: Nov 1 17:41:15 2024 GMT
_renew_proxy: Checking myproxy lifetime:
_renew_proxy: myproxy end date: May 22 08:56:39 2024 GMT
MyProxy v6.2 Jan 2024 PAM SASL KRB5 LDAP VOMS OCSP
Attempting to connect to 2001:1458:d00:4a::100:480:7512
Successfully connected to myproxy.cern.ch:7512
using trusted certificates directory /etc/grid-security/certificates
Using Host cert file (/data/WMAgent.venv3-2.3.4rc1/certs/servicecert.pem), key file (/data/WMAgent.venv3-2.3.4rc1/certs/servicekey.pem)
server name: /DC=ch/DC=cern/OU=computers/CN=px501.cern.ch
checking that server name is acceptable...
server name matches "myproxy.cern.ch"
authenticated server name is acceptable
A credential has been received for user amaltaro in /data/WMAgent.venv3-2.3.4rc1/certs/mynewproxy.pem.
Contacting voms-cms-auth.app.cern.ch:443 [/DC=ch/DC=cern/OU=computers/CN=cms-auth.web.cern.ch] "cms"...
Remote VOMS server contacted succesfully.
Error creating proxy certificate: /data/WMAgent.venv3-2.3.4rc1/certs/myproxy.pem (Permission denied)
_renew_proxy: ERROR: Failed to renew expired myproxy
This was a long-suspected issue. When creating the function, we set a myproxyMinLifetime of 7*24*60*60 (i.e. 7 days expressed in seconds):
https://github.com/dmwm/CMSKubernetes/blob/51d74e64b1a2c8731a6d5411751437776de98356/docker/pypi/wmagent/bin/manage-common.sh#L316
which is exactly equal to the myproxyLifetime issued a few lines above: https://github.com/dmwm/CMSKubernetes/blob/51d74e64b1a2c8731a6d5411751437776de98356/docker/pypi/wmagent/bin/manage-common.sh#L284
This leaves a 0-day window for deeming the proxy still valid before renewing it, meaning we renew those proxy certificates at whatever rate the cronjob is executed inside the container.
And the Permission denied issue is due to the extra restrictive file permissions we set for this proxy (i.e. 400 instead of 600) here: https://github.com/dmwm/CMSKubernetes/blob/51d74e64b1a2c8731a6d5411751437776de98356/docker/pypi/wmagent/bin/manage-common.sh#L320
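To make the mismatch explicit, here is a schematic sketch of the two problems; the variable names are simplified and not necessarily the exact ones used in manage-common.sh:
# sketch only, simplified names
myproxyLifetime=$((7*24*60*60))      # proxy is issued with 7 days of lifetime (in seconds)
myproxyMinLifetime=$((7*24*60*60))   # ...and is deemed too short below 7 days as well,
                                     # so the renewal window is 168h - 168h = 0h
chmod 400 myproxy.pem                # read-only even for the owner, so a later attempt to
                                     # overwrite the expired proxy fails with Permission denied
                                     # (600 would allow the overwrite)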
How to reproduce it
Try the following set of commands in an agent (be it docker or virtual env):
docker exec -it wmagent bash
manage renew-proxy
...
manage renew-proxy
You will see that the proxy renewal is retried every time.
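A quick way to confirm that the proxy file is re-created on every invocation (a sketch; the proxy path should be adjusted to your deployment, e.g. the .../certs/myproxy.pem path shown in the log above):
proxyFile=/data/certs/myproxy.pem              # adjust to your deployment
stat -c '%y' "$proxyFile"                      # note the current modification time
openssl x509 -noout -enddate -in "$proxyFile"  # still roughly 7 days away from expiration
manage renew-proxy
stat -c '%y' "$proxyFile"                      # the file was nevertheless re-created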
Expected behavior
To renew proxy only 48 hours before its expiration
Additional context and error message
Issue https://github.com/dmwm/WMCore/issues/11945 depends on the resolution of the current one, which makes it a chained dependency for the WMAgent new deployment model meta-issue: https://github.com/dmwm/WMCore/issues/11314
While working on this solution, I also noticed that the cronjobs are missing altogether in the current containers:
(WMAgent-2.3.3) [cmst1@vocms0260:current]$ crontab -l
no crontab for cmst1
I think this happened when we moved to dynamically setting the user at runtime and stopped exporting the $WMA_USER
environment variable, which by itself broke this step in the init process: https://github.com/dmwm/CMSKubernetes/blob/51d74e64b1a2c8731a6d5411751437776de98356/docker/pypi/wmagent/init.sh#L579
(set_cronjob) || { err=$?; echo "ERROR: set_cronjob"; exit $err ;}
How this went unnoticed is a different story. I am fixing it within the very same PR, rather than creating yet another bug issue just for it.
FYI @amaltaro
Ok, for the last comment I was partially correct, and partially wrong.
The reason for the missing cronjobs is indeed the not-exported $WMA_USER
environment variable... but we do actually export this variable from run.sh: https://github.com/dmwm/CMSKubernetes/blob/51d74e64b1a2c8731a6d5411751437776de98356/docker/pypi/wmagent/run.sh#L12 :
export WMA_USER=$wmaUser
This means that if init.sh is executed from run.sh, it properly gets the variable, since it inherits it from run.sh's environment; but if somebody logs into a non-running agent and executes:
docker exec -it wmagent bash
$WMA_ROOT/init.sh
(which is exactly what I was doing), then the cronjobs are not populated. The immediate workaround is to restart the container and let it populate them properly (an alternative manual workaround is sketched after the log below). The result is the correct set of cronjobs at the end:
cmst1@vocms0260:wmagent $ docker kill wmagent
cmst1@vocms0260:wmagent $ ./wmagent-docker-run.sh -t 2.3.4rc3 && docker logs -f wmagent
Checking if there is no other wmagent container running and creating a link to the 2.3.4rc3 in the host mount area.
Starting wmagent:2.3.4rc3 docker container with user: cmst1:zh
b57fe52c18ac3947eac211c5fb60d6bfc71c94e03865ff2688b9a32d99d55d52
Running WMAgent container with user: cmst1 (ID: 31961) and group: zh (ID: 1399)
Setting up bashrc for user: cmst1 under home directory: /home/cmst1
Start initialization
=======================================================
Starting WMAgent with the following initialisation data:
-------------------------------------------------------
- WMAgent Version : 2.3.4rc3
- WMAgent User : cmst1
- WMAgent Root path : /data
- WMAgent Host : vocms0260.cern.ch
- WMAgent TeamName : testbed-vocms0260
- WMAgent Number : 0
- WMAgent Relational DB type : mysql
- Python Version : Python 3.8.16
- Python Module path : /usr/local/lib/python3.8/site-packages
=======================================================
...
Done: Performing start_agent
-------------------------------------------------------
Start sleeping now ...zzz...
cmst1@vocms0260:wmagent $ docker exec -it wmagent bash
(WMAgent-2.3.4rc3) [cmst1@vocms0260:current]$ crontab -l
55 */12 * * * /data/srv/wmagent/2.3.4rc3/config/manage renew-proxy
58 */12 * * * python /usr/local/deploy/checkProxy.py --proxy /data/certs/myproxy.pem --time 120 --send-mail True --mail [email protected]
*/15 * * * * source /usr/local/deploy/restartComponent.sh > /dev/null
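For completeness, a hypothetical manual alternative to restarting the whole container would be to export the variable by hand before calling init.sh (this assumes the rest of the WMAgent environment is already provided by the container shell):
docker exec -it wmagent bash
export WMA_USER=$(id -un)    # what run.sh would normally export as $wmaUser
$WMA_ROOT/init.sh            # set_cronjob can now resolve the user and populate the crontab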
To renew proxy only 48 hours before its expiration
@todor-ivanov I don't think this is the actual expected behavior. We cannot let proxies get so close to their expiration, otherwise any pilot/job that runs beyond the standard time can have issues (in addition to potential issues renewing this proxy with the voms server).
The current setup https://github.com/dmwm/WMCore/blob/master/deploy/renew_proxy.sh renews the proxy for 7 days and runs every 12h. Please keep this behavior.
@amaltaro We do not change the length of the proxy; it stays 168 hours. What we change here is the time window within which we deem the proxy close to its end of lifetime and must start renewing it.
What currently happens is the following (see the sketch after this list):
- we issue a proxy which is 168 hours long
- we set a minimum lifetime for the proxy of 7*24 = 168 hours
- whenever we run the command, we check whether the remaining lifetime of the proxy is longer than that minimum, which leaves a window of 168 - 168 = 0 hours
- so we always fail this check, and every run of the command retries the renewal.
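In shell terms the check boils down to something like the following simplified sketch (not the actual _renew_proxy code; the file path is illustrative):
proxyFile=/data/certs/myproxy.pem
myproxyMinLifetime=$((7*24*60*60))        # 168h, identical to the issued lifetime
endDate=$(openssl x509 -noout -enddate -in "$proxyFile" | cut -d= -f2)
remainingSec=$(( $(date -d "$endDate" +%s) - $(date +%s) ))
if [ "$remainingSec" -gt "$myproxyMinLifetime" ]; then
    echo "proxy still valid long enough, nothing to do"   # never reached: the remaining lifetime
else                                                      # is always below 168h by construction
    echo "renewing proxy ..."
fi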
Last time you made exactly the same comment and I did not object, even though I was already foreseeing such behavior, but was not yet sure. Now we can actually see it: the proxy is constantly renewed. Since we now have the local variables set (take a look at the PR), we can safely widen or shorten this window as we want. Before, it was rather obscure what was going on.
BTW, the PR contains a few more fixes.
I don't think we need to check for proxy length, as this is already monitored by the AgentStatusWatcher component. For the agent deployment/run, I think we can simplify all of this and simply rely on the cronjob for proxy renewal (not on monitoring the proxy lifetime, as previously mentioned).
hi @amaltaro, but is AgentStatusWatcher updating the proxy if it is expired, or is it just throwing an alarm?
What I did with my latest commit: https://github.com/dmwm/CMSKubernetes/pull/1476/commits/01aade3e560e68c0a0841fd0706ed4709604fb15 is:
- Set the minimum proxy lifetime to 156 hours, which is exactly the proxy lifetime of 168 hours minus the 12 hours you mentioned. Any proxy shorter than that will be renewed.
- Increased the cronjob rate to every hour, so that renewal retries happen more frequently. The function takes no action if the proxy is longer than 156 hours, but with the cronjob we check for that every hour. I think this is the safer way (an example cronjob entry is sketched below).
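For illustration, the corresponding crontab entry would look roughly like this (keeping the minute field of the existing 12-hourly entry):
55 * * * * $WMA_MANAGE_DIR/manage renew-proxy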
I am also fixing yet another issue, which we found while discussing the current one.
When we execute the wmagent-couchapp-init
command during the initialization: https://github.com/dmwm/CMSKubernetes/blob/c2b643cb95e502c7773e16ab6953c9b920476e60/docker/pypi/wmagent/bin/manage#L153C1-L153C26, we create yet another set of CouchDB-related cron jobs, which look like this:
* * * * * echo 'http://***:***@localhost:5984/wmagent_jobdump\%2Fjobs/_design/JobDump/_view/statusByWorkflowName?limit=1' | sed -e 's|\\||g' | xargs curl -s -m 1 > /dev/null
* * * * * echo 'http://***:***@localhost:5984/wmagent_jobdump\%2Ffwjrs/_design/FWJRDump/_view/outputByJobID?limit=1' | sed -e 's|\\||g' | xargs curl -s -m 1 > /dev/null
Unfortunately, we later blindly wipe them out when executing the following command: https://github.com/dmwm/CMSKubernetes/blob/c2b643cb95e502c7773e16ab6953c9b920476e60/docker/pypi/wmagent/init.sh#L305-L309:
crontab -u $WMA_USER - <<EOF
55 */12 * * * $WMA_MANAGE_DIR/manage renew-proxy
58 */12 * * * python $WMA_DEPLOY_DIR/deploy/checkProxy.py --proxy /data/certs/myproxy.pem --time 120 --send-mail True --mail [email protected]
*/15 * * * * source $WMA_DEPLOY_DIR/deploy/restartComponent.sh > /dev/null
EOF
So I had to preserve the already existing cronjobs before adding the new ones.
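A minimal sketch of that idea (not necessarily the exact code in the PR): read back whatever is already installed, append the WMAgent entries, and only then feed the combined list to crontab:
(
    crontab -u $WMA_USER -l 2>/dev/null        # keep the CouchDB cronjobs created earlier
    cat <<EOF
55 */12 * * * $WMA_MANAGE_DIR/manage renew-proxy
58 */12 * * * python $WMA_DEPLOY_DIR/deploy/checkProxy.py --proxy /data/certs/myproxy.pem --time 120 --send-mail True --mail [email protected]
*/15 * * * * source $WMA_DEPLOY_DIR/deploy/restartComponent.sh > /dev/null
EOF
) | crontab -u $WMA_USER -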
Yes, AgentStatusWatcher is only monitoring and firing up alerts whenever it is needed.
Our current setup of renewing the proxy every 12h has been in place for almost 10 years and we have never had any issue with it. I keep my position, namely:
- proxies are created with 168h of lifetime
- a cronjob runs every 12h (given that we renew the proxy every run, there is no need to run it more often)
- if a cronjob fails (server is unavailable), it would have to fail 3 times consecutively before we have an alarm (<= 5 days lifetime)
- no need to juggle with bash, output parsing, lifetime calculations, etc., which just adds more code to be maintained.