WMAgent: New renew-proxy method left with 0 days window for proxy validity
Impact of the bug
WMAgent
Describe the bug
While working on the validation of the new deployment methods (https://github.com/dmwm/WMCore/issues/11945), people started reporting a quite high frequency of proxy-expiration errors from central services for their agents. People also report some minor issues with the manage renew-proxy
command, like a Permission denied when renewing an expired proxy:
(WMAgent.venv3-2.3.4rc1) cmst1@vocms0290:WMAgent.venv3-2.3.4rc1 $ manage renew-proxy
_renew_proxy: Checking Certificate lifetime:
_renew_proxy: Certificate end date: Nov 1 17:41:15 2024 GMT
_renew_proxy: Checking myproxy lifetime:
_renew_proxy: myproxy end date: May 22 08:56:39 2024 GMT
MyProxy v6.2 Jan 2024 PAM SASL KRB5 LDAP VOMS OCSP
Attempting to connect to 2001:1458:d00:4a::100:480:7512
Successfully connected to myproxy.cern.ch:7512
using trusted certificates directory /etc/grid-security/certificates
Using Host cert file (/data/WMAgent.venv3-2.3.4rc1/certs/servicecert.pem), key file (/data/WMAgent.venv3-2.3.4rc1/certs/servicekey.pem)
server name: /DC=ch/DC=cern/OU=computers/CN=px501.cern.ch
checking that server name is acceptable...
server name matches "myproxy.cern.ch"
authenticated server name is acceptable
A credential has been received for user amaltaro in /data/WMAgent.venv3-2.3.4rc1/certs/mynewproxy.pem.
Contacting voms-cms-auth.app.cern.ch:443 [/DC=ch/DC=cern/OU=computers/CN=cms-auth.web.cern.ch] "cms"...
Remote VOMS server contacted succesfully.
Error creating proxy certificate: /data/WMAgent.venv3-2.3.4rc1/certs/myproxy.pem (Permission denied)
_renew_proxy: ERROR: Failed to renew expired myproxy
This was a long-suspected issue. When creating the function, we set a myproxyMinLifetime of 7*24*60*60 (i.e. 7 days expressed in seconds):
https://github.com/dmwm/CMSKubernetes/blob/51d74e64b1a2c8731a6d5411751437776de98356/docker/pypi/wmagent/bin/manage-common.sh#L316
which is exactly equal to the myproxyLifetime issued a few lines above: https://github.com/dmwm/CMSKubernetes/blob/51d74e64b1a2c8731a6d5411751437776de98356/docker/pypi/wmagent/bin/manage-common.sh#L284
This leaves a 0-day window for deeming the proxy still valid before renewing it, meaning we renew those proxy certificates at whatever rate the cronjob is executed inside the container.
And the Permission denied issue is due to the extra restrictive file permissions we set for this proxy (i.e. 400 instead of 600) here: https://github.com/dmwm/CMSKubernetes/blob/51d74e64b1a2c8731a6d5411751437776de98356/docker/pypi/wmagent/bin/manage-common.sh#L320
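To make the mismatch explicit, here is a schematic sketch of the two problems; the variable names are simplified and not necessarily the exact ones used in manage-common.sh:
# sketch only, simplified names
myproxyLifetime=$((7*24*60*60))      # proxy is issued with 7 days of lifetime (in seconds)
myproxyMinLifetime=$((7*24*60*60))   # ...and is deemed too short below 7 days as well,
                                     # so the renewal window is 168h - 168h = 0h
chmod 400 myproxy.pem                # read-only even for the owner, so a later attempt to
                                     # overwrite the expired proxy fails with Permission denied
                                     # (600 would allow the overwrite)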
How to reproduce it
Try the following set of commands in an agent (be it docker or virtual env):
docker exec -it wmagent bash
manage renew-proxy
...
manage renew-proxy
You will see that the proxy renewal is retried every time.
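A quick way to confirm that the proxy file is re-created on every invocation (a sketch; the proxy path should be adjusted to your deployment, e.g. the .../certs/myproxy.pem path shown in the log above):
proxyFile=/data/certs/myproxy.pem              # adjust to your deployment
stat -c '%y' "$proxyFile"                      # note the current modification time
openssl x509 -noout -enddate -in "$proxyFile"  # still roughly 7 days away from expiration
manage renew-proxy
stat -c '%y' "$proxyFile"                      # the file was nevertheless re-created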
Expected behavior
To renew proxy only 48 hours before its expiration
Additional context and error message
Issue https://github.com/dmwm/WMCore/issues/11945 depends on the resolution of the current one, which makes it a chained dependency for the WMAgent new deployment model meta-issue: https://github.com/dmwm/WMCore/issues/11314
While working on this solution, I also noticed that the cronjobs are missing altogether in the current containers:
(WMAgent-2.3.3) [cmst1@vocms0260:current]$ crontab -l
no crontab for cmst1
I think this happened when we moved to dynamically setting the user at runtime and stopped exporting the $WMA_USER
environment variable, which by itself broke this step in the init process: https://github.com/dmwm/CMSKubernetes/blob/51d74e64b1a2c8731a6d5411751437776de98356/docker/pypi/wmagent/init.sh#L579
(set_cronjob) || { err=$?; echo "ERROR: set_cronjob"; exit $err ;}
How this went unnoticed is a different story. I am fixing it within the very same PR, rather than creating yet another bug issue just for it.
FYI @amaltaro
Ok, for the last comment I was partially correct, and partially wrong.
The reason for the missing cronjobs is indeed the not-exported $WMA_USER
environment variable... but we do actually export this variable from run.sh: https://github.com/dmwm/CMSKubernetes/blob/51d74e64b1a2c8731a6d5411751437776de98356/docker/pypi/wmagent/run.sh#L12 :
export WMA_USER=$wmaUser
This means that if init.sh is executed from run.sh, it properly gets the variable, since it inherits it from run.sh's environment; but if somebody logs into a non-running agent and executes:
docker exec -it wmagent bash
$WMA_ROOT/init.sh
(which is exactly what I was doing), then the cronjobs are not populated. The immediate workaround is to restart the container and let it populate them properly (an alternative manual workaround is sketched after the log below). The result is the correct set of cronjobs at the end:
cmst1@vocms0260:wmagent $ docker kill wmagent
cmst1@vocms0260:wmagent $ ./wmagent-docker-run.sh -t 2.3.4rc3 && docker logs -f wmagent
Checking if there is no other wmagent container running and creating a link to the 2.3.4rc3 in the host mount area.
Starting wmagent:2.3.4rc3 docker container with user: cmst1:zh
b57fe52c18ac3947eac211c5fb60d6bfc71c94e03865ff2688b9a32d99d55d52
Running WMAgent container with user: cmst1 (ID: 31961) and group: zh (ID: 1399)
Setting up bashrc for user: cmst1 under home directory: /home/cmst1
Start initialization
=======================================================
Starting WMAgent with the following initialisation data:
-------------------------------------------------------
- WMAgent Version : 2.3.4rc3
- WMAgent User : cmst1
- WMAgent Root path : /data
- WMAgent Host : vocms0260.cern.ch
- WMAgent TeamName : testbed-vocms0260
- WMAgent Number : 0
- WMAgent Relational DB type : mysql
- Python Version : Python 3.8.16
- Python Module path : /usr/local/lib/python3.8/site-packages
=======================================================
...
Done: Performing start_agent
-------------------------------------------------------
Start sleeping now ...zzz...
cmst1@vocms0260:wmagent $ docker exec -it wmagent bash
(WMAgent-2.3.4rc3) [cmst1@vocms0260:current]$ crontab -l
55 */12 * * * /data/srv/wmagent/2.3.4rc3/config/manage renew-proxy
58 */12 * * * python /usr/local/deploy/checkProxy.py --proxy /data/certs/myproxy.pem --time 120 --send-mail True --mail [email protected]
*/15 * * * * source /usr/local/deploy/restartComponent.sh > /dev/null
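For completeness, a hypothetical manual alternative to restarting the whole container would be to export the variable by hand before calling init.sh (this assumes the rest of the WMAgent environment is already provided by the container shell):
docker exec -it wmagent bash
export WMA_USER=$(id -un)    # what run.sh would normally export as $wmaUser
$WMA_ROOT/init.sh            # set_cronjob can now resolve the user and populate the crontab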
To renew proxy only 48 hours before its expiration
@todor-ivanov I don't think this is the actual expected behavior. We cannot let proxies get so close to their expiration, otherwise any pilot/job that runs beyond the standard time can have issues (in addition to potential issues renewing this proxy with the voms server).
The current setup https://github.com/dmwm/WMCore/blob/master/deploy/renew_proxy.sh renews the proxy for 7 days and runs every 12h. Please keep this behavior.
@amaltaro We do not change the length of the proxy; it stays 168 hours. What we change here is the time window within which we deem the proxy close to its end of lifetime and must start renewing it.
What currently happens is the following (see the sketch after this list):
- we issue a proxy which is 168 hours long
- we set a minimum lifetime for the proxy of 7*24 = 168 hours
- whenever we run the command, we check whether the remaining lifetime of the proxy is longer than that minimum, which leaves a window of 168 - 168 = 0 hours
- so we always fail this check, and every run of the command retries the renewal.
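In shell terms the check boils down to something like the following simplified sketch (not the actual _renew_proxy code; the file path is illustrative):
proxyFile=/data/certs/myproxy.pem
myproxyMinLifetime=$((7*24*60*60))        # 168h, identical to the issued lifetime
endDate=$(openssl x509 -noout -enddate -in "$proxyFile" | cut -d= -f2)
remainingSec=$(( $(date -d "$endDate" +%s) - $(date +%s) ))
if [ "$remainingSec" -gt "$myproxyMinLifetime" ]; then
    echo "proxy still valid long enough, nothing to do"   # never reached: the remaining lifetime
else                                                      # is always below 168h by construction
    echo "renewing proxy ..."
fi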
Last time you made exactly the same comment and I did not object, even though I was already foreseeing such behavior, but was not yet sure. Now we can actually see it: the proxy is constantly renewed. Since we now have the local variables set (take a look at the PR), we can safely widen or shorten this window as we want. Before, it was rather obscure what was going on.
BTW, the PR contains a few more fixes.
I don't think we need to check for proxy length, as this is already monitored by the AgentStatusWatcher component. For the agent deployment/run, I think we can simplify all of this and simply rely on the cronjob for proxy renewal (not on monitoring the proxy lifetime, as previously mentioned).
hi @amaltaro, but is AgentStatusWatcher updating the proxy if it is expired, or is it just throwing an alarm?
What I did with my latest commit: https://github.com/dmwm/CMSKubernetes/pull/1476/commits/01aade3e560e68c0a0841fd0706ed4709604fb15 is:
- Set the minimum proxy lifetime to 156 hours, which is exactly the proxy lifetime of 168 hours minus the 12 hours you mentioned. Any proxy shorter than that will be renewed.
- Increased the cronjob rate to every hour, so that renewal retries happen more frequently. The function takes no action if the proxy is longer than 156 hours, but with the cronjob we check for that every hour. I think this is the safer way (an example cronjob entry is sketched below).
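For illustration, the corresponding crontab entry would look roughly like this (keeping the minute field of the existing 12-hourly entry):
55 * * * * $WMA_MANAGE_DIR/manage renew-proxy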
I am also fixing yet another issue, which we found while discussing the current one.
When we execute the wmagent-couchapp-init
command during the initialization: https://github.com/dmwm/CMSKubernetes/blob/c2b643cb95e502c7773e16ab6953c9b920476e60/docker/pypi/wmagent/bin/manage#L153C1-L153C26, we create yet another set of CouchDB-related cron jobs, which look like this:
* * * * * echo 'http://***:***@localhost:5984/wmagent_jobdump\%2Fjobs/_design/JobDump/_view/statusByWorkflowName?limit=1' | sed -e 's|\\||g' | xargs curl -s -m 1 > /dev/null
* * * * * echo 'http://***:***@localhost:5984/wmagent_jobdump\%2Ffwjrs/_design/FWJRDump/_view/outputByJobID?limit=1' | sed -e 's|\\||g' | xargs curl -s -m 1 > /dev/null
Unfortunately, we later blindly wipe them out when executing the following command: https://github.com/dmwm/CMSKubernetes/blob/c2b643cb95e502c7773e16ab6953c9b920476e60/docker/pypi/wmagent/init.sh#L305-L309:
crontab -u $WMA_USER - <<EOF
55 */12 * * * $WMA_MANAGE_DIR/manage renew-proxy
58 */12 * * * python $WMA_DEPLOY_DIR/deploy/checkProxy.py --proxy /data/certs/myproxy.pem --time 120 --send-mail True --mail [email protected]
*/15 * * * * source $WMA_DEPLOY_DIR/deploy/restartComponent.sh > /dev/null
EOF
So I had to preserve the already existing cronjobs before adding the new ones.
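A minimal sketch of that idea (not necessarily the exact code in the PR): read back whatever is already installed, append the WMAgent entries, and only then feed the combined list to crontab:
(
    crontab -u $WMA_USER -l 2>/dev/null        # keep the CouchDB cronjobs created earlier
    cat <<EOF
55 */12 * * * $WMA_MANAGE_DIR/manage renew-proxy
58 */12 * * * python $WMA_DEPLOY_DIR/deploy/checkProxy.py --proxy /data/certs/myproxy.pem --time 120 --send-mail True --mail [email protected]
*/15 * * * * source $WMA_DEPLOY_DIR/deploy/restartComponent.sh > /dev/null
EOF
) | crontab -u $WMA_USER -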
Yes, AgentStatusWatcher is only monitoring and firing up alerts whenever it is needed.
Our current setup of renewing the proxy every 12h has been in place for almost 10 years and we have never had any issue with it. I keep my position, namely:
- proxies are created with 168h of lifetime
- a cronjob runs every 12h (given that we renew the proxy every run, there is no need to run it more often)
- if a cronjob fails (server is unavailable), it would have to fail 3 times consecutively before we have an alarm (<= 5 days lifetime)
- no need to juggle with bash, output parsing, lifetime calculations, etc., which just adds more code to be maintained.