
Add TokenManagement solution to WMAgents

Open vkuznet opened this issue 2 years ago • 12 comments

Impact of the new feature: In order to start switching to token-based authentication, we need to decide on and set up a token management solution.

Is your feature request related to a problem? Please describe. Currently, multiple solutions exist:

  • oidc-agent a standard tool which can be installed via RPMs
  • a curl-based approach (used in CMSWEB) relying on IAM credentials, see token.sh and the associated Dockerfile
  • the token manager from auth-proxy-server (APS), see here, which is similar to the curl-based solution but provides daemon capabilities (similar to token.sh)

Describe the solution you'd like: Decide which tool to use and adopt it in a cronjob for WMAgent. For that we need:

  • new spec file
  • new RPM or any other deployment solution
  • optionally obtain IAM credentials (if we'll rely on them)
  • deploy to WMAgents
  • adopt in WMCore workflows, via TokenManager class in all HTTP requests, see #10939
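As an illustration of the last item, here is a minimal sketch of what using a token in every HTTP request could look like, assuming that whichever tool is chosen (cronjob or daemon) keeps a fresh access token in a file. The helper names `read_token` and `build_request` and the token-file layout are hypothetical; this is not the TokenManager API from #10939.

```python
import urllib.request
from pathlib import Path


def read_token(token_path):
    """Return the access token stored in token_path (hypothetical file layout:
    the file contains only the token string)."""
    token = Path(token_path).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"token file {token_path} is empty")
    return token


def build_request(url, token_path):
    """Build (but do not send) a GET request carrying the token as a Bearer header."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Bearer {read_token(token_path)}")
    return request
```

Re-reading the file on each request means a token refreshed by the external tool is picked up without restarting the client.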

Describe alternatives you've considered

Additional context #10118 , #10939

vkuznet avatar Jul 06 '22 12:07 vkuznet

Valentin, my preference would be to actually implement the WMAgent token management in the AgentStatusWatcher component. That means, the component would be responsible for:

  • create a new access token
  • refresh/update the access token
  • monitor the lifetime of the access token
  • create alerts when the access token lifetime falls below a given (configurable) threshold

There is one drawback here, though: if tokens have a very short lifetime, then ensuring that this component is always up and running might become a problem. This includes a possible node crash/reboot, where condor jobs would be recreated while the agent is down...
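A minimal sketch of the lifetime monitoring and threshold check listed above, assuming the access token is a JWT with a standard `exp` claim. The function names are hypothetical and are not part of AgentStatusWatcher; the signature is deliberately not verified, since the sketch only inspects the expiry time.

```python
import base64
import json
import time


def token_expiry(token):
    """Return the 'exp' claim (Unix seconds) of a JWT, without verifying its signature."""
    try:
        payload_b64 = token.split(".")[1]
    except IndexError:
        raise ValueError("not a JWT: expected dot-separated segments") from None
    # JWT segments are base64url-encoded with the padding stripped; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(claims["exp"])


def remaining_lifetime(token, now=None):
    """Return the seconds left before the token expires (negative if already expired)."""
    now = time.time() if now is None else now
    return token_expiry(token) - now


def needs_alert(token, threshold_seconds, now=None):
    """Return True when the remaining lifetime is below the configurable threshold."""
    return remaining_lifetime(token, now) < threshold_seconds
```

A component polling cycle could call `needs_alert` with the configured threshold and trigger a refresh or an alert when it returns True.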

amaltaro avatar Jul 06 '22 20:07 amaltaro

Alan, I doubt it is a good idea. Token management should be independent of the WMA framework/tools. I don't see any benefit in reinventing the wheel. I pointed to three solutions which are independent of the WMA/WMCore tools, and I don't see any benefit in incorporating yet another solution into the WMA/WMCore stack.

vkuznet avatar Jul 09 '22 14:07 vkuznet

In the meeting today Brian noted that keeping a refreshed token for HTCondor jobs is a well-understood process at FNAL. No need to reinvent it. IIUC this means "talk to Farrukh to know more".

belforte avatar Sep 20 '23 14:09 belforte

As discussed in today's WMCore team meeting, we decided to promote this issue to High priority this quarter, while https://github.com/dmwm/WMCore/issues/11728 is getting demoted to Medium priority (this one had been originally considered for Q4).

amaltaro avatar Oct 30 '23 16:10 amaltaro

Just a brief update on this issue. I have been working on this with Stephan L. and HyunWoo from FNAL, running a few tests in submit1 and making a few changes here and there. We can see the ScitokensFile variable in the grid runtime environment, but we still need to ensure that the token: a) is continuously updated on the schedd node; b) gets transferred to the grid runtime; c) gets continuously updated in the grid runtime.

Here is a link to my personal notes: https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_11199/token.txt

amaltaro avatar Nov 27 '23 21:11 amaltaro

Short update: we are still failing to get the kerberos token in auto-pilot (based on the keytab). I also took this opportunity to update the text file mentioned in my previous comment.

amaltaro avatar Dec 19 '23 13:12 amaltaro

Another update: the keytab has been created - it needs to be updated whenever there is a password change - and that seems to be working properly. However, we are still figuring out where the token is transferred to in the grid job and whether it's properly refreshed. That investigation depends on running workflows (jobs) in the grid and communicating with some experts at FNAL. Given the slow progress on that, I am moving this ticket to Waiting.

In addition, I have also transferred the content of the token.txt file above over to WMCore in GitLab: https://gitlab.cern.ch/dmwm/wmcore-docs/-/merge_requests/6

amaltaro avatar Feb 07 '24 17:02 amaltaro

Instead of closing this issue out, I see now that I actually misunderstood this GH issue. This ticket seems to be asking for a solution to manage tokens within the agent, allowing it to communicate with external services (central services, MonIT, CRIC, Rucio, etc.), whereas I have been working on setting up a token on the agent side and propagating it to the production grid jobs.

A new ticket has been created, https://github.com/dmwm/WMCore/issues/11968, which was just added to the project board under Q2/2024. I am now demoting/removing this issue from the current quarter.

amaltaro avatar Apr 15 '24 21:04 amaltaro

Well, we need both. And AFAIK @mapellidario has been waiting for you to lead on both, which seemed fine in the spirit of "WMA has this almost done, let's see what they have before we dive into it". But if this new effort is only starting now, please feel free to talk with him and find out if he can help. I'd like to see tokens in use in CRAB before Dario leaves at the end of August :-)

belforte avatar Apr 15 '24 22:04 belforte

Note: Brian B. was very clear that "this issue was already solved by e.g. FIFE people at FNAL", and IIRC Farrukh should know everything. IIUC the solution does not require running an OIDC agent. Sorry for the noise. Likely you, Stephan, and Valentin already know more/better.

belforte avatar Apr 15 '24 22:04 belforte

@belforte yes, Farrukh was helping with the condor setup on the FNAL schedd side but, as mentioned above, our focus was solely on letting HTCondor manage the token for us and making sure that an up-to-date token would be kept in the grid job and loaded by CMSSW. Unfortunately, we are not able to work on any of the other token-related issues for the moment, as other higher-priority projects are taking the team's effort.

I have this documented here: https://cms-wmcore.docs.cern.ch/wmcore/Tokens-in-WMAgent/#next-steps, or here for a better markdown experience.

Lastly, I think another discussion involving the Fermilab team and Brian is going to happen in May, so we might have a clearer roadmap on how to proceed with token integration as well. Nonetheless, we are happy to talk to Dario if he decides to get started on this.

amaltaro avatar Apr 16 '24 02:04 amaltaro

After talking with @stlammel, we agreed that this issue will be addressed once CMSWEB is tokenized, likely in Q1/2025.

anpicci avatar Jul 05 '24 08:07 anpicci