Tier0 WMAgent deployment model for Alma9/RHEL9

Open amaltaro opened this issue 1 year ago • 21 comments

Impact of the new feature Tier0 WMAgent

This issue may be closed by: https://github.com/dmwm/CMSKubernetes/pull/1466

Is your feature request related to a problem? Please describe. This can potentially become another meta-issue, but it's important to start tracking the Tier0 requirements such that we can properly plan for an impending migration of the Tier0 WMAgent stack, together with condor schedd migration to Alma9 OS.

This goes along with the plans we have for the central production agents, tracked in this meta-issue: https://github.com/dmwm/WMCore/issues/11314, but we need to clarify whether the Tier0 team agrees with the same deployment model and/or what changes are required for the T0 environment.

Describe the solution you'd like For now, central production WMAgent deployment will be:

  • package and upload the wmagent package to PyPI
  • build WMAgent docker images based on https://github.com/dmwm/CMSKubernetes/tree/master/docker/pypi/wmagent and the PyPI WMAgent package.
  • assume condor_schedd will be deployed on the host
  • For MariaDB (Fermilab agents), run it from a Docker container
  • For Oracle, there is no need to run any service, however we do need to have an up-to-date tnsnames.ora file
  • (docker) compose all these images together

Describe alternatives you've considered None

Additional context See containerization meta-issue: https://github.com/dmwm/WMCore/issues/11314
Depends on: https://github.com/dmwm/WMCore/issues/11981
Depends on: https://github.com/dmwm/WMCore/issues/11982

amaltaro avatar Feb 06 '24 15:02 amaltaro

Thanks for creating this issue @amaltaro

Some updates here on that.

There was a non-trivial effort shared between me and @LinaresToine last year to bring the T0 deployment tests up to the level where the WMAgent containers currently are. We have reached the point where the Oracle functionality for the manage scripts and the CouchDB container are needed. Once we resolve https://github.com/dmwm/WMCore/issues/11720 and https://github.com/dmwm/WMCore/issues/11312, the T0 team will be able to fully run T0 agents either from a WMAgent container or directly from the OS, using the script we provide here: https://github.com/dmwm/WMCore/blob/master/deploy/deploy-wmagent-venv.sh (provided the underlying OS is Alma9, RHEL9, or an equivalent one supporting Python 3.8.16 or higher). Last year we tracked down all the needed package equivalences between distributions and OS dependencies during our meetings. Hopefully Antonio still keeps a record of the list :)

FYI @germanfgv @LinaresToine

todor-ivanov avatar Feb 09 '24 10:02 todor-ivanov

Logging the current state of this issue:

Yesterday we took advantage of the fact that the two mandatory PRs needed for T0 to start their tests (https://github.com/dmwm/CMSKubernetes/pull/1409 and https://github.com/dmwm/CMSKubernetes/pull/1451) are already in a semifinal stage, and held a meeting between me, Andrea, German and Antonio to discuss the new initialization mechanisms of the agents, with a small hands-on demonstration from my side of how things currently work. During this process I mostly used Docker containers for both CouchDB and WMAgent, because the initialization process for the agent is the same regardless of the deployment method, be it Docker or a virtual environment.

We agreed on another meeting next week, by which time we should have everything merged from our side, so that they can perform the hands-on activities themselves with guidance from my side.

todor-ivanov avatar Mar 08 '24 06:03 todor-ivanov

p.s. on the comment from above.

During the meeting, we also agreed to start the proper communication with CERN IT in order to assess how safe it is to add the OS UIDs and GIDs to the Docker repositories, as explained in this comment: https://github.com/dmwm/CMSKubernetes/pull/1412#issuecomment-1967363719

I am about to start this communication today, and will include the proper set of people involved from our side.

FYI: @amaltaro @vkuznet @khurtado @anpicci

todor-ivanov avatar Mar 08 '24 07:03 todor-ivanov

Again for logging the activity on this topic:

Today we had yet another meeting, during which the T0 team did the hands-on activities. We found several issues, such as:

  • Additional account management had to be done directly in the scripts. This will be fixed once we add the T0 account to the CMSKubernetes repository, but for that we have to wait for a reply from CERN IT.
  • The CouchDB connection between the agent and the CouchDB instance running in the Docker container failed.

Once the latter is debugged, we plan to have another meeting tomorrow.

FYI: @amaltaro @germanfgv @LinaresToine @anpicci

todor-ivanov avatar Mar 14 '24 17:03 todor-ivanov

Logging the activities again.

Today we had yet another long hands-on meeting between me, @LinaresToine and @anpicci. This time we had to resolve a few more bugs, such as:

  • Mismatch between T0 and Production agent configurations: T0 agents do not configure an ACDC server
  • Typo in fetching the Team name from the WMAgent.secrets file
  • Missing Tier0.* account handling in the init.sh script
  • Skipping the agent config upload step for T0 agents

A PR is coming for resolving all of these.

In the end we managed to fully deploy and initialize the agent. But we failed to start the services because of the old libcurl-to-pycurl backend mismatch (NSS vs. OpenSSL). @LinaresToine is going to try to resolve this issue later, so that they can start testing the deployment of the T0-related packages and eventual injections.

FYI: @amaltaro

todor-ivanov avatar Mar 15 '24 16:03 todor-ivanov

A PR in the CMSKubernetes repository was created to fix the issues pointed out in the previous comment. See https://github.com/dmwm/CMSKubernetes/pull/1457.

I continue to look into the libcurl - to - pycurl issue pointed out by @todor-ivanov

LinaresToine avatar Mar 17 '24 23:03 LinaresToine

Hi @LinaresToine, I've left one comment in the PR. Please take a look.

todor-ivanov avatar Mar 19 '24 09:03 todor-ivanov

Hi @LinaresToine, while you are removing that line I asked about in the review, here is the solution for the backend SSL library mismatch in pycurl. You need to follow both of these steps:

  • At the Host:
[root@vocms0290 data]# yum install curl-openssl libcurl-devel libcurl-openssl-devel
  • Inside the Virtual Environment:
(WMAgent.venv3) cmst1@vocms0290:WMAgent.venv3 $ pip uninstall pycurl
(WMAgent.venv3) cmst1@vocms0290:WMAgent.venv3 $ export PYCURL_SSL_LIBRARY=openssl
(WMAgent.venv3) cmst1@vocms0290:WMAgent.venv3 $ pip install --no-cache-dir --global-option=build_ext --global-option="-L/usr/local/opt/openssl/lib" --global-option="-I/usr/local/opt/openssl/include"  pycurl
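After rebuilding pycurl, one way to confirm which SSL backend it was compiled against is to inspect `pycurl.version`, a string that lists the linked libraries. The following is a minimal sketch, not part of the WMAgent tooling: the helper name `ssl_backend` and the list of backend names it recognizes are assumptions made for illustration.

```python
import re


def ssl_backend(version_string):
    """Return the SSL backend named in a pycurl version string, or None.

    Hypothetical helper: it only recognizes the backend names listed below.
    """
    match = re.search(r"\b(OpenSSL|NSS|GnuTLS|LibreSSL|mbedTLS|wolfSSL)/",
                      version_string)
    return match.group(1) if match else None


if __name__ == "__main__":
    try:
        import pycurl
    except ImportError:
        print("pycurl is not installed in this environment")
    else:
        # pycurl.version lists the libraries pycurl was linked against
        print(pycurl.version)
        print("SSL backend:", ssl_backend(pycurl.version))
```

Run inside the activated virtual environment, this should report OpenSSL once the rebuild has picked up the intended backend. If it still reports NSS, the reinstall most likely reused a previously built wheel or linked against the old libcurl.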

todor-ivanov avatar Mar 19 '24 16:03 todor-ivanov

Thank you very much @todor-ivanov. Regarding the PR, I will modify it right now. I will point out an observation in the PR itself. Regarding the backend solution, I did try the steps you mention inside the virtual environment. I was missing out on the step in the host. Thank you for your insight!!

LinaresToine avatar Mar 19 '24 17:03 LinaresToine

Hello @todor-ivanov. I created a script to keep track of all the steps taken until the manage start-agent step. This script can be seen in https://github.com/dmwm/T0/pull/4931. Worth clarifying that I am sourcing the script, not executing it.

From our meetings I remember the couchdb-docker-build.sh step is not always necessary, however I include it for now.

The script includes the debugging of the pycurl issue with the environment activated. I may be missing something, but I still get the same error message upon running manage start-agent. I continue to look into it.

LinaresToine avatar Mar 19 '24 22:03 LinaresToine

hi @LinaresToine

> I continue to look into it.

Let's quickly meet again next week and we will try to resolve this together.

There were also a few more points raised by @germanfgv yesterday about the corrections we need to apply to the manage script, in order to avoid bringing the ACDC-related variables into the T0 configuration.

todor-ivanov avatar Mar 20 '24 08:03 todor-ivanov

@todor-ivanov @LinaresToine Can you give an update on this issue? Has there been any more progress since the meeting referenced above?

klannon avatar Apr 15 '24 15:04 klannon

Hello @klannon Tier 0 recently got an Alma 9 machine, allowing us to successfully deploy the new agent. That is an update regarding the issue with libcurl and pycurl backend misconfigurations. Regarding the fixes in the init and manage scripts, perhaps @todor-ivanov can give an update on that? Such fixes are documented in https://github.com/dmwm/CMSKubernetes/pull/1457

LinaresToine avatar Apr 15 '24 17:04 LinaresToine

Hi @klannon, we plan to close this issue this week. There are a few more lines that need to go into those fixes, which will be faster if I write them myself. Then we will meet with @LinaresToine to complete the process in another hands-on meeting.

todor-ivanov avatar Apr 15 '24 17:04 todor-ivanov

So, after our long meeting yesterday with @LinaresToine, we ended up with another possible fix addressing more than one problem: https://github.com/dmwm/CMSKubernetes/pull/1466

But there are still a few minor details to sort out before we close it. I'll push more changes later tonight.

todor-ivanov avatar Apr 19 '24 16:04 todor-ivanov

Hi @amaltaro @anpicci @vkuznet, with our latest commits to https://github.com/dmwm/CMSKubernetes/pull/1466 we (me, @LinaresToine and @germanfgv) managed to properly initialize a T0 agent on an Alma9 machine and to test this deployment process during our meeting today. Some new library issues popped up at the last minute, when German decided to actually start a replay (he or Antonio is going to update the issue with the new error). But the biggest success here is that we managed to finish the deployment properly on top of all the other changes we made lately to the WMAgent containers. Please feel free to start your review of the PR in the CMSKubernetes repository.

todor-ivanov avatar May 02 '24 17:05 todor-ivanov

Hello all. As Todor mentions in the comment above, we were able to initialize and start the T0 agent successfully. After starting, the Tier0Feeder component of the agent displays the following error message:

ERROR:StdBase:About to raise exception
<@========== WMException Start ==========@>
Exception Class: WMSpecFactoryException
Message: 'xml.etree.ElementTree.Element' object has no attribute 'getchildren'
    ClassName : None
    ModuleName : WMCore.WMSpec.WMWorkloadTools
    MethodName : _validateArgFunction
    ClassInstance : None
    FileName : /data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py
    LineNumber : 139
    ErrorNr : 0

Traceback:
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 130, in _validateArgFunction
    if not valFunction(value):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/StdSpecs/StdBase.py", line 1034, in
    "CMSSWVersion": {"validate": lambda x: x in releases(),
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/ReqMgr/Tools/cms.py", line 219, in releases
    return TC.releases(arch)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 89, in releases
    for row in self.data():
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 83, in data
    for row in xml_parser(data, pkey):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 62, in xml_parser
    get_children(elem, event, row, key)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 74, in get_children
    for child in elem.getchildren():

@amaltaro @vkuznet @todor-ivanov @anpicci @germanfgv

LinaresToine avatar May 02 '24 19:05 LinaresToine

Hi @LinaresToine @todor-ivanov, it seems to be an error with the XML file itself rather than a bug, right? In addition, the stack trace looks incomplete.

anpicci avatar May 06 '24 10:05 anpicci

Thank you @anpicci . The WMException ended there after the attribute getchildren was not found. However, I can include this portion also:

2024-05-02 19:02:18,253:139701761603136:ERROR:Tier0FeederPoller:Can't configure for run 359688 and stream Calibration
Traceback (most recent call last):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 130, in _validateArgFunction
    if not valFunction(value):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/StdSpecs/StdBase.py", line 1034, in
    "CMSSWVersion": {"validate": lambda x: x in releases(),
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/ReqMgr/Tools/cms.py", line 219, in releases
    return TC.releases(arch)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 89, in releases
    for row in self.data():
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 83, in data
    for row in xml_parser(data, pkey):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 62, in xml_parser
    get_children(elem, event, row, key)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 74, in get_children
    for child in elem.getchildren():
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getchildren'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/StdSpecs/StdBase.py", line 944, in masterValidation
    validateArgumentsCreate(schema, argumentDefinition)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 276, in validateArgumentsCreate
    _validateArgumentOptions(arguments, argumentDefinition, "optional")
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 160, in _validateArgumentOptions
    arguments[arg] = _validateArgument(arg, arguments[arg], argDef)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 101, in _validateArgument
    _validateArgFunction(argument, value, argumentDefinition["validate"])
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 139, in _validateArgFunction
    raise WMSpecFactoryException(str(ex))

LinaresToine avatar May 06 '24 13:05 LinaresToine

We are in the process of final tests here. I gave more details in my comment on the PR, in which I asked @amaltaro and @anpicci for a final review: https://github.com/dmwm/CMSKubernetes/pull/1466#issuecomment-2107090214

todor-ivanov avatar May 13 '24 09:05 todor-ivanov

The xml library removed the getchildren method in Python 3.9 (it had been deprecated since 3.2): https://docs.python.org/3.8/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.getchildren
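For reference, the portable replacement is to iterate over the element directly (or call `list(elem)`), which works on both older and newer Python versions. The sketch below only illustrates that pattern. The XML snippet and the `child_tags` helper are made up for the example and are not the actual TagCollector data or the fix applied in WMCore.

```python
import xml.etree.ElementTree as ET


def child_tags(elem):
    """Collect the tags of the direct children without Element.getchildren()."""
    # Iterating an Element yields its direct children; this replaces
    # `for child in elem.getchildren():`
    return [child.tag for child in elem]


# Hypothetical XML document, used only to exercise the helper
root = ET.fromstring(
    "<projects><architecture name='a'/><architecture name='b'/></projects>"
)
print(child_tags(root))
print(len(list(root)))  # list(elem) is the direct equivalent of getchildren()
```

Applied to the traceback above, this would mean changing the loop in `get_children` in XMLUtils.py from `elem.getchildren()` to plain iteration over `elem`.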

LinaresToine avatar May 13 '24 13:05 LinaresToine