WMCore
Tier0 WMAgent deployment model for Alma9/RHEL9
Impact of the new feature Tier0 WMAgent
This issue may be closed by: https://github.com/dmwm/CMSKubernetes/pull/1466
Is your feature request related to a problem? Please describe. This can potentially become another meta-issue, but it's important to start tracking the Tier0 requirements such that we can properly plan for an impending migration of the Tier0 WMAgent stack, together with condor schedd migration to Alma9 OS.
This goes along the plans we have for central production agent, tracked in this meta-issue: https://github.com/dmwm/WMCore/issues/11314 but we need to clarify whether the Tier0 agrees with the same deployment model and/or what changes are required for the T0 environment.
Describe the solution you'd like For now, central production WMAgent deployment will be:
- package and upload the wmagent package to PyPI
- build WMAgent docker images based on https://github.com/dmwm/CMSKubernetes/tree/master/docker/pypi/wmagent and the PyPI WMAgent package
- assume condor_schedd will be deployed on the host
- For MariaDB (Fermilab agents), run it from a Docker container
- For Oracle, there is no need to run any service, however we do need to have an up-to-date tnsnames.ora file
- (docker) compose all these images together
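The list above can be summarised as a small decision table. The following is only a minimal sketch of that table; the function name `deployment_plan` and the dictionary keys are hypothetical and do not correspond to anything in WMCore or CMSKubernetes.

```python
def deployment_plan(db_backend):
    """Summarise what has to run where for a given database backend.

    Hypothetical helper: it only restates the bullet list above as data.
    """
    backend = db_backend.lower()
    if backend not in ("mariadb", "oracle"):
        raise ValueError("unsupported database backend: %s" % db_backend)

    plan = {
        "containers": ["wmagent"],          # image built from the PyPI package
        "host_services": ["condor_schedd"],  # assumed to be deployed on the host
        "required_files": [],
    }
    if backend == "mariadb":
        # MariaDB runs from its own Docker container
        plan["containers"].append("mariadb")
    else:
        # Oracle needs no local service, only an up-to-date tnsnames.ora
        plan["required_files"].append("tnsnames.ora")
    return plan
```

For example, `deployment_plan("oracle")` lists only the wmagent container plus the tnsnames.ora requirement, while `deployment_plan("mariadb")` adds a second container to compose.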
Describe alternatives you've considered None
Additional context See containerization meta-issue: https://github.com/dmwm/WMCore/issues/11314 Depends on: https://github.com/dmwm/WMCore/issues/11981 Depends on: https://github.com/dmwm/WMCore/issues/11982
Thanks for creating this issue @amaltaro
Some updates here on that.
There was some non-zero effort shared between me and @LinaresToine last year to bring the T0 deployment tests up to the level where the WMAgent containers currently are. We have reached the point where the Oracle functionalities for the manage scripts and the CouchDB container are needed. Once we resolve https://github.com/dmwm/WMCore/issues/11720 and https://github.com/dmwm/WMCore/issues/11312, the T0 Team will automatically be able to fully run T0 agents either from a WMAgent container, or from the OS directly by using the script we provide here: https://github.com/dmwm/WMCore/blob/master/deploy/deploy-wmagent-venv.sh (provided the underlying OS is alma9 or rhel9 or an equivalent one supporting python 3.8.16 or higher). Last year we tracked down all the needed package equivalences between distributions and OS dependencies during our meetings. Hopefully Antonio still keeps a record of the list :)
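The Python floor mentioned above (3.8.16) could be verified on a candidate host before running the venv deployment script. This is a minimal sketch under that assumption; `python_is_supported` is a hypothetical name and is not part of deploy-wmagent-venv.sh.

```python
import sys

# Minimum interpreter version quoted in the comment above
MIN_PYTHON = (3, 8, 16)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor, micro) version meets the minimum.

    version_info defaults to the running interpreter; passing a tuple
    makes the check easy to exercise for other versions.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) >= tuple(minimum)
```

Calling `python_is_supported()` with no arguments checks the interpreter that is currently running, which is the one a virtual environment created from it would use.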
FYI @germanfgv @LinaresToine
logging the current state of this issue:
Yesterday we took advantage of the fact that the two mandatory PRs (https://github.com/dmwm/CMSKubernetes/pull/1409 && https://github.com/dmwm/CMSKubernetes/pull/1451) needed for T0 to start their tests are already in a semifinal stage, and had a meeting between me, Andrea, German and Antonio to discuss the new initialization mechanisms of the agents, with a small hands-on demonstration from my side of how things currently work. During this process I used mostly Docker containers for both CouchDB and WMAgent, because the initialization process for the agent is immutable and independent of the deployment method, be it Docker or a virtual env.
We agreed on another meeting next week, when we would have everything merged on our side, so that next time they would be able to perform the hands-on activities themselves with guidance from my side.
p.s. on the comment from above.
During the meeting, we also agreed on starting the proper communication with CERN IT in order to assess how safe it is to add the OS UIDs and GIDs in the Docker repositories, as explained in this comment: https://github.com/dmwm/CMSKubernetes/pull/1412#issuecomment-1967363719
I am about to start this communication today, and will include the proper set of people involved from our side.
FYI: @amaltaro @vkuznet @khurtado @anpicci
Again for logging the activity on this topic:
Today we had yet another meeting during which the T0 Team was doing the hands-on activities. We found several issues, such as:
- Additional account management needed to happen directly in the scripts ... this will be fixed once we add the T0 account to the CMSKubernetes repository, but for that we'll have to wait for a reply from CERN IT
- The CouchDB connection between the agent and the CouchDB instance from the Docker container failed.
Once the latter is debugged, we plan to have another meeting tomorrow.
FYI: @amaltaro @germanfgv @LinaresToine @anpicci
Logging the activities again.
Today we had yet another long hands-on meeting between me, @LinaresToine and @anpicci. This time we had to resolve a few more bugs, such as:
- Mismatch between T0 and Production agent configurations - T0 agents are not configuring the ACDC server
- Typo in fetching the Team name from the WMAgent.secrets file
- Missing Tier0.* account handling in the init.sh script
- Skipping the agent config upload step for T0 agents
A PR is coming for resolving all of these.
At the very end we managed to fully deploy and initialize the agent. But we failed to start the services because of the old libcurl - to - pycurl backend mismatch (nss vs. openssl). @LinaresToine is going to try to resolve this issue later, so they can start testing the deployment of the T0 related packages and eventual injections.
FYI: @amaltaro
A PR in the CMSKubernetes repository was created to fix the issues pointed out in the previous comment. See https://github.com/dmwm/CMSKubernetes/pull/1457.
I continue to look into the libcurl - to - pycurl issue pointed out by @todor-ivanov
Hi @LinaresToine I've left one comment in the PR. Please take a look
hi @LinaresToine while you are removing that line I asked in the review, here is your solution about the backend ssl library mismatch in pycurl. You need to follow both of those steps:
- At the Host:
[root@vocms0290 data]# yum install curl-openssl libcurl-devel libcurl-openssl-devel
- Inside the Virtual Environment:
(WMAgent.venv3) cmst1@vocms0290:WMAgent.venv3 $ pip uninstall pycurl
(WMAgent.venv3) cmst1@vocms0290:WMAgent.venv3 $ export PYCURL_SSL_LIBRARY=openssl
(WMAgent.venv3) cmst1@vocms0290:WMAgent.venv3 $ pip install --no-cache-dir --global-option=build_ext --global-option="-L/usr/local/opt/openssl/lib" --global-option="-I/usr/local/opt/openssl/include" pycurl
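After rebuilding, one way to confirm which SSL backend is in play is to look at the version string that pycurl exposes as `pycurl.version`. The sketch below only parses such a string; the helper name `ssl_backend`, the list of backends it recognises, and the sample strings used to exercise it are assumptions for illustration and are not output captured from vocms0290.

```python
# Backend names as they appear in libcurl-style version strings (lower-cased).
# This list is an assumption and is not exhaustive.
KNOWN_BACKENDS = ("openssl", "nss", "gnutls")


def ssl_backend(version_string):
    """Return the SSL backend named in a pycurl/libcurl version string.

    The string is a space-separated list of "Name/version" tokens;
    the first token whose name is a known backend wins. Returns None
    if no known backend is found.
    """
    for token in version_string.split():
        name = token.split("/", 1)[0].lower()
        if name in KNOWN_BACKENDS:
            return name
    return None
```

Inside the virtual environment, `import pycurl; print(ssl_backend(pycurl.version))` would then report the backend libcurl is using, provided the import itself succeeds; a compile-time versus link-time backend mismatch typically shows up as an ImportError at that import.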
Thank you very much @todor-ivanov. Regarding the PR, I will modify it right now. I will point out an observation in the PR itself. Regarding the backend solution, I did try the steps you mention inside the virtual environment. I was missing out on the step in the host. Thank you for your insight!!
Hello @todor-ivanov. I created a script to keep track of all the steps taken until the manage start-agent step. This script can be seen in https://github.com/dmwm/T0/pull/4931. Worth clarifying that I am sourcing the script, not executing it.
From our meetings I remember the couchdb-docker-build.sh step is not always necessary, however I include it for now.
The script includes the debugging of the pycurl issue with the environment activated. I may be missing something, but I still get the same error message upon running manage start-agent. I continue to look into it.
hi @LinaresToine
Regarding "I continue to look into it": let's quickly meet again next week and we will try to resolve this together.
There were also a few more highlights from @germanfgv yesterday about the corrections we need to apply to the manage script in order to avoid bringing the ACDC-related variables into the T0 configuration.
@todor-ivanov @LinaresToine Can you give an update on this issue? Has there been any more progress since the meeting referenced above?
Hello @klannon Tier 0 recently got an Alma 9 machine, allowing us to successfully deploy the new agent. That is an update regarding the issue with libcurl and pycurl backend misconfigurations. Regarding the fixes in the init and manage scripts, perhaps @todor-ivanov can give an update on that? Such fixes are documented in https://github.com/dmwm/CMSKubernetes/pull/1457
hi @klannon we plan to close this issue this week. There are a few more lines that need to get into those fixes, which would be faster if I make them myself. And then we meet with @LinaresToine to complete the process in another hands-on meeting.
So, after our long meeting yesterday with @LinaresToine we ended up with another possible fix addressing more than one problem: https://github.com/dmwm/CMSKubernetes/pull/1466
But there are still a few minor details before we close it. I'll push more changes later tonight.
hi @amaltaro @anpicci @vkuznet with our latest commits to https://github.com/dmwm/CMSKubernetes/pull/1466 we (me, @LinaresToine, @germanfgv) managed to properly initialize a T0 agent on an alma9 machine and test this deployment process during our meeting today. Some new library issues popped up at the last minute when German decided to actually start a replay (he or Antonio is going to update the issue with the new error). But the biggest success here was that we managed to finish the deployment properly on top of all the other changes we made lately to the WMAgent containers. Please feel free to start your review of the PR in the CMSKubernetes repository.
Hello all. As Todor mentions in the comment above, we were able to initialize and start the T0 agent successfully. After starting, the Tier0Feeder component of the agent displays the following error message:
ERROR:StdBase:About to raise exception
<@========== WMException Start ==========@>
Exception Class: WMSpecFactoryException
Message: 'xml.etree.ElementTree.Element' object has no attribute 'getchildren'
ClassName : None
ModuleName : WMCore.WMSpec.WMWorkloadTools
MethodName : _validateArgFunction
ClassInstance : None
FileName : /data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py
LineNumber : 139
ErrorNr : 0

Traceback:
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 130, in _validateArgFunction
    if not valFunction(value):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/StdSpecs/StdBase.py", line 1034, in
    "CMSSWVersion": {"validate": lambda x: x in releases(),
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/ReqMgr/Tools/cms.py", line 219, in releases
    return TC.releases(arch)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 89, in releases
    for row in self.data():
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 83, in data
    for row in xml_parser(data, pkey):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 62, in xml_parser
    get_children(elem, event, row, key)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 74, in get_children
    for child in elem.getchildren():
@amaltaro @vkuznet @todor-ivanov @anpicci @germanfgv
Hi @LinaresToine @todor-ivanov, it seems to be an error with the XML file itself, rather than a bug, right? In addition, the stacktrace looks incomplete.
Thank you @anpicci . The WMException ended there after the attribute getchildren was not found. However, I can include this portion also:
2024-05-02 19:02:18,253:139701761603136:ERROR:Tier0FeederPoller:Can't configure for run 359688 and stream Calibration
Traceback (most recent call last):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 130, in _validateArgFunction
    if not valFunction(value):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/StdSpecs/StdBase.py", line 1034, in
    "CMSSWVersion": {"validate": lambda x: x in releases(),
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/ReqMgr/Tools/cms.py", line 219, in releases
    return TC.releases(arch)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 89, in releases
    for row in self.data():
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/TagCollector.py", line 83, in data
    for row in xml_parser(data, pkey):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 62, in xml_parser
    get_children(elem, event, row, key)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/Services/TagCollector/XMLUtils.py", line 74, in get_children
    for child in elem.getchildren():
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getchildren'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/StdSpecs/StdBase.py", line 944, in masterValidation
    validateArgumentsCreate(schema, argumentDefinition)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 276, in validateArgumentsCreate
    _validateArgumentOptions(arguments, argumentDefinition, "optional")
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 160, in _validateArgumentOptions
    arguments[arg] = _validateArgument(arg, arguments[arg], argDef)
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 101, in _validateArgument
    _validateArgFunction(argument, value, argumentDefinition["validate"])
  File "/data/tier0/WMAgent.venv3/srv/WMCore/src/python/WMCore/WMSpec/WMWorkloadTools.py", line 139, in _validateArgFunction
    raise WMSpecFactoryException(str(ex))
We are in the process of final tests here. I gave more details in my comment on the PR, in which I called @amaltaro and @anpicci for final review: https://github.com/dmwm/CMSKubernetes/pull/1466#issuecomment-2107090214
The xml library removed the Element.getchildren() method in Python 3.9 (it had been deprecated since 3.2 in favour of list(elem) or direct iteration): https://docs.python.org/3.8/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.getchildren
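The replacement that the Python documentation suggests for getchildren() is to iterate over the element or call list() on it, which behaves the same on 3.8 and on 3.9+. Below is a minimal sketch of that pattern; the helper name `children` and the sample XML are made up for illustration, and this is not the actual patch to XMLUtils.py.

```python
import xml.etree.ElementTree as ET


def children(elem):
    """Return the direct children of elem without calling getchildren()."""
    # list(elem) is the documented replacement and works on all Python 3 versions
    return list(elem)


# Made-up sample document, only loosely shaped like a release listing
SAMPLE = (
    "<projects>"
    "<architecture name='arch_a'>"
    "<project label='REL_1'/>"
    "<project label='REL_2'/>"
    "</architecture>"
    "</projects>"
)

root = ET.fromstring(SAMPLE)
arch = children(root)[0]
labels = [child.get("label") for child in children(arch)]
```

A loop written as `for child in elem.getchildren():` can equally be rewritten as `for child in elem:`, since Element objects are directly iterable.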