WMCore
Deploy and run WMAgent with Docker container
Impact of the new feature WMAgent
Is your feature request related to a problem? Please describe. Once we have all the WMAgent dependencies sorted out, we should start looking into running WMAgent from a Docker image, making the process of updating our baseline code easier by just stopping the container and pulling a new one.
Describe the solution you'd like With the WMAgent package in PyPI, and a WMAgent docker image uploaded to the Gitlab registry, we should complete the cycle such that we are able to run WMAgent using a Docker image, completing the deprecation of the RPM packages and removing our dependency on them.
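The update cycle described above (stop the container, pull a new image, start again) could be sketched roughly as follows. This is only an illustration: the image path `registry.example.org/wmagent`, the container name `wmagent` and the tag are hypothetical placeholders, not the actual registry location or agent naming, and the script only prints the commands rather than running them.

```python
import shlex

# Hypothetical names, for illustration only (not the real registry path).
IMAGE = "registry.example.org/wmagent"
CONTAINER = "wmagent"


def update_commands(tag, image=IMAGE, container=CONTAINER):
    """Return the docker commands that swap the running agent to `tag`."""
    ref = f"{image}:{tag}"
    return [
        ["docker", "stop", container],   # stop the running agent container
        ["docker", "rm", container],     # remove it so the name can be reused
        ["docker", "pull", ref],         # fetch the new image
        ["docker", "run", "-d", "--name", container, ref],  # start the new one
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for cmd in update_commands("some-tag"):
        print(shlex.join(cmd))
```

A real deployment would also need the agent configuration, mounts and credentials passed to `docker run`; those are deliberately left out here.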
This is a meta-issue for commissioning a Docker-based WMAgent environment for central production. It depends on the following sub-tasks:
- [x] #11276
- [x] #11565
- [x] #8797
- [x] #11647
- [x] #11564
- [x] #11627
- [x] #11583
- [x] #11570
These are now planned for Q1/2024 (updated on Feb/8/2024):
- [x] #11890
- [x] #11720
- [x] #11313
- [x] #11312
- [x] #11946
And the following shall be finished in the beginning of Q2/2024:
- [x] #11566
- [x] #11567
- [x] <!-- Estimate the need for --network=host in WMAgeent Docker builds --> #11635
- [x] #11722
- [x] #11990
- [x] #11945
- [x] #11944
- [ ] #11568
- [x] #11973
- [x] #11977
- [ ] #11978
- [x] #11979
- [x] #11987
- [x] #11993
- [x] #11999
- [x] #12000
- [x] #12007
- [x] #12030
- [x] #12034
while the following tickets will be addressed in a later stage, as they are not critical for the functionality of the containerized solution:
- [ ] #11925
- [ ] #11934
- [ ] #11927
- [ ] #11721
Describe alternatives you've considered None
Additional context Related to:
- Make CA certificates available in our WMAgent test container #10225
- Unable to start agent in Docker container #9675
Dependent on: https://github.com/dmwm/WMCore/issues/11312 and https://github.com/dmwm/WMCore/issues/11313
@amaltaro we do already have wmagent Dockerfile and image but we do not have wmagent configuration. If you'll provide its configuration in services_config gitlab repo then I can test it in dev cluster.
Valentin, I am afraid we don't have budget to work on it this quarter, hence it has not been even considered for Q1/2023. We will get back to this in one of the next quarters.
Meta-issue updated with other relevant tickets created in the last week or two. Please let me know if anything is still missing in here. @vkuznet @todor-ivanov @khurtado
@amaltaro, I think we are missing the part about deployment on a specific infrastructure. For example, if we deploy WMA using docker compose, then we'll need specific manifest file(s), etc. On top of that, we should decide which orchestration to use. For details on the differences between the solutions I suggest googling it, e.g. here is a nice comparison. Therefore, I would suggest creating a dedicated issue for that.
@vkuznet Valentin, that is a good point! We will need to have a final manifest/template integrating all the objects/containers required for running WMAgent in a containerized mode. If you feel like creating it, please go ahead. Otherwise I will get to it tomorrow.
@todor-ivanov thank you for updating this meta issue with the new tickets resulting from: https://github.com/dmwm/CMSKubernetes/pull/1410
As we discussed yesterday during the WMCore meeting, based on my previous experience with this containerization effort, I can point out the most crucial, highest-priority issues that must be resolved for the WMCore team to deliver this new functionality in the shortest possible time. I'll also add my high-level view/feedback on the rest of the issues, but this will need to be confirmed during the dedicated effort on resolving them.
The 3 most important issues with absolutely no alternative for a workaround solution are:
- https://github.com/dmwm/WMCore/issues/11720 - All of the database functionality needed here is already well tested and working in the context of MariaDB, but we must add the needed Oracle alternatives to the `manage` script. One possible obstacle could be the creation of an initial database schema dump and recording it during the deployment procedures for later validation and/or recovery. But this, I believe, is solvable.
- https://github.com/dmwm/WMCore/issues/11313 - This is a must as well, but we already have some markers and a basic construction of how this docker image should look here: https://github.com/dmwm/CMSKubernetes/pull/1412 . The default configurations from my.cnf here and the MariaDB server startup script are confirmed to be working on a direct deployment at the host. I also remember having the MariaDB container built and tested locally, but without the relevant `manage` and `run.sh` scripts for automating its startup.
- https://github.com/dmwm/WMCore/issues/11312 - This is also unavoidable, but again we already have a 99% completed PR resolving the issue: https://github.com/dmwm/CMSKubernetes/pull/1409 The container built from this PR is well tested locally and working, but it is not uploaded to the CERN registry.
Here follows the set of issues, with feedback, which can be resolved with lower than maximum priority (as I said earlier, whatever I state here needs to be confirmed with a dedicated investigation/checks on the topic):
- https://github.com/dmwm/WMCore/issues/11721 - This is good to have in the context of code organization, but things are working at the moment, even with the creation of the init table during deployment.
- https://github.com/dmwm/WMCore/issues/11722 - This is again good to have in order to avoid manual work for adding those resources upon agent deployment. But if we decide that, for the time being, we can live with some extra commands which need to be executed at the agent before we start it in production, we can resolve this issue with medium priority as well.
- https://github.com/dmwm/WMCore/issues/11566 - So far, I have not observed any signs of the agent's performance being affected when MariaDB is run inside a container vs. directly at the host. Of course I did not have the chance to do a high-scale load test, but I am confident we will have no scalability issues in that regard.
- https://github.com/dmwm/WMCore/issues/11312 - Same as above.
- https://github.com/dmwm/WMCore/issues/11635 - The short answer to this issue is: Yes, for the time being, we need to use the `--network=host` option for the WMAgent container at runtime, and NO, it is not needed at container build time. At least until we decide to completely move to a fully distributed solution and deploy WMAgents in Kubernetes clusters. But from what I see, there is a long way to go until we reach that point.
- https://github.com/dmwm/WMCore/issues/11568 - This is indeed a must for us to have a stable CI/CD process, but this for sure will come last. As for how difficult that effort would be, given the resolution of all other issues, I'd let @amaltaro or @vkuznet take a look and share their opinion.
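To make the `--network=host` point concrete, here is a minimal sketch of how the build and run invocations would differ. The helper names, the image tag and the container name are hypothetical; only the `docker build -t`, `docker run -d --name` and `--network=host` options are standard Docker CLI, and the real agent start-up would need additional arguments not shown here.

```python
import shlex


def docker_build_cmd(context_dir, tag):
    """Build-time command: default build networking, no --network=host."""
    return ["docker", "build", "-t", tag, context_dir]


def docker_run_cmd(image, name, host_network=True):
    """Run-time command: the container shares the host network stack."""
    cmd = ["docker", "run", "-d", "--name", name]
    if host_network:
        cmd.append("--network=host")
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Hypothetical tag and container name, for illustration only.
    print(shlex.join(docker_build_cmd(".", "wmagent:some-tag")))
    print(shlex.join(docker_run_cmd("wmagent:some-tag", "wmagent")))
```

Keeping `host_network` as a parameter leaves room to drop the option later, should the agents move to a Kubernetes deployment.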
FYI @klannon @amaltaro
Thank you for sharing these insights, Todor.
I will soon start moving a few of these items under the Q1 quarter then, following your suggestion of: #11720, #11313, #11312. In addition, I am inclined to say that the CI/Jenkins pipeline (issue #11568) should be updated accordingly, and before we actually implement this new model in production; otherwise I fear it will be easier to introduce issues into production. We also need to pay closer attention to the Tier0 agent model and make sure that whenever central production is ready to migrate to the latest OS, the Tier0 agent can come along as well. For that, I created this new (meta-)issue: https://github.com/dmwm/WMCore/issues/11890
> https://github.com/dmwm/WMCore/issues/11635 - The short answer to this issue is: Yes, for the time being, we need to use the `--network=host` option for the WMAgent container at runtime, and NO, it is not needed at container build time. At least until we decide to completely move to a fully distributed solution and deploy WMAgents in Kubernetes clusters. But from what I see, there is a long way to go until we reach that point.

This is important information, and we had better update the GH issue itself with it. Can you please update it?
hi @amaltaro
> I am inclined to say that the CI/Jenkins pipeline (issue https://github.com/dmwm/WMCore/issues/11568) should be updated accordingly and before we actually implement this new model in production, otherwise I fear it will be easier to insert issues into production.

Well then, it makes it 4 of them to get in as high priority. I am still positive about the outcome.