Deploy and run WMAgent with Docker container

Open amaltaro opened this issue 2 years ago • 9 comments

Impact of the new feature WMAgent

Is your feature request related to a problem? Please describe. Once we have all the WMAgent dependencies sorted out, we should start looking into running WMAgent from a Docker image, making the process of updating our baseline code easier by just stopping the container and pulling a new one.
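The update cycle described above (stop the container, pull a new image, start it again) could look roughly like the sketch below. It only assembles standard `docker` CLI calls; the image path, container name and tag are hypothetical placeholders, since this issue does not state the real ones, and a real agent would also need mounts, credentials and configuration that are left out here.

```python
import subprocess

# Hypothetical names for illustration only: the real registry path,
# tag and container name are not given in this issue.
IMAGE = "registry.example.org/wmagent"
CONTAINER = "wmagent"


def update_commands(tag, image=IMAGE, container=CONTAINER):
    """Return the docker CLI calls for a stop / pull / restart cycle."""
    ref = f"{image}:{tag}"
    return [
        ["docker", "stop", container],  # stop the running agent container
        ["docker", "rm", container],    # remove it so the name can be reused
        ["docker", "pull", ref],        # fetch the new baseline image
        ["docker", "run", "-d", "--name", container, ref],  # start the new one
    ]


def update(tag, dry_run=True):
    """Print the commands (dry run) or execute them one by one."""
    for cmd in update_commands(tag):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # abort on the first failure


if __name__ == "__main__":
    update("X.Y.Z")  # placeholder tag; only prints the commands
```

Keeping the command list separate from its execution makes it easy to review what would run before touching a production agent.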

Describe the solution you'd like With the WMAgent package in PyPI, and a WMAgent Docker image uploaded to the GitLab registry, we should complete the cycle such that we are able to run WMAgent using a Docker image, completing the deprecation of RPM packages and removing our dependency on them.

This is a meta-issue for commissioning a Docker-based WMAgent environment for central production. It depends on the following sub-tasks:

  • [x] #11276
  • [x] #11565
  • [x] #8797
  • [x] #11647
  • [x] #11564
  • [x] #11627
  • [x] #11583
  • [x] #11570

These are now planned for Q1/2024 (updated on Feb/8/2024):

  • [x] #11890
  • [x] #11720
  • [x] #11313
  • [x] #11312
  • [x] #11946

And the following shall be finished in the beginning of Q2/2024:

  • [x] #11566
  • [x] #11567
  • [x] <!-- Estimate the need for --network=host in WMAgent Docker builds --> #11635
  • [x] #11722
  • [x] #11990
  • [x] #11945
  • [x] #11944
  • [ ] #11568
  • [x] #11973
  • [x] #11977
  • [ ] #11978
  • [x] #11979
  • [x] #11987
  • [x] #11993
  • [x] #11999
  • [x] #12000
  • [x] #12007
  • [x] #12030
  • [x] #12034

while the following tickets will be addressed in a later stage, as they are not critical for the functionality of the containerized solution:

  • [ ] #11925
  • [ ] #11934
  • [ ] #11927
  • [ ] #11721

Describe alternatives you've considered None

Additional context Related to:

  • Make CA certificates available in our WMAgent test container #10225
  • Unable to start agent in Docker container #9675

Dependent on: https://github.com/dmwm/WMCore/issues/11312 and https://github.com/dmwm/WMCore/issues/11313

amaltaro avatar Oct 03 '22 19:10 amaltaro

@amaltaro we already have a WMAgent Dockerfile and image, but we do not have a WMAgent configuration. If you provide its configuration in the services_config GitLab repo, I can test it in the dev cluster.

vkuznet avatar Jan 09 '23 13:01 vkuznet

Valentin, I am afraid we don't have the budget to work on it this quarter, hence it has not even been considered for Q1/2023. We will get back to this in one of the next quarters.

amaltaro avatar Jan 09 '23 14:01 amaltaro

Meta-issue updated with other relevant tickets created in the last week or two. Please let me know if anything is still missing in here. @vkuznet @todor-ivanov @khurtado

amaltaro avatar May 09 '23 14:05 amaltaro

@amaltaro, I think we are missing the part about deployment on a specific infrastructure. For example, if we deploy WMA using docker compose, then we'll need specific manifest file(s), etc. On top of that, we should decide which orchestration to use. For details on the differences between the solutions I suggest googling it, e.g. here is a nice comparison. Therefore, I would suggest creating a dedicated issue for that.

vkuznet avatar May 09 '23 16:05 vkuznet

@vkuznet Valentin, that is a good point! We will need to have a final manifest/template integrating all the objects/containers required for running WMAgent in a containerized mode. If you feel like creating it, please go ahead. Otherwise I will get to it tomorrow.

amaltaro avatar May 09 '23 17:05 amaltaro

@todor-ivanov thank you for updating this meta issue with the new tickets that resulted from: https://github.com/dmwm/CMSKubernetes/pull/1410

amaltaro avatar Sep 25 '23 13:09 amaltaro

As we discussed yesterday during the WMCore meeting, based on my previous experience with this containerization effort, I can point out the most crucial, highest-priority issues that need to be resolved for the WMCore Team to deliver this new functionality in the shortest possible time. I'll also add my high-level view/feedback on the rest of the issues, but this will need to be confirmed during the dedicated effort on resolving them.

The 3 most important issues, for which there is absolutely no workaround, are:

  • https://github.com/dmwm/WMCore/issues/11720 - All of the database functionality needed here is already well tested and working in the context of MariaDB, but we must add the needed Oracle alternatives to the manage script. One possible obstacle could be the creation of an initial database schema dump and recording it during the deployment procedures for later validation and/or recovery. But I believe this is solvable.

  • https://github.com/dmwm/WMCore/issues/11313 - This is a must as well, but we already have some markers and a basic outline of how this Docker image should look here: https://github.com/dmwm/CMSKubernetes/pull/1412 . The default configurations from my.cnf here and the MariaDB server startup script are confirmed to be working in a direct deployment on the host. I also remember having the MariaDB container built and tested locally, but without the relevant manage and run.sh scripts for automating its startup.

  • https://github.com/dmwm/WMCore/issues/11312 - This is also unavoidable, but again we already have a 99% completed PR resolving the issue: https://github.com/dmwm/CMSKubernetes/pull/1409 The container built from this PR is well tested locally and working, but it is not uploaded to the CERN registry.

Here follows the set of issues, with feedback, which can be resolved with lower than maximum priority (as I said earlier, whatever I state here needs to be confirmed with a dedicated investigation/checks on the topic):

  • https://github.com/dmwm/WMCore/issues/11721 - This is good to have in the context of code organization, but things are working at the moment, even with creation of the init table during deployment.

  • https://github.com/dmwm/WMCore/issues/11722 - This is again good to have in order to avoid manual work for adding those resources upon agent deployment, but if we decide that, for the time being, we can live with some extra commands that need to be executed on the agent before we start it in production, we can resolve this issue with medium priority as well.

  • https://github.com/dmwm/WMCore/issues/11566 - So far, I have not observed any signs of the agent's performance being affected when MariaDB runs inside a container vs. directly on the host. Of course I did not have the chance to do a high-scale load test, but I am confident we will have no scalability issues in that regard.

  • https://github.com/dmwm/WMCore/issues/11312 - Same as above.

  • https://github.com/dmwm/WMCore/issues/11635 - The short answer to this issue is: Yes, for the time being, we need to use the --network=host option for the WMAgent container at runtime, and NO, it is not needed at container build time. At least until we decide to completely move to a fully distributed solution and deploy WMAgents in Kubernetes clusters. But from what I see, there is a long way to go until we reach that point.

  • https://github.com/dmwm/WMCore/issues/11568 - This is indeed a must for us to have a stable CI/CD process, but this will for sure come last. As for how difficult that effort would be, given the resolution of all other issues, I'd let @amaltaro or @vkuznet take a look and share their opinion.
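To make the build-time vs. runtime distinction from the #11635 item concrete, here is a minimal sketch that assembles the two `docker` invocations. The image reference and container name are hypothetical placeholders (the issue does not name them), and any other runtime options the agent needs are omitted; only the placement of `--network=host` is the point.

```python
import shlex

# Hypothetical image reference; the real registry path is not given here.
IMAGE = "registry.example.org/wmagent:latest"


def build_command(context_dir=".", image=IMAGE):
    """Build the image; host networking is not requested at build time."""
    return ["docker", "build", "-t", image, context_dir]


def run_command(image=IMAGE, name="wmagent"):
    """Start the container sharing the host network stack at runtime."""
    return ["docker", "run", "-d", "--name", name, "--network=host", image]


if __name__ == "__main__":
    # Print both command lines so the difference is easy to compare.
    print(shlex.join(build_command()))
    print(shlex.join(run_command()))
```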

FYI @klannon @amaltaro

todor-ivanov avatar Feb 06 '24 09:02 todor-ivanov

Thank you for sharing these insights, Todor.

I will soon start moving a few of these items under the Q1 quarter then, following your suggestion of: #11720, #11313, #11312. In addition, I am inclined to say that the CI/Jenkins pipeline (issue #11568) should be updated accordingly and before we actually implement this new model in production, otherwise I fear it will be easier to introduce issues into production. We also need to pay closer attention to the Tier0 agent model and make sure that whenever central production is ready to migrate to the latest OS, the Tier0 agent can come along as well. For that, I created this new (meta-)issue: https://github.com/dmwm/WMCore/issues/11890

> https://github.com/dmwm/WMCore/issues/11635 - The short answer to this issue is: Yes, for the time being, we need to use the --network=host option for the WMAgent container at runtime, and NO, it is not needed at container build time. At least until we decide to completely move to a fully distributed solution and deploy WMAgents in Kubernetes clusters. But from what I see, there is a long way to go until we reach that point.

This is important information, and we had better update the GH issue itself with it. Can you please update it?

amaltaro avatar Feb 06 '24 15:02 amaltaro

hi @amaltaro

> I am inclined to say that the CI/Jenkins pipeline (issue https://github.com/dmwm/WMCore/issues/11568) should be updated accordingly and before we actually implement this new model in production, otherwise I fear it will be easier to introduce issues into production.

Well then, that makes 4 of them to go in as high priority. I am still positive about the outcome.

todor-ivanov avatar Feb 06 '24 18:02 todor-ivanov