
Pre-production integration and validation of WMAgent containerization solution

amaltaro opened this issue 11 months ago • 6 comments

Impact of the new feature: WMAgent

Is your feature request related to a problem? Please describe.
Once we are ready with the required developments, we need to integrate the multiple containers (for databases and WMAgent) and run an integration and pre-production validation.

Describe the solution you'd like
Run a pre-production validation of the new WMAgent deployment model based on Docker images.

The following checklist is meant to cover the most common and important features; each of these items must be validated.

AT CERN:

  • [ ] CouchDB: manage status/stop/start/restart
  • [ ] Oracle prompt: manage db-prompt wmagent (see the manage sketch after this list)
  • [ ] Data persistence - and logs - between CouchDB container restarts
  • [ ] Proxy renewal setup
  • [ ] Cronjob setup
  • [ ] Configuration tweaks for a testbed agent (e.g. Rucio URL, MaxRetries)
  • [ ] Configuration tweaks for a production agent (e.g. Rucio URL, MaxRetries)
  • [ ] Opportunistic resources
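
A minimal sketch of the service-management calls being validated (the commands come straight from this checklist; each is run inside the container hosting the corresponding service):

  manage status                # report whether the service is running
  manage stop                  # stop the service
  manage start                 # start it again
  manage restart               # stop followed by start
  manage db-prompt wmagent     # from the agent container: open the Oracle prompt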

in addition to the following workflow-related tests:

  • [ ] Run a Monte Carlo from scratch
  • [ ] Run a workflow with pileup data
  • [ ] Run a ReReco workflow
  • [ ] Abort a workflow with jobs in condor
  • [ ] Force complete a workflow with jobs in condor
  • [ ] Run a workflow that creates Rucio output data placement
  • [ ] Force JobStatusLite to kill jobs in Condor
  • [ ] Update a workflow priority and ensure that Pending jobs in the condor pool are updated accordingly (see the condor_q sketch after this list)
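
For the priority test, a hedged sketch of how the condor side could be checked (the WMAgent_RequestName classad and the request name are assumptions; JobStatus == 1 means Idle/Pending):

  REQUEST="some_TaskChain_workflow_v1"    # hypothetical request name
  # JobPrio carried by the pending jobs before the change:
  condor_q -constraint "JobStatus == 1 && WMAgent_RequestName == \"$REQUEST\"" -af ClusterId ProcId JobPrio
  # After the central priority update, the same query should eventually
  # report the new JobPrio on the still-pending jobs.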

then in an agent at CERN:

  • [ ] Deploy an agent + couchdb + oracle from scratch - and run 1 workflow
  • [ ] Restart the agent container - and run another workflow
  • [ ] Restart CouchDB container - and run another workflow
  • [ ] Restart the whole node while containers are running and see how resilient the system is
  • [ ] Create a patch tag at the WMCore repository, build the new Docker image, upload it to the CERN registry, and redeploy it on the previously running agent. A re-initialization from scratch must NOT start if the new tag is a patch version or a release candidate (see the redeploy sketch after this list)
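
A hedged sketch of the redeploy step under test, assuming a plain docker workflow (image path, tag, container name, and mount point are all illustrative; the real deployment scripts may differ):

  TAG=2.3.4.1    # hypothetical patch tag
  docker pull registry.cern.ch/cmsweb/wmagent:$TAG
  docker stop wmagent && docker rm wmagent
  # Re-mounting the same host volume is what should let the agent detect a
  # patch version or release candidate and skip re-initialization:
  docker run -d --name wmagent -v /data/srv/wmagent:/data/srv/wmagent registry.cern.ch/cmsweb/wmagent:$TAG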

AT FNAL:

  • [ ] CouchDB: manage status/stop/start/restart
  • [ ] MariaDB: manage status/stop/start/restart
  • [ ] MariaDB prompt: manage db-prompt wmagent
  • [ ] Data persistence - and logs - between CouchDB container restarts
  • [ ] Data persistence - and logs - between MariaDB container restarts (see the persistence check after this list)
  • [ ] Proxy renewal setup
  • [ ] Cronjob setup
  • [ ] Configuration tweaks for a testbed agent (e.g. Rucio URL, MaxRetries)
  • [ ] Configuration tweaks for a production agent (e.g. Rucio URL, MaxRetries)
  • [ ] Opportunistic resources
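
A minimal sketch for the CouchDB persistence check (container name, admin credentials, and the default port 5984 are assumptions; the same pattern applies to MariaDB by comparing SHOW DATABASES output across a restart):

  curl -s -u "$COUCH_USER:$COUCH_PASS" http://localhost:5984/_all_dbs > dbs.before
  docker restart couchdb
  sleep 10
  curl -s -u "$COUCH_USER:$COUCH_PASS" http://localhost:5984/_all_dbs > dbs.after
  diff dbs.before dbs.after && echo "database list survived the restart"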

in addition to the following workflow-related tests:

  • [ ] Run a Monte Carlo from scratch
  • [ ] Run a workflow with pileup data
  • [ ] Run a ReReco workflow
  • [ ] Abort a workflow with jobs in condor (see the verification sketch after this list)
  • [ ] Force complete a workflow with jobs in condor
  • [ ] Run a workflow that creates Rucio output data placement
  • [ ] Force JobStatusLite to kill jobs in Condor
  • [ ] Update a workflow priority and ensure Pending jobs in the condor pool are properly updated as well
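
For the abort test, a hedged way to confirm that the condor pool was cleaned up afterwards (again assuming the WMAgent_RequestName classad; an empty result means no jobs are left for the workflow):

  REQUEST="some_TaskChain_workflow_v1"    # hypothetical request name
  condor_q -constraint "WMAgent_RequestName == \"$REQUEST\"" -af ClusterId ProcId JobStatus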

lastly, in a Fermilab agent:

  • [ ] Deploy an agent + couchdb + mariadb from scratch - and run 1 workflow
  • [ ] Restart the agent container - and run another workflow
  • [ ] Restart CouchDB container - and run another workflow
  • [ ] Restart MariaDB container - and run another workflow
  • [ ] Restart the whole node while containers are running and see how resilient the system is (probably not possible at FNAL)
  • [ ] Create a patch tag at the WMCore repository, build the new Docker image, upload it to the CERN registry, and redeploy it on the previously running agent. A re-initialization from scratch must NOT start if the new tag is a patch version or a release candidate.

NOTE that all of these tests need to be performed with an official image uploaded to the CERN registry, once we have identified a stable image for such tests.
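
Fetching that official image would look roughly like this (registry path and tag are illustrative; the stable tag is still to be decided):

  docker login registry.cern.ch
  docker pull registry.cern.ch/cmsweb/wmagent:STABLE_TAG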

Describe alternatives you've considered
We can request that the T0 team run Tier0-related validations as well.

Additional context
Sub-task for this meta issue: https://github.com/dmwm/WMCore/issues/11314

amaltaro • Mar 26 '24

For clarification and future reference, the plan is that @todor-ivanov will work on the CERN side and I will work on the FNAL one.

We can proceed further as soon as https://github.com/dmwm/WMCore/issues/11944 is closed

anpicci • Apr 29 '24

@todor-ivanov @anpicci I have updated the original description of this issue and made a checklist of everything that needs to be considered during this evaluation phase. Please let me know if anything needs further clarification; likewise, if you see important tests missing from the checklist, please add them.

Note that we are not yet ready to start this validation, as there have been changes made this week and a few others are still under development.

amaltaro • May 17 '24

@amaltaro thanks! I am currently working on the last four points in the list you provided. Have you had a chance to look at the error I am getting when injecting a workflow?

anpicci • May 17 '24

@anpicci thanks for looking into those. However, I do think we will have to repeat those tests in the coming days, as there have been many changes involving the containers recently (and a few more are coming up very soon).

About the USER error, I will give the other FNAL node a try as well. But as we are discussing on Mattermost, it's strange that it only happens on the FNAL node.

amaltaro • May 17 '24

Hi @amaltaro, I know we aren't yet at the point where we can run the final tests, but I have started to play with them a bit for my own education.

For CouchDB, I see that manage stop has no real effect on the database. I nevertheless tried running manage status and manage restart, and here is what I get in both cases:

manage status

3.2.2 is RUNNING
[{"node":"nonode@nohost","pid":"<0.13653.381>","process_status":"waiting","changes_pending":0,"checkpoint_interval":120000,"checkpointed_source_seq":"27-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5kT0XKMCemGyamGqciq4Yh_Y8FiDJ0ACk_kNNEQGbkmZskppmaISuJwsAen0rJw","continuous":true,"database":"shards/80000000-ffffffff/_replicator.1715183192","doc_id":"a305436349958fbda15e56711d87ba36","doc_write_failures":0,"docs_read":16,"docs_written":16,"missing_revisions_found":16,"replication_id":"1f8e8cac41eb6149332bbbd1225b6b65+continuous","revisions_checked":16,"source":"http://127.0.0.1:5984/wmagent_summary/","source_seq":"27-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5kT0XKMCemGyamGqciq4Yh_Y8FiDJ0ACk_kNNEQGbkmZskppmaISuJwsAen0rJw","started_on":1716238328,"target":"https://cmsweb-testbed.cern.ch/couchdb/wmstats/","through_seq":"27-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5kT0XKMCemGyamGqciq4Yh_Y8FiDJ0ACk_kNNEQGbkmZskppmaISuJwsAen0rJw","type":"replication","updated_on":1716368304,"user":null},{"node":"nonode@nohost","pid":"<0.14469.381>","process_status":"waiting","changes_pending":0,"checkpoint_interval":120000,"checkpointed_source_seq":"228-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5MSAXKMCelJRoamhkjq4Yh_Y8FiDJ0ACk_kNNmQI2xdwyKcXSOAVdTxYAg-QrmA","continuous":true,"database":"shards/00000000-7fffffff/_replicator.1715183192","doc_id":"a305436349958fbda15e56711d87d0e6","doc_write_failures":0,"docs_read":185,"docs_written":185,"missing_revisions_found":185,"replication_id":"326d9d2c703ae43767df3bb3b05b91fa+continuous","revisions_checked":203,"source":"http://127.0.0.1:5984/workqueue_inbox/","source_seq":"228-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5MSAXKMCelJRoamhkjq4Yh_Y8FiDJ0ACk_kNNmQI2xdwyKcXSOAVdTxYAg-QrmA","started_on":1716238325,"target":"https://cmsweb-testbed.cern.ch/couchdb/workqueue/","through_seq":"228-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5MSAXKMCelJRoamhkjq4Yh_Y8FiDJ0ACk_kNNmQI2xdwyKcXSOAVdTxYAg-QrmA","type":"replication","updated_on":1716368248,"user":null},{"node":"nonode@nohost","pid":"<0.3163.415>","process_status":"waiting","changes_pending":0,"checkpoint_interval":120000,"checkpointed_source_seq":"144168-g1AAAACheJzLYWBgYMpgTmEQTM4vTc5ISXIwNDLXMwBCwxyQVCJDUv3___-zMpiTGBi6j-QCxdhNEs3TzCxMsenBY1IeC5BkaABS_2EGMi5PARtonJJkYJJkhk1rFgBY4Sqt","continuous":true,"database":"shards/80000000-ffffffff/_replicator.1715183192","doc_id":"a305436349958fbda15e56711d87c6e6","doc_write_failures":0,"docs_read":18,"docs_written":18,"missing_revisions_found":18,"replication_id":"28c6915117e0403ca970b050ed2feb37+continuous","revisions_checked":203,"source":"https://cmsweb-testbed.cern.ch/couchdb/workqueue/","source_seq":"144168-g1AAAACheJzLYWBgYMpgTmEQTM4vTc5ISXIwNDLXMwBCwxyQVCJDUv3___-zMpiTGBi6j-QCxdhNEs3TzCxMsenBY1IeC5BkaABS_2EGMi5PARtonJJkYJJkhk1rFgBY4Sqt","started_on":1716323218,"target":"http://127.0.0.1:5984/workqueue_inbox/","through_seq":"144168-g1AAAACheJzLYWBgYMpgTmEQTM4vTc5ISXIwNDLXMwBCwxyQVCJDUv3___-zMpiTGBi6j-QCxdhNEs3TzCxMsenBY1IeC5BkaABS_2EGMi5PARtonJJkYJJkhk1rFgBY4Sqt","type":"replication","updated_on":1716368259,"user":null}]
Replication 'id=a305436349958fbda15e56711d87ba36 source=http://cmsdataops:[email protected]:5984/wmagent_summary target=https://cmsweb-testbed.cern.ch/couchdb/wmstats filter=WMStatsAgent/repfilter' unknown.
Replication 'id=a305436349958fbda15e56711d87c6e6 source=https://cmsweb-testbed.cern.ch/couchdb/workqueue target=http://cmsdataops:[email protected]:5984/workqueue_inbox filter=WorkQueue/queueFilter' unknown.
Replication 'id=a305436349958fbda15e56711d87d0e6 source=http://cmsdataops:[email protected]:5984/workqueue_inbox target=https://cmsweb-testbed.cern.ch/couchdb/workqueue filter=WorkQueue/queueFilter' unknown.

manage restart

Stopping CouchDB service...
Which couchdb: /opt/couchdb/bin/couchdb
  With configuration directory: /data/srv/couchdb/3.2.2/config
  With logdir: /data/srv/couchdb/3.2.2/logs
  nohup couchdb -couch_ini /data/srv/couchdb/3.2.2/config >> /data/srv/couchdb/3.2.2/logs/couch.log 2>&1 &
(CouchDB-3.2.2) [cmsdataops@cmssrv810:data]$ manage status
3.2.2 is RUNNING
[{"node":"nonode@nohost","pid":"<0.13653.381>","process_status":"waiting","changes_pending":0,"checkpoint_interval":120000,"checkpointed_source_seq":"27-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5kT0XKMCemGyamGqciq4Yh_Y8FiDJ0ACk_kNNEQGbkmZskppmaISuJwsAen0rJw","continuous":true,"database":"shards/80000000-ffffffff/_replicator.1715183192","doc_id":"a305436349958fbda15e56711d87ba36","doc_write_failures":0,"docs_read":16,"docs_written":16,"missing_revisions_found":16,"replication_id":"1f8e8cac41eb6149332bbbd1225b6b65+continuous","revisions_checked":16,"source":"http://127.0.0.1:5984/wmagent_summary/","source_seq":"27-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5kT0XKMCemGyamGqciq4Yh_Y8FiDJ0ACk_kNNEQGbkmZskppmaISuJwsAen0rJw","started_on":1716238328,"target":"https://cmsweb-testbed.cern.ch/couchdb/wmstats/","through_seq":"27-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5kT0XKMCemGyamGqciq4Yh_Y8FiDJ0ACk_kNNEQGbkmZskppmaISuJwsAen0rJw","type":"replication","updated_on":1716368424,"user":null},{"node":"nonode@nohost","pid":"<0.14469.381>","process_status":"waiting","changes_pending":0,"checkpoint_interval":120000,"checkpointed_source_seq":"228-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5MSAXKMCelJRoamhkjq4Yh_Y8FiDJ0ACk_kNNmQI2xdwyKcXSOAVdTxYAg-QrmA","continuous":true,"database":"shards/00000000-7fffffff/_replicator.1715183192","doc_id":"a305436349958fbda15e56711d87d0e6","doc_write_failures":0,"docs_read":185,"docs_written":185,"missing_revisions_found":185,"replication_id":"326d9d2c703ae43767df3bb3b05b91fa+continuous","revisions_checked":203,"source":"http://127.0.0.1:5984/workqueue_inbox/","source_seq":"228-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5MSAXKMCelJRoamhkjq4Yh_Y8FiDJ0ACk_kNNmQI2xdwyKcXSOAVdTxYAg-QrmA","started_on":1716238325,"target":"https://cmsweb-testbed.cern.ch/couchdb/workqueue/","through_seq":"228-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5MSAXKMCelJRoamhkjq4Yh_Y8FiDJ0ACk_kNNmQI2xdwyKcXSOAVdTxYAg-QrmA","type":"replication","updated_on":1716368368,"user":null},{"node":"nonode@nohost","pid":"<0.3163.415>","process_status":"waiting","changes_pending":0,"checkpoint_interval":120000,"checkpointed_source_seq":"144170-g1AAAACheJzLYWBgYMpgTmEQTM4vTc5ISXIwNDLXMwBCwxyQVCJDUv3___-zMpiTGBi6j-YCxdhNEs3TzCxMsenBY1IeC5BkaABS_2EGMi5PBRtonJJkYJJkhk1rFgBZdiqv","continuous":true,"database":"shards/80000000-ffffffff/_replicator.1715183192","doc_id":"a305436349958fbda15e56711d87c6e6","doc_write_failures":0,"docs_read":18,"docs_written":18,"missing_revisions_found":18,"replication_id":"28c6915117e0403ca970b050ed2feb37+continuous","revisions_checked":203,"source":"https://cmsweb-testbed.cern.ch/couchdb/workqueue/","source_seq":"144170-g1AAAACheJzLYWBgYMpgTmEQTM4vTc5ISXIwNDLXMwBCwxyQVCJDUv3___-zMpiTGBi6j-YCxdhNEs3TzCxMsenBY1IeC5BkaABS_2EGMi5PBRtonJJkYJJkhk1rFgBZdiqv","started_on":1716323218,"target":"http://127.0.0.1:5984/workqueue_inbox/","through_seq":"144170-g1AAAACheJzLYWBgYMpgTmEQTM4vTc5ISXIwNDLXMwBCwxyQVCJDUv3___-zMpiTGBi6j-YCxdhNEs3TzCxMsenBY1IeC5BkaABS_2EGMi5PBRtonJJkYJJkhk1rFgBZdiqv","type":"replication","updated_on":1716368380,"user":null}]
Replication 'id=a305436349958fbda15e56711d87ba36 source=http://cmsdataops:[email protected]:5984/wmagent_summary target=https://cmsweb-testbed.cern.ch/couchdb/wmstats filter=WMStatsAgent/repfilter' unknown.
Replication 'id=a305436349958fbda15e56711d87c6e6 source=https://cmsweb-testbed.cern.ch/couchdb/workqueue target=http://cmsdataops:[email protected]:5984/workqueue_inbox filter=WorkQueue/queueFilter' unknown.
Replication 'id=a305436349958fbda15e56711d87d0e6 source=http://cmsdataops:[email protected]:5984/workqueue_inbox target=https://cmsweb-testbed.cern.ch/couchdb/workqueue filter=WorkQueue/queueFilter' unknown.
(CouchDB-3.2.2) [cmsdataops@cmssrv810:data]$ manage stop
Stopping CouchDB service...

What I read in the CouchDB log:

[info] 2024-05-22T09:12:04.901281Z nonode@nohost <0.31383.450> -------- Starting index update for db: shards/80000000-ffffffff/workqueue.1715183197 idx: _design/WorkQueue
[info] 2024-05-22T09:12:04.967377Z nonode@nohost <0.31383.450> -------- Index update finished for db: shards/80000000-ffffffff/workqueue.1715183197 idx: _design/WorkQueue
[notice] 2024-05-22T09:12:22.035233Z nonode@nohost <0.17303.402> 236a03a418 127.0.0.1:5984 127.0.0.1 cmsdataops GET /workqueue_inbox/_changes?filter=WorkQueue%2FqueueFilter&parentUrl=https%3A%2F%2Fcmsweb-testbed.cern.ch%2Fcouchdb%2Fworkqueue&childUrl=http%3A%2F%2Fcmssrv810.fnal.gov%3A5984&feed=continuous&style=all_docs&since=%22228-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5MSAXKMCelJRoamhkjq4Yh_Y8FiDJ0ACk_kNNmQI2xdwyKcXSOAVdTxYAg-QrmA%22&timeout=300000 200 ok 300004
[notice] 2024-05-22T09:12:34.413893Z nonode@nohost <0.9612.451> 297a8a217b 127.0.0.1:5984 127.0.0.1 undefined POST /_session 200 ok 3
[notice] 2024-05-22T09:12:34.415280Z nonode@nohost <0.9612.451> 7db9c3d007 127.0.0.1:5984 127.0.0.1 cmsdataops GET /workqueue_inbox/ 200 ok 1
[notice] 2024-05-22T09:12:34.417921Z nonode@nohost <0.9612.451> b95a24c766 127.0.0.1:5984 127.0.0.1 cmsdataops GET /workqueue_inbox/_design/WorkQueue 200 ok 2
[notice] 2024-05-22T09:12:37.013854Z nonode@nohost <0.9612.451> db20f6e965 127.0.0.1:5984 127.0.0.1 undefined POST /_session 200 ok 1
[notice] 2024-05-22T09:12:37.015289Z nonode@nohost <0.9612.451> e8c86cc2d5 127.0.0.1:5984 127.0.0.1 cmsdataops GET /wmagent_summary/ 200 ok 1
[notice] 2024-05-22T09:12:37.016550Z nonode@nohost <0.9612.451> 05095ccba5 127.0.0.1:5984 127.0.0.1 cmsdataops GET /wmagent_summary/_design/WMStatsAgent 200 ok 1
[info] 2024-05-22T09:13:04.831266Z nonode@nohost <0.249.0> -------- Preflight check: Checking For Monsters

[info] 2024-05-22T09:13:04.833432Z nonode@nohost <0.249.0> -------- Preflight check: Asserting Admin Account

[info] 2024-05-22T09:13:04.836510Z nonode@nohost <0.249.0> -------- Apache CouchDB 3.2.2 is starting.

[info] 2024-05-22T09:13:04.836584Z nonode@nohost <0.250.0> -------- Starting couch_sup
Failure to start Mochiweb: eaddrinuse
[error] 2024-05-22T09:13:04.897103Z nonode@nohost <0.339.0> -------- CRASH REPORT Process  (<0.339.0>) with 0 neighbors exited with reason: eaddrinuse at gen_server:init_it/6(line:401) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {mochiweb_socket_server,init,['Argument__1']}, ancestors: [couch_secondary_services,couch_sup,<0.249.0>], message_queue_len: 0, links: [<0.326.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 28, reductions: 42868
[error] 2024-05-22T09:13:04.897289Z nonode@nohost <0.339.0> -------- CRASH REPORT Process  (<0.339.0>) with 0 neighbors exited with reason: eaddrinuse at gen_server:init_it/6(line:401) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {mochiweb_socket_server,init,['Argument__1']}, ancestors: [couch_secondary_services,couch_sup,<0.249.0>], message_queue_len: 0, links: [<0.326.0>], dictionary: [], trap_exit: true, status: running, heap_size: 6772, stack_size: 28, reductions: 42868
[error] 2024-05-22T09:13:04.897372Z nonode@nohost <0.326.0> -------- Supervisor couch_secondary_services had child httpsd started with chttpd:start_link(https) at undefined exit with reason eaddrinuse in context start_error
[error] 2024-05-22T09:13:04.897462Z nonode@nohost <0.326.0> -------- Supervisor couch_secondary_services had child httpsd started with chttpd:start_link(https) at undefined exit with reason eaddrinuse in context start_error
[error] 2024-05-22T09:13:04.898683Z nonode@nohost <0.250.0> -------- Supervisor couch_sup had child couch_secondary_services started with couch_secondary_sup:start_link() at undefined exit with reason {shutdown,{failed_to_start_child,httpsd,eaddrinuse}} in context start_error
[error] 2024-05-22T09:13:04.898965Z nonode@nohost <0.250.0> -------- Supervisor couch_sup had child couch_secondary_services started with couch_secondary_sup:start_link() at undefined exit with reason {shutdown,{failed_to_start_child,httpsd,eaddrinuse}} in context start_error
[error] 2024-05-22T09:13:04.899907Z nonode@nohost <0.249.0> -------- Error starting Apache CouchDB:

    {error,{shutdown,{failed_to_start_child,couch_secondary_services,{shutdown,{failed_to_start_child,httpsd,eaddrinuse}}}}}


[error] 2024-05-22T09:13:04.900245Z nonode@nohost <0.248.0> -------- CRASH REPORT Process  (<0.248.0>) with 0 neighbors exited with reason: {{shutdown,{failed_to_start_child,couch_secondary_services,{shutdown,{failed_to_start_child,httpsd,eaddrinuse}}}},{couch_app,start,[normal,[]]}} at application_master:init/4(line:138) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {application_master,init,['Argument__1','Argument__2',...]}, ancestors: [<0.247.0>], message_queue_len: 1, links: [<0.247.0>,<0.16.0>], dictionary: [], trap_exit: true, status: running, heap_size: 610, stack_size: 28, reductions: 220
[error] 2024-05-22T09:13:04.900371Z nonode@nohost <0.248.0> -------- CRASH REPORT Process  (<0.248.0>) with 0 neighbors exited with reason: {{shutdown,{failed_to_start_child,couch_secondary_services,{shutdown,{failed_to_start_child,httpsd,eaddrinuse}}}},{couch_app,start,[normal,[]]}} at application_master:init/4(line:138) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {application_master,init,['Argument__1','Argument__2',...]}, ancestors: [<0.247.0>], message_queue_len: 1, links: [<0.247.0>,<0.16.0>], dictionary: [], trap_exit: true, status: running, heap_size: 610, stack_size: 28, reductions: 220
[info] 2024-05-22T09:13:04.900449Z nonode@nohost <0.16.0> -------- Application couch exited with reason: {{shutdown,{failed_to_start_child,couch_secondary_services,{shutdown,{failed_to_start_child,httpsd,eaddrinuse}}}},{couch_app,start,[normal,[]]}}
[info] 2024-05-22T09:13:04.900511Z nonode@nohost <0.16.0> -------- Application couch exited with reason: {{shutdown,{failed_to_start_child,couch_secondary_services,{shutdown,{failed_to_start_child,httpsd,eaddrinuse}}}},{couch_app,start,[normal,[]]}}
{"Kernel pid terminated",application_controller,"{application_start_failure,couch,{{shutdown,{failed_to_start_child,couch_secondary_services,{shutdown,{failed_to_start_child,httpsd,eaddrinuse}}}},{couch_app,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,couch,{{shutdown,{failed_to_start_child,couch_secondary_services,{shutdown,{failed_to_start_child,httpsd,eaddrinuse}}}},{couc

Crash dump is being written to: erl_crash.dump...[notice] 2024-05-22T09:13:09.008233Z nonode@nohost <0.9923.451> 6e221510ae 127.0.0.1:5984 127.0.0.1 cmsdataops GET /workqueue/_design/WorkQueue/_view/elementsDetailByWorkflowAndStatus?reduce=false&stale=update_after 200 ok 3
[notice] 2024-05-22T09:13:27.178022Z nonode@nohost <0.9272.436> bf96027783 127.0.0.1:5984 127.0.0.1 cmsdataops GET /wmagent_summary/_changes?filter=WMStatsAgent%2Frepfilter&feed=continuous&style=all_docs&since=%2227-g1AAAACLeJzLYWBgYMpgTmHgzcvPy09JdcjLz8gvLskBCScyJNX___8_K4M5kT0XKMCemGyamGqciq4Yh_Y8FiDJ0ACk_kNNEQGbkmZskppmaISuJwsAen0rJw%22&timeout=300000 200 ok 300003
[notice] 2024-05-22T09:13:35.421756Z nonode@nohost <0.9612.451> 06b1fb622d 127.0.0.1:5984 127.0.0.1 undefined POST /_session 200 ok 1
[notice] 2024-05-22T09:13:35.423044Z nonode@nohost <0.9612.451> 6d5d2a3f69 127.0.0.1:5984 127.0.0.1 cmsdataops GET /workqueue_inbox/ 200 ok 1
[notice] 2024-05-22T09:13:35.425577Z nonode@nohost <0.9612.451> de6e5e51bc 127.0.0.1:5984 127.0.0.1 cmsdataops GET /workqueue_inbox/_design/WorkQueue 200 ok 2

When restarting CouchDB, there appears to be a crash. Is this relevant for us?
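
For what it is worth, the eaddrinuse error suggests that the old CouchDB instance (its Erlang VM) is still holding a port when the new one starts. A minimal way to check, assuming the default ports 5984/6984 and the standard process name:

  ss -ltnp | grep -E ':(5984|6984)'    # who is listening on the CouchDB ports
  pgrep -af beam.smp                   # a surviving Erlang VM would show up here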

anpicci • May 22 '24

Thanks for reporting this, @anpicci. This issue was fixed yesterday or the day before, but the image has yet to be rebuilt. We have other changes in the pipeline, so I am waiting for those to be ready and reviewed before building a new image.

UPDATE: here is the pull request fixing it: https://github.com/dmwm/CMSKubernetes/pull/1484

amaltaro • May 22 '24

We have covered all of the items in the checklist except for the cron setup, which is being addressed and tested in https://github.com/dmwm/WMCore/issues/12000. Thanks to Andrea and everyone else who helped with this validation!

amaltaro • Jun 05 '24