DeepSea

Stage 3 sometimes hangs in the CI

smithfarm opened this issue • 43 comments

Description of Issue/Question

In the master branch CI, we are seeing frequent, but transient, Stage 3 hangs/timeouts. They do not seem to be associated with any particular test.

2018-09-08T09:28:48.937 INFO:teuthology.orchestra.run.target167114239053.stderr:+ echo 'WWWW: tuned will NOT be deployed'
2018-09-08T09:28:48.937 INFO:teuthology.orchestra.run.target167114239053.stderr:+ _disable_tuned
2018-09-08T09:28:48.937 INFO:teuthology.orchestra.run.target167114239053.stderr:+ local prefix=/srv/salt/ceph/tuned
2018-09-08T09:28:48.937 INFO:teuthology.orchestra.run.target167114239053.stderr:+ mv /srv/salt/ceph/tuned/mgr/default.sls /srv/salt/ceph/tuned/mgr/default.sls-MOVED
2018-09-08T09:28:48.938 INFO:teuthology.orchestra.run.target167114239053.stderr:+ mv /srv/salt/ceph/tuned/mon/default.sls /srv/salt/ceph/tuned/mon/default.sls-MOVED
2018-09-08T09:28:48.939 INFO:teuthology.orchestra.run.target167114239053.stderr:+ mv /srv/salt/ceph/tuned/osd/default.sls /srv/salt/ceph/tuned/osd/default.sls-MOVED
2018-09-08T09:28:48.940 INFO:teuthology.orchestra.run.target167114239053.stderr:+ mv /srv/salt/ceph/tuned/mgr/default-off.sls /srv/salt/ceph/tuned/mgr/default.sls
2018-09-08T09:28:48.942 INFO:teuthology.orchestra.run.target167114239053.stderr:+ mv /srv/salt/ceph/tuned/mon/default-off.sls /srv/salt/ceph/tuned/mon/default.sls
2018-09-08T09:28:48.943 INFO:teuthology.orchestra.run.target167114239053.stderr:+ mv /srv/salt/ceph/tuned/osd/default-off.sls /srv/salt/ceph/tuned/osd/default.sls
2018-09-08T09:28:48.944 INFO:teuthology.orchestra.run.target167114239053.stdout:
2018-09-08T09:28:48.945 INFO:teuthology.orchestra.run.target167114239053.stdout:*********************************************
2018-09-08T09:28:48.945 INFO:teuthology.orchestra.run.target167114239053.stdout:********** Running DeepSea Stage 3 **********
2018-09-08T09:28:48.945 INFO:teuthology.orchestra.run.target167114239053.stdout:*********************************************
2018-09-08T09:28:48.945 INFO:teuthology.orchestra.run.target167114239053.stderr:+ _run_stage 3 ''
2018-09-08T09:28:48.945 INFO:teuthology.orchestra.run.target167114239053.stderr:+ local stage_num=3
2018-09-08T09:28:48.945 INFO:teuthology.orchestra.run.target167114239053.stderr:+ set +x
2018-09-08T09:28:48.946 INFO:teuthology.orchestra.run.target167114239053.stderr:+ timeout 60m salt-run --no-color state.orch ceph.stage.3
2018-09-08T09:28:48.946 INFO:teuthology.orchestra.run.target167114239053.stderr:+ tee /tmp/stage.3.log
2018-09-08T10:28:48.957 INFO:teuthology.orchestra.run.target167114239053.stderr:+ local exit_status=124
2018-09-08T10:28:48.958 INFO:teuthology.orchestra.run.target167114239053.stderr:+ set +x
2018-09-08T10:28:48.958 INFO:teuthology.orchestra.run.target167114239053.stdout:WWWW
2018-09-08T10:28:48.958 INFO:teuthology.orchestra.run.target167114239053.stdout:Stage 3 timed out after 60m

(There might be a bug in the tuned default-off.sls files, or I might not be activating them correctly, though.)
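For reference, the hang detection above boils down to wrapping the stage in timeout(1) and treating exit status 124 as a timeout. A minimal sketch of that pattern (illustrative only, not the exact harness code; the PIPESTATUS handling is an assumption):

_run_stage () {
    local stage_num="$1"
    # timeout(1) kills salt-run after 60 minutes and makes it exit with status 124
    timeout 60m salt-run --no-color state.orch "ceph.stage.${stage_num}" | tee "/tmp/stage.${stage_num}.log"
    local exit_status="${PIPESTATUS[0]}"   # status of timeout/salt-run, not of tee
    if [ "$exit_status" -eq 124 ] ; then
        echo "Stage ${stage_num} timed out after 60m"
    fi
    return "$exit_status"
}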

Setup


I think I've seen it happen in both SES5 and master. I'll start posting links to logs here:

  • http://ceph-ci.suse.de:8080/job/deepsea-pr/148/testReport/junit/teuthology/suse_tier1/functional__clusters_1node4disks_yaml_distros_sle_15_yaml_exec_functional_yaml_tasks_yaml_/attachments/teuthology-3386.log
  • http://ceph-ci.suse.de:8080/job/deepsea-pr/151/testReport/junit/teuthology/suse_tier1/functional__clusters_1node4disks_yaml_distros_sle_15_yaml_exec_functional_yaml_tasks_yaml_/attachments/teuthology-3406.log (re-run of the same PR)

smithfarm avatar Sep 08 '18 17:09 smithfarm

I've been running a lot of tests today in my private teuthology cluster in OVH region GRA1, and I'm not seeing this failure there at all.

The only Stage 3 timeout I saw was in a --cli run. SSH connection to Salt Master was lost, and pulpito shows the job as "dead".

http://149.202.174.223/ubuntu-2018-09-09_09:00:31-suse:tier2:openattic-ses5---basic-openstack/170/teuthology.log

But that only happened once, so I'm chalking it up to an anomaly.

smithfarm avatar Sep 09 '18 09:09 smithfarm

It happens in my private GRA1 teuthology instance, too:

  • http://149.202.174.223:8081/ubuntu-2018-09-11_10:26:43-suse:tier1-ses5---basic-openstack/287/

smithfarm avatar Sep 11 '18 11:09 smithfarm

Happening in the SES6 Staging CI: http://10.86.0.53:8081/ubuntu-2018-09-14_22:18:46-suse-ses6---basic-openstack/

smithfarm avatar Sep 14 '18 23:09 smithfarm

Dropping priority because it doesn't happen on every CI run.

smithfarm avatar Sep 15 '18 07:09 smithfarm

Happening in DeepSea master PR CI: http://ceph-ci.suse.de:8080/job/deepsea-pr/170/testReport/teuthology/deepsea_tier2_health-ok/ceph_cm_salt_yaml_cluster__1node_yaml_4disks_yaml__deepsea_yaml_distros_sle_15_0_yaml_tasks_custom_profile_yaml/

smithfarm avatar Sep 25 '18 08:09 smithfarm

AFAICT this is not currently happening in SES5. Only in master/SES6.

smithfarm avatar Oct 01 '18 09:10 smithfarm

After Stage 0 completes, zypper ps -s shows that systemd-udevd has been updated. It's not inconceivable that an updated, but not restarted, systemd-udevd might cause issues with ceph-disk, and that might cause Stage 3 to hang.

It might or might not fix this, but either way it's worth updating the underlying image to rule this possibility out.
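A quick way to check for that condition on a node, and clear it, might be the following (a sketch; zypper ps -s is the same command mentioned above):

# list services/processes that are still using deleted (i.e. since-updated) files
zypper ps -s
# if systemd-udevd shows up, restart it so the updated binary is actually in use
systemctl restart systemd-udevd.service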

smithfarm avatar Oct 02 '18 10:10 smithfarm

The journalctl -r | head -n 2000 that runs after the failure shows that, every 5 minutes, systemd is restarting (or something similar) and, every minute, node_exporter is logging two error messages:

2018-10-04T11:55:27.085 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:52:58 target192168000053.teuthology node_exporter[16187]: time="2018-10-04T11:52:58Z" level=error msg="ERROR: ntp collector failed after 0.001409s: couldn't get SNTP reply: read udp 127.0.0.1:42810->127.0.0.1:123: read: connection refused" source="collector.go:123"
2018-10-04T11:55:27.085 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:52:58 target192168000053.teuthology node_exporter[16187]: time="2018-10-04T11:52:58Z" level=error msg="Error on statfs() system call for \"/var/lib/ceph/osd/ceph-0\": permission denied" source="filesystem_linux.go:57"

The systemd restarts (every 5 minutes) look like this:

2018-10-04T11:55:27.174 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[1]: Removed slice User Slice of root.
2018-10-04T11:55:27.174 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[1]: Stopped User Manager for UID 0.
2018-10-04T11:55:27.174 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[33823]: pam_unix(systemd-user:session): session closed for user root
2018-10-04T11:55:27.174 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[33822]: Received SIGRTMIN+24 from PID 33862 (kill).
2018-10-04T11:55:27.175 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[33822]: Starting Exit the Session...
2018-10-04T11:55:27.175 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[33822]: Reached target Shutdown.
2018-10-04T11:55:27.175 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[33822]: Closed D-Bus User Message Bus Socket.
2018-10-04T11:55:27.175 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[33822]: Stopped target Timers.
2018-10-04T11:55:27.175 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[33822]: Stopped target Sockets.
2018-10-04T11:55:27.175 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[33822]: Stopped target Paths.
2018-10-04T11:55:27.176 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[33822]: Stopped target Basic System.
2018-10-04T11:55:27.176 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[33822]: Stopped target Default.
2018-10-04T11:55:27.176 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology systemd[1]: Stopping User Manager for UID 0...
2018-10-04T11:55:27.176 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:02 target192168000053.teuthology CRON[33821]: pam_unix(crond:session): session closed for user root
2018-10-04T11:55:27.176 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology CRON[33828]: (root) CMD (/var/lib/prometheus/node-exporter/rbd.sh > /var/lib/prometheus/node-exporter/rbd.prom 2> /dev/null)
2018-10-04T11:55:27.176 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology cron[33821]: pam_unix(crond:session): session opened for user root by (uid=0)
2018-10-04T11:55:27.177 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[1]: Started User Manager for UID 0.
2018-10-04T11:55:27.177 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[33822]: Startup finished in 29ms.
2018-10-04T11:55:27.177 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[33822]: Reached target Default.
2018-10-04T11:55:27.177 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[33822]: Reached target Basic System.
2018-10-04T11:55:27.177 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[33822]: Reached target Sockets.
2018-10-04T11:55:27.177 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[33822]: Listening on D-Bus User Message Bus Socket.
2018-10-04T11:55:27.178 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[33822]: Reached target Timers.
2018-10-04T11:55:27.178 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[33822]: Reached target Paths.
2018-10-04T11:55:27.178 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[33822]: Starting D-Bus User Message Bus Socket.
2018-10-04T11:55:27.178 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[33822]: pam_unix(systemd-user:session): session opened for user root by (uid=0)
2018-10-04T11:55:27.178 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[1]: Started Session 17 of user root.
2018-10-04T11:55:27.178 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[1]: Starting User Manager for UID 0...
2018-10-04T11:55:27.178 INFO:tasks.workunit.client.salt_master.target192168000053.stdout:Oct 04 11:50:01 target192168000053.teuthology systemd[1]: Created slice User Slice of root.

Please note that these messages are in reverse chronological order because of journalctl -r.
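To read just those entries in chronological order, something like this on the affected node should work (the -t identifier and the grep patterns are taken from the messages above):

# node_exporter errors, oldest first
journalctl -t node_exporter --no-pager | grep -i error
# the recurring user-session churn logged by systemd (PID 1)
journalctl _PID=1 --no-pager | grep -E 'User Manager|User Slice'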

smithfarm avatar Oct 04 '18 16:10 smithfarm

@jan--f This failure is happening pretty often in SES6 - see, e.g., http://10.86.0.135/ubuntu-2018-10-04_10:40:51-suse:tier0-ses6---basic-openstack/10/teuthology.log - do you think you could take a look at the journalctl -r output (just grep for that) and see what you make of it? Looks like systemd is somehow stuck repeating the same operations over and over again.

smithfarm avatar Oct 04 '18 16:10 smithfarm

This seems to happen quite often when using salt-run (i.e. in non-CLI tests).

I am in the process of migrating the test cases to a new YAML structure, and while doing that I will add more CLI test cases in addition to the non-CLI ones.

Also, merging #1435 should enable us to pinpoint where in Stage 3 the hang occurs.

smithfarm avatar Oct 23 '18 20:10 smithfarm

I finally reproduced the issue with the CLI (and without the buffering issue preventing proper logging of the Stage 3 output).

DeepSea is installed from source:

2018-10-23T21:23:14.054 DEBUG:tasks.deepsea:beginning of deepsea task begin method
2018-10-23T21:23:14.054 INFO:teuthology.orchestra.run.target192168000019:Running: 'rpm -q deepsea'
2018-10-23T21:23:14.094 INFO:teuthology.orchestra.run.target192168000019.stdout:package deepsea is not installed
2018-10-23T21:23:14.095 INFO:teuthology.task:Installing DeepSea from source - repo: https://github.com/smithfarm/DeepSea.git, branch: wip-stdout-flush
2018-10-23T21:23:14.096 INFO:teuthology.orchestra.run.target192168000019:Running: 'git --version ; git clone --branch wip-stdout-flush https://github.com/smithfarm/DeepSea.git ; cd DeepSea ; git rev-parse --abbrev-ref HEAD ; git rev-parse HEAD ; git describe || true'
2018-10-23T21:23:14.258 INFO:teuthology.orchestra.run.target192168000019.stdout:git version 2.16.4
2018-10-23T21:23:14.260 INFO:teuthology.orchestra.run.target192168000019.stderr:Cloning into 'DeepSea'...
2018-10-23T21:23:16.195 INFO:teuthology.orchestra.run.target192168000019.stdout:wip-stdout-flush
2018-10-23T21:23:16.197 INFO:teuthology.orchestra.run.target192168000019.stdout:e9dafcfda656348ad7080d127e3f6a5990a2bf9b
2018-10-23T21:23:16.206 INFO:teuthology.orchestra.run.target192168000019.stdout:v0.9.6-229-ge9dafcfd
2018-10-23T21:23:16.208 INFO:teuthology.task:Running "make install" in DeepSea clone...

The SHA1 is e9dafcfda656348ad7080d127e3f6a5990a2bf9b which is today's master branch plus the commit from #1435. (Don't get confused by the git describe output saying 0.9.6 - I just forgot to push the v0.9.7 tag to smithfarm/DeepSea.git.)

The cluster is single-node, with roles:

2018-10-23T21:29:07.641 INFO:teuthology.orchestra.run.target192168000019.stdout:    roles:
2018-10-23T21:29:07.641 INFO:teuthology.orchestra.run.target192168000019.stdout:        - master
2018-10-23T21:29:07.641 INFO:teuthology.orchestra.run.target192168000019.stdout:        - admin
2018-10-23T21:29:07.641 INFO:teuthology.orchestra.run.target192168000019.stdout:        - mon
2018-10-23T21:29:07.642 INFO:teuthology.orchestra.run.target192168000019.stdout:        - mgr
2018-10-23T21:29:07.642 INFO:teuthology.orchestra.run.target192168000019.stdout:        - mds
2018-10-23T21:29:07.642 INFO:teuthology.orchestra.run.target192168000019.stdout:        - rgw
2018-10-23T21:29:07.642 INFO:teuthology.orchestra.run.target192168000019.stdout:        - storage

Stage 3 is run with timeout 60m, and fails like this:

2018-10-23T21:29:08.180 INFO:teuthology.task:deepsea_deploy: WWWW: running Stage 3
2018-10-23T21:29:08.181 INFO:teuthology.orchestra.run.target192168000019:Running: "sudo bash -c 'DEV_ENV=true timeout 60m deepsea --log-file=/var/log/salt/deepsea.log --log-level=debug stage run ceph.stage.3 --simple-output'"
2018-10-23T21:29:08.997 INFO:teuthology.orchestra.run.target192168000019.stdout:Starting orchestration: ceph.stage.3
2018-10-23T21:29:43.327 INFO:teuthology.orchestra.run.target192168000019.stdout:Parsing orchestration ceph.stage.3 steps... done
2018-10-23T21:29:43.328 INFO:teuthology.orchestra.run.target192168000019.stdout:
2018-10-23T21:29:43.329 INFO:teuthology.orchestra.run.target192168000019.stdout:Stage initialization output:
2018-10-23T21:29:43.329 INFO:teuthology.orchestra.run.target192168000019.stdout:firewall                 : disabled
2018-10-23T21:29:43.329 INFO:teuthology.orchestra.run.target192168000019.stdout:apparmor                 : disabled
2018-10-23T21:29:43.329 INFO:teuthology.orchestra.run.target192168000019.stdout:DEV_ENV                  : True
2018-10-23T21:29:43.330 INFO:teuthology.orchestra.run.target192168000019.stdout:fsid                     : valid
2018-10-23T21:29:43.330 INFO:teuthology.orchestra.run.target192168000019.stdout:public_network           : valid
2018-10-23T21:29:43.330 INFO:teuthology.orchestra.run.target192168000019.stdout:cluster_network          : valid
2018-10-23T21:29:43.330 INFO:teuthology.orchestra.run.target192168000019.stdout:cluster_interface        : valid
2018-10-23T21:29:43.330 INFO:teuthology.orchestra.run.target192168000019.stdout:monitors                 : valid
2018-10-23T21:29:43.331 INFO:teuthology.orchestra.run.target192168000019.stdout:mgrs                     : valid
2018-10-23T21:29:43.331 INFO:teuthology.orchestra.run.target192168000019.stdout:storage                  : valid
2018-10-23T21:29:43.331 INFO:teuthology.orchestra.run.target192168000019.stdout:ganesha                  : valid
2018-10-23T21:29:43.331 INFO:teuthology.orchestra.run.target192168000019.stdout:master_role              : valid
2018-10-23T21:29:43.331 INFO:teuthology.orchestra.run.target192168000019.stdout:time_server              : valid
2018-10-23T21:29:43.331 INFO:teuthology.orchestra.run.target192168000019.stdout:fqdn                     : valid
2018-10-23T21:29:43.332 INFO:teuthology.orchestra.run.target192168000019.stdout:
2018-10-23T21:29:44.260 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner ready.check...
2018-10-23T21:29:44.261 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  cmd.shell(/usr/sbin/iptables -S)...
2018-10-23T21:29:44.324 INFO:teuthology.orchestra.run.target192168000019.stdout:               in target192168000019.teuthology... ok
2018-10-23T21:29:44.433 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  cmd.shell(/usr/sbin/aa-status --enabled 2>/dev/null; echo $?)...
2018-10-23T21:29:44.492 INFO:teuthology.orchestra.run.target192168000019.stdout:               in target192168000019.teuthology... ok
2018-10-23T21:29:44.601 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  state.apply(ceph.apparmor.profiles, __kwarg__=True, test=true)...
2018-10-23T21:29:45.034 INFO:teuthology.orchestra.run.target192168000019.stdout:               in target192168000019.teuthology... ok
2018-10-23T21:29:45.145 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner ready.check... ok
2018-10-23T21:29:47.896 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner changed.mon... ok
2018-10-23T21:29:48.520 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner changed.osd... ok
2018-10-23T21:29:49.109 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner changed.mgr... ok
2018-10-23T21:29:49.951 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner changed.config... ok
2018-10-23T21:29:50.532 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner changed.client... ok
2018-10-23T21:29:51.114 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner changed.global... ok
2018-10-23T21:29:51.698 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner changed.mds... ok
2018-10-23T21:29:52.274 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner changed.igw... ok
2018-10-23T21:29:53.018 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner select.minions... ok
2018-10-23T21:29:53.622 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner select.minions... ok
2018-10-23T21:29:54.220 INFO:teuthology.orchestra.run.target192168000019.stdout:[init] Executing runner select.minions... ok
2018-10-23T21:29:54.287 INFO:teuthology.orchestra.run.target192168000019.stdout:
2018-10-23T21:29:54.288 INFO:teuthology.orchestra.run.target192168000019.stdout:[1/29] Executing state ceph.time...
2018-10-23T21:29:54.698 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:29:54.856 INFO:teuthology.orchestra.run.target192168000019.stdout:[2/29] Executing state ceph.packages...
2018-10-23T21:30:41.578 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  target192168000019.teuthology: ceph ok
2018-10-23T21:30:41.591 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:30:41.782 INFO:teuthology.orchestra.run.target192168000019.stdout:[3/29] Executing state ceph.configuration.check...
2018-10-23T21:30:42.565 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:30:42.728 INFO:teuthology.orchestra.run.target192168000019.stdout:[4/29] Executing state ceph.configuration.create...
2018-10-23T21:30:44.114 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  select.minions(cluster=ceph, roles=mon, host=True)... ok
2018-10-23T21:30:44.481 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  select.public_addresses(cluster=ceph, roles=mon)...
2018-10-23T21:30:44.481 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  public.address()...
2018-10-23T21:30:44.551 INFO:teuthology.orchestra.run.target192168000019.stdout:               in target192168000019.teuthology... ok
2018-10-23T21:30:44.658 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  select.public_addresses(cluster=ceph, roles=mon)... ok
2018-10-23T21:30:45.798 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  select.from(pillar=rgw_configurations, role=rgw, attr=host, fqdn)... ok
2018-10-23T21:30:45.881 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  target192168000019.teuthology: /srv/salt/ceph/configuration/cache/ceph.conf ok
2018-10-23T21:30:45.953 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:30:46.116 INFO:teuthology.orchestra.run.target192168000019.stdout:[5/29] Executing state ceph.configuration...
2018-10-23T21:30:46.596 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  target192168000019.teuthology: /etc/ceph/ceph.conf ok
2018-10-23T21:30:46.608 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:30:46.769 INFO:teuthology.orchestra.run.target192168000019.stdout:[6/29] Executing state ceph.admin...
2018-10-23T21:30:47.208 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:30:47.366 INFO:teuthology.orchestra.run.target192168000019.stdout:[7/29] Executing state ceph.mgr.keyring...
2018-10-23T21:30:47.850 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  target192168000019.teuthology: /var/lib/ceph/mgr/ceph-target192168000019/keyring ok
2018-10-23T21:30:47.862 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:30:48.022 INFO:teuthology.orchestra.run.target192168000019.stdout:[8/29] Executing state ceph.mon...
2018-10-23T21:30:48.464 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  target192168000019.teuthology: /var/lib/ceph/tmp/keyring.mon ok
2018-10-23T21:30:56.314 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  target192168000019.teuthology: wait for mon(cephprocesses.wait) ok
2018-10-23T21:30:56.328 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:30:56.514 INFO:teuthology.orchestra.run.target192168000019.stdout:[9/29] Executing state ceph.mgr.auth...
2018-10-23T21:30:57.717 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  select.minions(cluster=ceph, roles=mgr, host=True)... ok
2018-10-23T21:30:58.142 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:30:58.307 INFO:teuthology.orchestra.run.target192168000019.stdout:[10/29] Executing state ceph.mgr...
2018-10-23T21:30:58.833 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  target192168000019.teuthology: /var/lib/ceph/mgr/ceph-target192168000019/keyring ok
2018-10-23T21:31:06.405 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  target192168000019.teuthology: wait for mgr(cephprocesses.wait) ok
2018-10-23T21:31:06.422 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:31:06.608 INFO:teuthology.orchestra.run.target192168000019.stdout:[11/29] Executing state ceph.monitoring.prometheus.exporters.ceph_exporter...
2018-10-23T21:31:22.969 INFO:teuthology.orchestra.run.target192168000019.stdout:         |_  target192168000019.teuthology: ceph exporter package(golang-github-digitalocean-ceph_exporter) ok
2018-10-23T21:31:30.801 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:31:30.977 INFO:teuthology.orchestra.run.target192168000019.stdout:[12/29] Executing state ceph.monitoring.prometheus.exporters.rbd_exporter...
2018-10-23T21:31:43.493 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:31:43.656 INFO:teuthology.orchestra.run.target192168000019.stdout:[13/29] Executing state ceph.osd.auth...
2018-10-23T21:31:44.760 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:31:44.920 INFO:teuthology.orchestra.run.target192168000019.stdout:[14/29] Executing state ceph.sysctl...
2018-10-23T21:31:45.359 INFO:teuthology.orchestra.run.target192168000019.stdout:         target192168000019.teuthology... ok
2018-10-23T21:31:45.519 INFO:teuthology.orchestra.run.target192168000019.stdout:[15/29] Executing state ceph.osd...
2018-10-23T22:29:08.325 ERROR:teuthology.task:deepsea_deploy: WWWW: Stage 3 failed. Here comes journalctl!
2018-10-23T22:29:08.326 INFO:teuthology.orchestra.run.target192168000019:Running: 'sudo journalctl --all'
2018-10-23T22:29:08.410 INFO:teuthology.orchestra.run.target192168000019.stdout:-- Logs begin at Tue 2018-10-23 21:19:06 UTC, end at Tue 2018-10-23 22:29:08 UTC. --

The full teuthology.log, including full post-Stage 3 journalctl, is here: https://gist.github.com/smithfarm/2005f109fa481ba988c842a1a5aee728

The deepsea CLI and salt logs are here: http://10.86.0.135/ubuntu-2018-10-23_21:18:30-deepsea:tier1-wip-qa-port-tier1---basic-openstack/157/

Should that link disappear, I have them saved on my laptop in the directory ~/scratch/issue1382.
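For the next occurrence, a few commands that might be worth capturing while the stage is still stuck (just a suggestion, assuming the usual salt/ceph-disk deployment path; the task does not do this today):

# what Salt thinks is still running
salt-run jobs.active
salt '*' saltutil.running
# whether ceph-disk or anything udev-related is wedged on the storage node
ps axf | grep -E 'ceph-disk|udev' | grep -v grep
# the last few minutes of the journal around the hang
journalctl --since '10 minutes ago' --no-pager | tail -n 200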

smithfarm avatar Oct 24 '18 04:10 smithfarm

It's just a hunch, but the issue might be related to the presence of mds and rgw roles on the node. I have run Stages 0-3 quite a few times on single-node clusters with just a storage role (no mds, no rgw) and so far Stage 3 always completes in that configuration. I will add that test to deepsea/tier1 so we can have the CI running it together with the one that exhibits the bug.

smithfarm avatar Oct 24 '18 04:10 smithfarm

The issue started happening in SES5 today - at least, this is the first time I've seen it there. The job was testing the tip of the SES5 branch, which would indicate https://github.com/SUSE/DeepSea/pull/1432 might be the culprit. (The hang happens when executing state ceph.osd, and that pull request is touching the osd.py module...)

log: https://ci.suse.de/job/storage-run-teuthology-5.0/lastCompletedBuild/testReport/teuthology/Devel/suse_tier2_functional__clusters_1node4disks_yaml_distros_sle_12_3_yaml_exec_functional_cli_yaml_tasks_yaml_/attachments/teuthology-suse-93.log

2018-10-24T08:55:49.333 INFO:teuthology.orchestra.run.target158069067116.stderr:+ deepsea --version
2018-10-24T08:55:49.609 INFO:teuthology.orchestra.run.target158069067116.stdout:deepsea 0.8.7+4.gb456a7da

smithfarm avatar Oct 24 '18 20:10 smithfarm

log: https://ci.suse.de/job/storage-run-teuthology-5.0/lastCompletedBuild/testReport/teuthology/Devel/suse_tier2_functional__clusters_1node4disks_yaml_distros_sle_12_3_yaml_exec_functional_cli_yaml_tasks_yaml_/attachments/teuthology-suse-93.log

I get a 404 here

jschmid1 avatar Oct 25 '18 08:10 jschmid1

@jschmid1 Sorry about that. Please try this one: https://ci.suse.de/user/storage/my-views/view/Monitor/job/storage-deepsea-5.0/66/testReport/teuthology/Devel/suse_tier2_functional__clusters_1node4disks_yaml_distros_sle_12_3_yaml_exec_functional_cli_yaml_tasks_yaml_/attachments/teuthology-suse-93.log

smithfarm avatar Oct 25 '18 08:10 smithfarm

I just learned that the teuthology environment used by Jenkins has a custom nameserver. That probably explains why I have not been able to reproduce this in my ad hoc teuthology runs.

(I did get Stage 3 to time out once - see https://github.com/SUSE/DeepSea/issues/1382#issuecomment-432507818 - but that's once in over a hundred runs, while in the Jenkins CI it happens much more often. Given that the Jenkins CI has a custom nameserver, I tend to think that is a factor, and that the timeout I saw in my ad hoc teuthology environment was due to some other anomaly.)
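If the nameserver is a factor, a crude check on a target node would be to look at the resolver configuration and time a few lookups (the name "salt" below is just the conventional master alias minions resolve, an assumption on my part):

cat /etc/resolv.conf
# a slow or failing lookup here would point at the resolver
time getent hosts "$(hostname -f)"
time getent hosts salt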

smithfarm avatar Oct 29 '18 22:10 smithfarm

@smithfarm The hanging Stage 3 issue is not related to the custom nameserver, since it is running on ECP, which is set up via the standard method.

kshtsk avatar Oct 29 '18 22:10 kshtsk

We are no longer seeing the OSD-related Stage 3 timeout, and the timeouts appear to be less frequent. Today we started to see this:

2018-11-09T09:08:25.472 INFO:tasks.deepsea.orch:WWWW: Running DeepSea Stage 3 (deploy)
2018-11-09T09:08:25.472 INFO:teuthology.orchestra.run.target192168000063:Running: "sudo bash -c 'DEV_ENV=true timeout 60m deepsea --log-file=/var/log/salt/deepsea.log --log-level=debug stage run ceph.stage.3 --simple-output'"
2018-11-09T09:08:26.612 INFO:teuthology.orchestra.run.target192168000063.stdout:Starting orchestration: ceph.stage.3
2018-11-09T09:09:20.038 INFO:teuthology.orchestra.run.target192168000063.stdout:Parsing orchestration ceph.stage.3 steps... done
2018-11-09T09:09:20.039 INFO:teuthology.orchestra.run.target192168000063.stdout:
2018-11-09T09:09:20.040 INFO:teuthology.orchestra.run.target192168000063.stdout:Stage initialization output:
2018-11-09T09:09:20.040 INFO:teuthology.orchestra.run.target192168000063.stdout:firewall                 : disabled
2018-11-09T09:09:20.040 INFO:teuthology.orchestra.run.target192168000063.stdout:apparmor                 : disabled
2018-11-09T09:09:20.040 INFO:teuthology.orchestra.run.target192168000063.stdout:DEV_ENV                  : True
2018-11-09T09:09:20.041 INFO:teuthology.orchestra.run.target192168000063.stdout:fsid                     : valid
2018-11-09T09:09:20.041 INFO:teuthology.orchestra.run.target192168000063.stdout:public_network           : valid
2018-11-09T09:09:20.041 INFO:teuthology.orchestra.run.target192168000063.stdout:cluster_network          : valid
2018-11-09T09:09:20.041 INFO:teuthology.orchestra.run.target192168000063.stdout:cluster_interface        : valid
2018-11-09T09:09:20.041 INFO:teuthology.orchestra.run.target192168000063.stdout:monitors                 : valid
2018-11-09T09:09:20.042 INFO:teuthology.orchestra.run.target192168000063.stdout:mgrs                     : valid
2018-11-09T09:09:20.042 INFO:teuthology.orchestra.run.target192168000063.stdout:storage                  : valid
2018-11-09T09:09:20.042 INFO:teuthology.orchestra.run.target192168000063.stdout:ganesha                  : valid
2018-11-09T09:09:20.042 INFO:teuthology.orchestra.run.target192168000063.stdout:master_role              : valid
2018-11-09T09:09:20.043 INFO:teuthology.orchestra.run.target192168000063.stdout:time_server              : valid
2018-11-09T09:09:20.043 INFO:teuthology.orchestra.run.target192168000063.stdout:fqdn                     : valid
2018-11-09T09:09:20.043 INFO:teuthology.orchestra.run.target192168000063.stdout:No minions matched the target. No command was sent, no jid was assigned.
2018-11-09T09:09:20.043 INFO:teuthology.orchestra.run.target192168000063.stdout:
2018-11-09T09:09:24.752 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner ready.check...
2018-11-09T09:09:24.752 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  cmd.shell(/usr/sbin/iptables -S)...
2018-11-09T09:09:24.857 INFO:teuthology.orchestra.run.target192168000063.stdout:               in target192168000030.teuthology... ok
2018-11-09T09:09:24.881 INFO:teuthology.orchestra.run.target192168000063.stdout:               in target192168000063.teuthology... ok
2018-11-09T09:09:25.000 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  cmd.shell(/usr/sbin/aa-status --enabled 2>/dev/null; echo $?)...
2018-11-09T09:09:25.089 INFO:teuthology.orchestra.run.target192168000063.stdout:               in target192168000030.teuthology... ok
2018-11-09T09:09:25.110 INFO:teuthology.orchestra.run.target192168000063.stdout:               in target192168000063.teuthology... ok
2018-11-09T09:09:25.224 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  state.apply(ceph.apparmor.profiles, __kwarg__=True, test=true)...
2018-11-09T09:09:25.808 INFO:teuthology.orchestra.run.target192168000063.stdout:               in target192168000030.teuthology... ok
2018-11-09T09:09:25.878 INFO:teuthology.orchestra.run.target192168000063.stdout:               in target192168000063.teuthology... ok
2018-11-09T09:09:25.992 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner ready.check... ok
2018-11-09T09:09:30.436 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner changed.mon... ok
2018-11-09T09:09:31.318 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner changed.osd... ok
2018-11-09T09:09:32.214 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner changed.mgr... ok
2018-11-09T09:09:33.542 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner changed.config... ok
2018-11-09T09:09:34.651 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner changed.client... ok
2018-11-09T09:09:35.562 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner changed.global... ok
2018-11-09T09:09:36.369 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner changed.mds... ok
2018-11-09T09:09:37.280 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner changed.igw... ok
2018-11-09T09:09:38.258 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner select.minions... ok
2018-11-09T09:09:39.230 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner select.minions... ok
2018-11-09T09:09:40.221 INFO:teuthology.orchestra.run.target192168000063.stdout:[init] Executing runner select.minions... ok
2018-11-09T09:09:40.311 INFO:teuthology.orchestra.run.target192168000063.stdout:
2018-11-09T09:09:40.311 INFO:teuthology.orchestra.run.target192168000063.stdout:[1/29] Executing state ceph.time...
2018-11-09T09:09:41.022 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:09:50.128 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000030.teuthology: stop ntpd(ntpd) ok
2018-11-09T09:09:56.530 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000030.teuthology: install chrony ok
2018-11-09T09:09:57.408 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000030.teuthology: /etc/chrony.conf ok
2018-11-09T09:09:57.503 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000030.teuthology: start chronyd(chronyd) ok
2018-11-09T09:09:57.572 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000030.teuthology... ok
2018-11-09T09:09:57.792 INFO:teuthology.orchestra.run.target192168000063.stdout:[2/29] Executing state ceph.packages...
2018-11-09T09:10:57.412 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000030.teuthology: ceph ok
2018-11-09T09:10:57.629 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000030.teuthology... ok
2018-11-09T09:11:34.644 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000063.teuthology: ceph ok
2018-11-09T09:11:34.683 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:11:34.832 INFO:teuthology.orchestra.run.target192168000063.stdout:[3/29] Executing state ceph.configuration.check...
2018-11-09T09:11:36.136 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:11:36.324 INFO:teuthology.orchestra.run.target192168000063.stdout:[4/29] Executing state ceph.configuration.create...
2018-11-09T09:11:38.318 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  select.minions(cluster=ceph, roles=mon, host=True)... ok
2018-11-09T09:11:38.838 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  select.public_addresses(cluster=ceph, roles=mon)...
2018-11-09T09:11:38.839 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  public.address()...
2018-11-09T09:11:38.964 INFO:teuthology.orchestra.run.target192168000063.stdout:               in target192168000063.teuthology... ok
2018-11-09T09:11:39.079 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  select.public_addresses(cluster=ceph, roles=mon)... ok
2018-11-09T09:11:40.072 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  select.from(pillar=rgw_configurations, role=rgw, attr=host, fqdn)... ok
2018-11-09T09:11:40.187 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000063.teuthology: /srv/salt/ceph/configuration/cache/ceph.conf ok
2018-11-09T09:11:40.311 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:11:40.497 INFO:teuthology.orchestra.run.target192168000063.stdout:[5/29] Executing state ceph.configuration...
2018-11-09T09:11:41.133 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000030.teuthology: /etc/ceph/ceph.conf ok
2018-11-09T09:11:41.159 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000030.teuthology... ok
2018-11-09T09:11:41.170 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000063.teuthology: /etc/ceph/ceph.conf ok
2018-11-09T09:11:41.192 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:11:41.380 INFO:teuthology.orchestra.run.target192168000063.stdout:[6/29] Executing state ceph.admin...
2018-11-09T09:11:42.141 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:11:42.160 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000030.teuthology... ok
2018-11-09T09:11:42.360 INFO:teuthology.orchestra.run.target192168000063.stdout:[7/29] Executing state ceph.mgr.keyring...
2018-11-09T09:11:43.075 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000063.teuthology: /var/lib/ceph/mgr/ceph-target192168000063/keyring ok
2018-11-09T09:11:43.098 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:11:43.277 INFO:teuthology.orchestra.run.target192168000063.stdout:[8/29] Executing state ceph.mon...
2018-11-09T09:11:43.987 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000063.teuthology: /var/lib/ceph/tmp/keyring.mon ok
2018-11-09T09:11:57.246 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000063.teuthology: wait for mon(cephprocesses.wait) ok
2018-11-09T09:11:57.531 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:11:57.742 INFO:teuthology.orchestra.run.target192168000063.stdout:[9/29] Executing state ceph.mgr.auth...
2018-11-09T09:12:00.031 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  select.minions(cluster=ceph, roles=mgr, host=True)... ok
2018-11-09T09:12:00.817 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:12:01.014 INFO:teuthology.orchestra.run.target192168000063.stdout:[10/29] Executing state ceph.mgr...
2018-11-09T09:12:01.900 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000063.teuthology: /var/lib/ceph/mgr/ceph-target192168000063/keyring ok
2018-11-09T09:12:14.199 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000063.teuthology: wait for mgr(cephprocesses.wait) ok
2018-11-09T09:12:14.232 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:12:14.439 INFO:teuthology.orchestra.run.target192168000063.stdout:[11/29] Executing state ceph.monitoring.prometheus.exporters.ceph_exporter...
2018-11-09T09:12:39.121 INFO:teuthology.orchestra.run.target192168000063.stdout:         |_  target192168000063.teuthology: ceph exporter package(golang-github-digitalocean-ceph_exporter) ok
2018-11-09T09:12:51.488 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:12:51.696 INFO:teuthology.orchestra.run.target192168000063.stdout:[12/29] Executing state ceph.monitoring.prometheus.exporters.rbd_exporter...
2018-11-09T09:13:11.362 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:13:11.564 INFO:teuthology.orchestra.run.target192168000063.stdout:[13/29] Executing state ceph.osd.auth...
2018-11-09T09:13:13.338 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:13:13.542 INFO:teuthology.orchestra.run.target192168000063.stdout:[14/29] Executing state ceph.sysctl...
2018-11-09T09:13:14.889 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:13:15.094 INFO:teuthology.orchestra.run.target192168000063.stdout:[15/29] Executing state ceph.osd...
2018-11-09T09:14:10.816 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:14:11.021 INFO:teuthology.orchestra.run.target192168000063.stdout:[16/29] Executing state ceph.osd.grains...
2018-11-09T09:14:13.069 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000030.teuthology... ok
2018-11-09T09:14:13.085 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... ok
2018-11-09T09:14:13.275 INFO:teuthology.orchestra.run.target192168000063.stdout:[17/29] Executing state ceph.tuned.mgr...
2018-11-09T09:24:27.465 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... fail
2018-11-09T09:24:27.563 INFO:teuthology.orchestra.run.target192168000063.stdout:Finished stage ceph.stage.3: succeeded=16/29 failed=1/29
2018-11-09T09:24:27.564 INFO:teuthology.orchestra.run.target192168000063.stdout:
2018-11-09T09:24:27.564 INFO:teuthology.orchestra.run.target192168000063.stdout:Failures summary:
2018-11-09T09:24:27.564 INFO:teuthology.orchestra.run.target192168000063.stdout:ceph.tuned.mgr (/srv/salt/ceph/tuned/mgr):
2018-11-09T09:24:27.565 INFO:teuthology.orchestra.run.target192168000063.stdout:  target192168000063.teuthology:
2018-11-09T09:24:27.565 INFO:teuthology.orchestra.run.target192168000063.stdout:    apply tuned ceph mgr: Command "tuned-adm profile ceph-mgr" run
2018-11-09T09:24:27.565 INFO:teuthology.orchestra.run.target192168000063.stdout:        stdout: Operation timed out after waiting 600 seconds(s), you may try to increase timeout by using --timeout command line option or using --async.
2018-11-09T09:24:27.565 INFO:teuthology.orchestra.run.target192168000063.stdout:        stderr:
2018-11-09T09:24:29.194 ERROR:tasks.deepsea.orch:WWWW: orchestration ceph.stage.3 failed. Here comes journalctl!

@jschmid1 @mogeb Any idea why that particular step would time out?
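(For what it's worth, the error text above says tuned-adm itself supports --timeout and --async. Re-running the failing step by hand on the node, roughly as below, might show where it blocks; this is just a suggestion:)

systemctl status tuned.service --no-pager
tuned-adm active               # which profile, if any, is currently applied
tuned-adm profile ceph-mgr     # re-run the step manually and watch where it hangs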

smithfarm avatar Nov 09 '18 09:11 smithfarm

@smithfarm hmm, not really, but the tuned salt module has a couple of bugs so it could be the cause of these timeouts.

Can we examine the system at that point? For instance can we make sure that tuned is up and running on the target node? How reproducible is this issue? Does the timeout seem to happen whether tuned is deployed or not (I see it wasn't deployed in the very first message of this issue)?

mogeb avatar Nov 09 '18 13:11 mogeb

Can we examine the system at that point?

Not now, because the issue is not reliably reproducible. This is a source of transient failures in the test, so as far as the CI is concerned the "tuned: on" test case is unreliable and has to be eliminated from the CI.

I assume the issue is reproducible - it just needs multiple tries.

For instance can we make sure that tuned is up and running on the target node?

How could we verify that? It's easy to instrument the test to add verifications.

How reproducible is this issue?

Unknown. Once I encounter a transient issue in the CI, I have to eliminate the test from the CI (or in this case eliminate tuned from the CI).

Does the timeout seem to happen whether tuned is deployed or not (I see it wasn't deployed in the very first message of this issue)?

I sincerely hope not (it was deployed this time).

smithfarm avatar Nov 09 '18 13:11 smithfarm

I have tried locally in my vagrant setup but was never able to reproduce it. It would be great if I could take a look at the system when that happens.

For instance can we make sure that tuned is up and running on the target node?

How could we verify that? It's easy to instrument the test to add verifications.

systemctl status tuned
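A scriptable version of that check, if we want the test (or the state) to verify it automatically, could be something like this (a sketch, not existing test code):

if ! systemctl is-active --quiet tuned.service ; then
    echo "tuned.service is not running" >&2
    systemctl status tuned.service --no-pager >&2
    exit 1
fi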

mogeb avatar Nov 09 '18 13:11 mogeb

@mogeb Is your vagrant setup running SLE-15 and deepsea master?

smithfarm avatar Nov 09 '18 13:11 smithfarm

To recapitulate:

  1. it appears (from the test output) that Stage 3 step [17/29] Executing state ceph.tuned.mgr... runs the command tuned-adm profile ceph-mgr and this command occasionally times out.
  2. the CI will run with tuned disabled for now, to test my assumption that doing so will avoid hitting this particular failure
  3. in Tier 3, I'll add a test case with tuned enabled. This could be used in an attempt to reproduce the issue (I'll put it on my list of stuff to do in the future).

Finally:

  1. I have not seen the OSD-related timeouts since the CI was upgraded to nautilus

smithfarm avatar Nov 09 '18 13:11 smithfarm

@mogeb This might be interesting. Look at the timestamps:

2018-11-09T09:14:13.275 INFO:teuthology.orchestra.run.target192168000063.stdout:[17/29] Executing state ceph.tuned.mgr...
2018-11-09T09:24:27.465 INFO:teuthology.orchestra.run.target192168000063.stdout:         target192168000063.teuthology... fail

This state ceph.tuned.mgr is taking slightly over 600 seconds (10 minutes) to run. Examining the journalctl output, I see:

2018-11-09T09:24:37.467 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:24 target192168000063.teuthology salt-minion[3978]: [INFO    ] Executing command ['systemd-run', '--scope', 'systemctl', 'start', 'tuned.service'] in directory '/root'
2018-11-09T09:24:37.467 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:24 target192168000063.teuthology systemd[1]: Started /usr/bin/systemctl start tuned.service.
2018-11-09T09:24:37.467 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:24 target192168000063.teuthology systemd[1]: Starting Dynamic System Tuning Daemon...
2018-11-09T09:24:37.467 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:24 target192168000063.teuthology systemd[1]: Starting Authorization Manager...
2018-11-09T09:24:37.467 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:24 target192168000063.teuthology polkitd[27048]: Started polkitd version 0.114
2018-11-09T09:24:37.468 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology polkitd[27048]: Loading rules from directory /etc/polkit-1/rules.d
2018-11-09T09:24:37.468 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology polkitd[27048]: Loading rules from directory /usr/share/polkit-1/rules.d
2018-11-09T09:24:37.468 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology polkitd[27048]: Finished loading, compiling and executing 1 rules
2018-11-09T09:24:37.468 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology systemd[1]: Started Authorization Manager.
2018-11-09T09:24:37.468 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology polkitd[27048]: Acquired the name org.freedesktop.PolicyKit1 on the system bus
2018-11-09T09:24:37.468 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology systemd[1]: Started Dynamic System Tuning Daemon.
2018-11-09T09:24:37.468 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [DEBUG   ] stderr: Running scope as unit: run-raef4e37256e64fd88f161db207021403.scope
2018-11-09T09:24:37.469 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [INFO    ] Executing command ['systemctl', 'is-active', 'tuned.service'] in directory '/root'
2018-11-09T09:24:37.469 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [DEBUG   ] output: active
2018-11-09T09:24:37.469 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [INFO    ] Executing command ['systemctl', 'is-enabled', 'tuned.service'] in directory '/root'
2018-11-09T09:24:37.469 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [DEBUG   ] output: disabled
2018-11-09T09:24:37.469 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [INFO    ] Executing command ['systemctl', 'is-enabled', 'tuned.service'] in directory '/root'
2018-11-09T09:24:37.470 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [DEBUG   ] output: disabled
2018-11-09T09:24:37.470 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [INFO    ] Executing command ['systemd-run', '--scope', 'systemctl', 'enable', 'tuned.service'] in directory '/root'
2018-11-09T09:24:37.470 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology systemd[1]: Started /usr/bin/systemctl enable tuned.service.
2018-11-09T09:24:37.470 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology systemd[1]: Reloading.
2018-11-09T09:24:37.470 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [DEBUG   ] stderr: Running scope as unit: run-r9530a2f51fe64dd2b67e0afe1d1d90ee.scope
2018-11-09T09:24:37.470 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: Created symlink /etc/systemd/system/multi-user.target.wants/tuned.service -> /usr/lib/systemd/system/tuned.service.
2018-11-09T09:24:37.471 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [INFO    ] Executing command ['systemctl', 'is-enabled', 'tuned.service'] in directory '/root'
2018-11-09T09:24:37.471 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology systemd[1]: is_symlink_with_known_name(tuned.service, tuned.service) → 1
2018-11-09T09:24:37.471 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [DEBUG   ] output: enabled
2018-11-09T09:24:37.471 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [INFO    ] {'tuned': True}
2018-11-09T09:24:37.471 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [INFO    ] Completed state [tuned] at time 09:14:25.915070 (duration_in_ms=1140.212)
2018-11-09T09:24:37.471 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [DEBUG   ] LazyLoaded cmd.run
2018-11-09T09:24:37.472 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [INFO    ] Running state [tuned-adm profile ceph-mgr] at time 09:14:25.919641
2018-11-09T09:24:37.472 INFO:teuthology.orchestra.run.target192168000063.stdout:Nov 09 09:14:25 target192168000063.teuthology salt-minion[3978]: [INFO    ] Executing state cmd.run for [tuned-adm profile ceph-mgr]

So, it's pretty obvious that an attempt is made to start tuned.service before issuing the tuned-adm command, but it's not clear whether the service was actually running when the command was issued.

We have had this issue already in the past. Salt issues a systemctl command to start a service, and that command returns a 0 status code, so Salt assumes all is well. But the service fails a short while later (Salt does not notice this).

I believe that's what's happening here. One solution would be to insert a busy-wait loop which (for example) checks once per second whether the service is really up and running, and fails after some seconds if it doesn't start properly. (And runs tuned-adm profile only if the service is really up and running.)
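A minimal sketch of such a busy-wait in shell form (the 30-second limit is arbitrary; the real fix would of course live in the DeepSea state or module, not in a shell script):

systemctl start tuned.service
# poll up to 30 seconds for the service to actually come up
for i in $(seq 1 30) ; do
    systemctl is-active --quiet tuned.service && break
    sleep 1
done
if ! systemctl is-active --quiet tuned.service ; then
    echo "tuned.service failed to start" >&2
    exit 1
fi
tuned-adm profile ceph-mgr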

smithfarm avatar Nov 09 '18 14:11 smithfarm

@mogeb The first version of the previous comment contained errors, which I have now corrected. Please take another look?

smithfarm avatar Nov 09 '18 14:11 smithfarm

@smithfarm very interesting, and that would be a very logical explanation, especially if there's a precedent for a similar issue. Should I add a busy loop in the tuned-adm profile step that blocks until the service is up and running? Or do we have a better way to handle this case in deepsea?

mogeb avatar Nov 09 '18 14:11 mogeb

That's a question for @jschmid1

smithfarm avatar Nov 09 '18 14:11 smithfarm

But why would deepsea wait until the tuned-adm profile times out before starting the service? Maybe there's an issue with the way I orchestrated these steps in deepsea. As you said @jschmid1 opinion will be valuable here :)

mogeb avatar Nov 09 '18 14:11 mogeb

@mogeb It doesn't wait. See the journalctl excerpt in https://github.com/SUSE/DeepSea/issues/1382#issuecomment-437369166

It:

  1. executes systemctl start tuned.service, gets 0 status back
  2. runs tuned-adm profile

in that order. That works some (most?) of the time, but occasionally the service does not start (for reasons unknown) and that causes the timeout.

The problem, as I see it, is that the Salt state service.running is misleadingly named. It doesn't actually check that the service is running, and it doesn't wait. It gets a 0 status back from "systemctl start" and is satisfied with that.

smithfarm avatar Nov 09 '18 14:11 smithfarm

tuned-adm profile starts at 09:14:13.275 and times out at 09:24:27.465. DeepSea then tries to start the service at 09:24:37.467... or is it at 09:14:24 (I just realized there are two timestamps on the same line)? If it's the latter, then yes, you're right, and it does indeed look like a race condition.

mogeb avatar Nov 09 '18 14:11 mogeb