ceph.restart orchestration could check whether all needed roles are deployed and show an improved error message if any are missing

Open smithfarm opened this issue 5 years ago • 5 comments

DeepSea master branch, tip is 12965311c0c6f4ad69e5854f498907df7f9d1cea

On a single-node SLE15/SES6 cluster with 4 external drives, I run Stages 0-3, get HEALTH_OK.

Then I do salt-run state.orch ceph.smoketests and this fails with the following error blob:

target147135133120.teuthology_master:
    Data failed to compile:
----------
    Rendering SLS 'base:ceph.smoketests.restart.mds' failed: mapping values are not allowed here; line 9

---
[...]



reset systemctl initially for mds:
  salt.state:
    - tgt: Exception occurred in runner select.one_minion: Traceback (most recent call last):    <======================
  File "/usr/lib/python3.6/site-packages/salt/client/mixins.py", line 387, in _low
    data['return'] = self.functions[fun](*args, **kwargs)
  File "/srv/modules/runners/select.py", line 96, in one_minion
    return ret[0]
IndexError: list index out of range
[...]
---
----------
    Rendering SLS 'base:ceph.smoketests.restart.rgw' failed: mapping values are not allowed here; line 9

---
[...]


reset systemctl initially for rgw:
  salt.state:
    - tgt: Exception occurred in runner select.one_minion: Traceback (most recent call last):    <======================
  File "/usr/lib/python3.6/site-packages/salt/client/mixins.py", line 387, in _low
    data['return'] = self.functions[fun](*args, **kwargs)
  File "/srv/modules/runners/select.py", line 96, in one_minion
    return ret[0]
IndexError: list index out of range
[...]
---

smithfarm, Jul 24 '18 13:07

@jschmid1 tells me this is because the ceph.restart orchestration requires a cluster with mds and rgw roles deployed.

The above error occurs when these roles are absent.
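To illustrate the failure mode: the traceback shows the runner returning `ret[0]`, which raises `IndexError` when no minion holds the requested role. A minimal sketch of a guarded selection follows. The names `pick_one_minion` and `NoMinionForRole` are hypothetical and are not part of DeepSea; `matches` stands in for the list that the runner indexes.

```python
class NoMinionForRole(Exception):
    """Raised when no minion matches the requested role (hypothetical name)."""


def pick_one_minion(matches, role):
    """Return the first matching minion, or fail with a readable message.

    `matches` is assumed to be the list of minion IDs found for `role`;
    in the traceback above, this is the list indexed with ret[0].
    """
    if not matches:
        # An empty list is what triggers "IndexError: list index out of range".
        raise NoMinionForRole(
            "no minion with role '{}' found; is the role deployed?".format(role)
        )
    return matches[0]
```

With a check like this, the rendering failure would at least name the missing role instead of surfacing a bare `IndexError` inside the SLS.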

smithfarm, Jul 24 '18 14:07

Right, because the smoketests do not check whether the roles are actually deployed.

It's worth discussing whether they should.

jschmid1, Jul 25 '18 08:07

I think we solved that by re-implementing the way we run those restart tests.

jschmid1, Jul 31 '18 11:07

Yes, the CI can now run these tests, but I am reopening the issue to track the problematic error handling.

smithfarm, Jul 31 '18 11:07

This could be resolved by implementing a validate runner for the functests and triggering it in init.sls (similar to how it is triggered by the stage orchestrations).
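A rough sketch of what such a pre-flight check might look like, in plain Python. Everything here is an assumption for illustration: `REQUIRED_ROLES`, `missing_roles`, and `validate_roles` are hypothetical names, the role list is taken only from the two roles mentioned in this thread, and the real validate runner in DeepSea may be structured quite differently.

```python
# Roles named in this thread; the real set of required roles may differ.
REQUIRED_ROLES = ("mds", "rgw")


def missing_roles(role_to_minions, required=REQUIRED_ROLES):
    """Return the required roles that have no minion assigned."""
    return [role for role in required if not role_to_minions.get(role)]


def validate_roles(role_to_minions, required=REQUIRED_ROLES):
    """Return (ok, message) so the caller can abort early with a clear error.

    `role_to_minions` is assumed to map a role name to a list of minion IDs.
    """
    missing = missing_roles(role_to_minions, required)
    if missing:
        return False, "restart tests need these roles deployed: " + ", ".join(missing)
    return True, "all required roles are deployed"
```

For example, `validate_roles({"mds": ["node1"], "rgw": []})` reports `rgw` as missing, so the run could stop before any SLS file is rendered.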

smithfarm, Nov 10 '18 18:11