ceph.stage.discovery fails if disk doesn't have entry in /dev

Open alexdepalex opened this issue 5 years ago • 0 comments

Description of Issue/Question

During the discovery phase, proposal.populate fails for just one node.

proposal.generate:
  nodea:
    The minion function caused an exception: Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/salt/minion.py", line 1455, in _thread_return
        return_data = executor.execute()
      File "/usr/lib/python2.7/site-packages/salt/executors/direct_call.py", line 28, in execute
        return self.func(*self.args, **self.kwargs)
      File "/var/cache/salt/minion/extmods/modules/proposal.py", line 262, in generate
        disks = cephdisks.list_(**kwargs)
      File "/var/cache/salt/minion/extmods/modules/cephdisks.py", line 480, in list_
        return hwd.assemble_device_list()
      File "/var/cache/salt/minion/extmods/modules/cephdisks.py", line 472, in assemble_device_list
        self._preflight_check(hardware)
      File "/var/cache/salt/minion/extmods/modules/cephdisks.py", line 420, in _preflight_check
        raise ValueError("{} is not included in the hardware dict.".format(rf))
    ValueError: Capacity is not included in the hardware dict.

When debugging cephdisks.py we saw the following:

2018-12-10 12:56:38,947 No partitions detected on sdd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/qxn2533/cephdisks.py", line 478, in assemble_device_list
    self._preflight_check(hardware)
  File "/home/qxn2533/cephdisks.py", line 426, in _preflight_check
    raise ValueError("{} is not included in the hardware dict.".format(rf))
ValueError: Capacity is not included in the hardware dict.

The culprit seems to be in assemble_device_list and _query_disktype. The former uses a glob on /sys/block/*/device to find devices; the latter uses smartctl to get disk info. In our case, the device existed under /sys/block, but because the disk was in a partially failed state, it had no entry in /dev. As a result, smartctl was run with a nonexistent device name, for which it conveniently returns exit code 0.
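The mismatch between /sys/block and /dev can be checked before smartctl is ever called. The following is a minimal sketch, not the DeepSea implementation: the function name split_by_dev_node and its two directory parameters are hypothetical, introduced only so the check can be run against test directories.

```python
import glob
import os


def split_by_dev_node(sys_block="/sys/block", dev_dir="/dev"):
    """Return (present, missing) lists of block device names.

    Hypothetical helper: a device counts as present only if it shows up
    under <sys_block>/*/device AND has a matching node in <dev_dir>.
    """
    present, missing = [], []
    pattern = os.path.join(sys_block, "*", "device")
    for path in sorted(glob.glob(pattern)):
        # path looks like /sys/block/sdd/device; the parent directory
        # name is the kernel device name (sdd).
        name = os.path.basename(os.path.dirname(path))
        if os.path.exists(os.path.join(dev_dir, name)):
            present.append(name)
        else:
            missing.append(name)
    return present, missing
```

Devices returned in the second list are the ones that would be handed to smartctl without a usable device node, so they could be skipped or reported instead of queried.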

The developers already anticipated this, but the actual parsing code has not yet been added to cephdisks.py.

            for line in proc.stdout:
                # ADD PARSING HERE TO DETECT FAILURE
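One way the placeholder above could be filled in is to scan the smartctl output for failure text instead of trusting the return code. This is only a sketch under assumptions: first_failure_line is a hypothetical helper, and the default marker strings are guesses that should be checked against what smartctl actually prints on the affected system.

```python
def first_failure_line(lines, markers=("failed:", "No such device")):
    """Return the first output line containing a failure marker, else None.

    Hypothetical helper; the default markers are assumptions, not a
    documented list of smartctl error strings.
    """
    for raw in lines:
        # proc.stdout may yield bytes; decode so the substring check
        # works for both bytes and text input.
        line = raw.decode("utf-8", "replace") if isinstance(raw, bytes) else raw
        if any(marker in line for marker in markers):
            return line.strip()
    return None
```

Called with proc.stdout, a non-None result would let _query_disktype skip or flag the device early, rather than letting _preflight_check fail later on the missing Capacity entry.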

Versions Report

SLES12SP2
deepsea-0.8.5-2.16.1.noarch
salt-2016.11.4-46.20.2.x86_64
ses-release-POOL-5-1.54.x86_64
salt-minion-2016.11.4-46.20.2.x86_64
salt-api-2016.11.4-46.20.2.x86_64
ses-release-5-58.1.x86_64
salt-master-2016.11.4-46.20.2.x86_64

alexdepalex, Dec 10 '18 12:12