
stage.0 does not go through in case of a lost osd

Martin-Weiss opened this issue 7 years ago • 7 comments

We had a lost OSD on one server (broken disk), and because of that the OSD service is not running.

In this scenario we had to execute stage.0 because of some missing patches, and stage.0 got stuck "forever".

On the salt-master we can see this while the stage.0 orchestration hangs:

20180226221953703930:
    ----------
    Arguments:
        - ceph.processes
        |_
          ----------
          __kwarg__:
              True
          concurrent:
              False
          queue:
              False
          saltenv:
              base
    Function:
        state.sls
    Returned:
        - osd03-p.ses.intern.thomas-krenn.com
        - osd04-p.ses.intern.thomas-krenn.com
        - admin-p.ses.intern.thomas-krenn.com
        - osd02-p.ses.intern.thomas-krenn.com
    Running:
        |_
          ----------
          osd01-p.ses.intern.thomas-krenn.com:
              1110284
    StartTime:
        2018, Feb 26 22:19:53.703930
    Target:
        *
    Target-type:
        compound
    User:
        salt
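Output like the above can be pulled from the master while the orchestration is stuck, either with `salt-run jobs.active` or through Salt's Python runner client; a minimal sketch, assuming the default master config path:

```python
# Minimal sketch: list the job(s) still running and which minions have not
# returned yet. `salt-run jobs.active` on the CLI gives the same information.
import salt.config
import salt.runner

opts = salt.config.master_config('/etc/salt/master')
runner = salt.runner.RunnerClient(opts)

for jid, job in runner.cmd('jobs.active', []).items():
    print(jid, job.get('Function'), job.get('Arguments'))
    for entry in job.get('Running', []):
        for minion, pid in entry.items():
            print('  still running on %s (pid %s)' % (minion, pid))
```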

When looking at syslog on the server that hangs (osd01-p in the above example) we see this:

salt-minion[3886]: message repeated 19 times: [ [ERROR ] ERROR: At least one OSD is not running]
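A rough way to confirm by hand which OSDs are down (the condition the ceph.processes check keeps tripping over) is to look at the cluster map; a sketch using the plain `ceph osd tree -f json` output, run on any node with a working client keyring:

```python
# Rough sketch: list OSDs the cluster map reports as down. This is not the
# exact check that ceph.processes performs, just a quick manual cross-check.
import json
import subprocess

tree = json.loads(subprocess.check_output(['ceph', 'osd', 'tree', '-f', 'json']))
down = [node['name'] for node in tree['nodes']
        if node.get('type') == 'osd' and node.get('status') == 'down']
print('down OSDs:', ', '.join(down) or 'none')
```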

So the assumption is that stage.0 does not work in case some OSDs in the cluster are broken.

Is this assumption true and can this be fixed?

Martin-Weiss avatar Feb 26 '18 21:02 Martin-Weiss

That's the expected behavior and part of the precautions that have to be taken during critical situations.

jschmid1 avatar Feb 28 '18 12:02 jschmid1

  1. In this case (if that is expected), it should error out instead of hanging forever while waiting for all services to come up (in our case it was waiting forever for the start of a lost OSD - 45 - a physically dead / lost disk - you might remember ;-))

  2. Why don't we allow patching in case of a lost OSD? Keep in mind that big clusters always have faults - and even while that is the case, maintenance must still be possible.

Thoughts?

On 28.02.2018 at 13:25, Joshua Schmid [email protected] wrote:

That's the expected behavior and part of the precautions that have to be taken during critical situations.


Martin-Weiss avatar Feb 28 '18 13:02 Martin-Weiss

Martin-Weiss [email protected] wrote on Wed, 28. Feb 13:16:

  1. In this case (if that is expected), it should error out instead of hanging forever while waiting for all services to come up (in our case it was waiting forever for the start of a lost OSD - 45 - a physically dead / lost disk - you might remember ;-))

It actually doesn't wait forever. It's 900 seconds on bare metal and 120 seconds on virtualized machines. A QOL improvement that I see is to add an initial check that has no timeout and acts like a 'validate'.

  2. Why don't we allow patching in case of a lost OSD? Keep in mind that big clusters always have faults - and even while that is the case, maintenance must still be possible.

We haven't had any feedback on whether people actually like the enforced precaution measures or not. If real life shows us that it's simply not realistic to always have a 'clean' cluster, we have to add flags for the user to disable the strict checks.. which comes with a certain risk..

Maybe there should be a 'tolerable' number of down/out OSDs.
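A purely hypothetical illustration of such a threshold (the `TOLERABLE_DOWN` value and the idea of reading it from the pillar are assumptions, not an existing DeepSea option):

```python
# Hypothetical illustration of a 'tolerable down OSDs' threshold; the value
# would presumably come from the pillar in a real implementation. Needs the
# ceph CLI and a working client keyring on the node running it.
import json
import subprocess

TOLERABLE_DOWN = 2  # assumption for illustration

dump = json.loads(subprocess.check_output(['ceph', 'osd', 'dump', '-f', 'json']))
down = [o['osd'] for o in dump['osds'] if not o['up']]

if len(down) > TOLERABLE_DOWN:
    raise SystemExit('refusing to continue: %d OSDs down %s, tolerance is %d'
                     % (len(down), down, TOLERABLE_DOWN))
print('%d OSD(s) down, within the tolerated limit of %d' % (len(down), TOLERABLE_DOWN))
```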

Thoughts?

jschmid1 avatar Feb 28 '18 13:02 jschmid1

On 28.02.2018 at 14:30, Joshua Schmid [email protected] wrote:

Martin-Weiss [email protected] wrote on Wed, 28. Feb 13:16:

  1. In this case (if that is expected), it should error out instead of hanging forever while waiting for all services to come up (in our case it was waiting forever for the start of a lost OSD - 45 - a physically dead / lost disk - you might remember ;-))

It actually doesn't wait forever. It's 900 seconds on bare metal and 120 seconds on virtualized machines. A QOL improvement that I see is to add an initial check that has no timeout and acts like a 'validate'.

Ok - 15 minutes is a very long time - "forever in IT" ;-))

Maybe we can surface that somehow, as a "hang" of that length on each host that has a problem might not be what an admin expects, and finding the reason for the hang is not trivial..

Yes - a pre-check similar to the firewall / AppArmor checks might help, too.

  2. Why don't we allow patching in case of a lost OSD? Keep in mind that big clusters always have faults - and even while that is the case, maintenance must still be possible.

We haven't had any feedback on whether people actually like the enforced precaution measures or not. If real life shows us that it's simply not realistic to always have a 'clean' cluster, we have to add flags for the user to disable the strict checks.. which comes with a certain risk..

Maybe there should be a 'tolerable' number of down/out OSDs.

This is nothing we can decide automatically in software, I believe. In a large multi-datacenter cluster this is different than in a small cluster, and it also might matter how many replicas of an impacted pool would get affected, etc..

Disengage.safety might be a way..
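A minimal sketch of invoking that override through Salt's Python runner client (the usual CLI entry point is `salt-run disengage.safety`; whether it actually relaxes the stage.0 process checks is not established in this thread):

```python
# Sketch only: call DeepSea's disengage.safety runner from Python, assuming
# the default master config path. Whether this affects the stage.0
# ceph.processes check is an open question here.
import salt.config
import salt.runner

opts = salt.config.master_config('/etc/salt/master')
salt.runner.RunnerClient(opts).cmd('disengage.safety', [])
```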

Martin


Martin-Weiss avatar Feb 28 '18 13:02 Martin-Weiss

This is nothing we can decide automatically in software, I believe. In a large multi-datacenter cluster this is different than in a small cluster, and it also might matter how many replicas of an impacted pool would get affected, etc..

Right, that's what I meant. You have to get it right manually.

Ok - 15 minutes is a very long time - "forever in IT" ;-))

Maybe we can surface that somehow, as a "hang" of that length on each host that has a problem might not be what an admin expects, and finding the reason for the hang is not trivial..

That wouldn't be necessary anymore if we add the pre-validation for down services.

We should consider adding this as a QOL improvement.
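A rough sketch of what such a fail-fast pre-validation could look like (not DeepSea's actual code; the target glob and the systemctl-based probe are assumptions): ask every storage minion once for failed ceph-osd units and abort with a readable message instead of letting the check poll into its timeout.

```python
# Illustrative sketch of the proposed pre-validation, not DeepSea code:
# ask the storage minions once for failed ceph-osd units and bail out
# immediately with a readable message. Run on the Salt master.
import salt.client

local = salt.client.LocalClient()
failed = local.cmd(
    'osd*',  # target glob is an assumption
    'cmd.run',
    ["systemctl list-units 'ceph-osd@*' --state=failed --no-legend"],
)

problems = {minion: out for minion, out in failed.items() if out.strip()}
if problems:
    raise SystemExit('pre-check failed, broken OSD services:\n%s' %
                     '\n'.join('%s: %s' % item for item in sorted(problems.items())))
print('no failed ceph-osd units found, safe to run stage.0')
```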

jschmid1 avatar Feb 28 '18 14:02 jschmid1

With regards to the 15 minutes being forever - yeah, I know. Still, I have seen the worst case of BIOS + RAID BIOS + other BIOS + actual boot time come close to that for some servers during a reboot.

If this is on an OSD, I wonder if we are hitting a different timeout, though. We have the ceph.wait states, which wait 5 minutes in some cases and an hour (or multiple) in other cases. These are more or less continually polling (as in, once a minute).

The general fear is that making the timeout too short leaves the administrator constantly restarting the same steps. If we can make ceph.wait more intelligent (e.g. check that it appears to be progressing), that may help. We do a moving window when emptying an OSD: we check that the PGs keep changing at each interval. I do not know what we could use in this case.
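The "keeps progressing" idea could be generalized into a wait that only times out once the observed value stops changing; a hedged sketch (the probe shown, counting running ceph-osd processes, is just an assumption for illustration, not what ceph.wait actually samples):

```python
# Sketch of a progress-aware wait: poll a probe once a minute, succeed when
# the done() condition holds, and only give up after the observed value has
# stayed unchanged for `stall_limit` consecutive polls.
import subprocess
import time


def count_running_osds():
    # Illustrative probe: number of ceph-osd processes on this host.
    result = subprocess.run(['pgrep', '-c', 'ceph-osd'],
                            capture_output=True, text=True)
    return int(result.stdout.strip() or 0)


def wait_until(probe, done, interval=60, stall_limit=15):
    last, stalled = probe(), 0
    while not done(last):
        if stalled >= stall_limit:
            raise TimeoutError('no progress for %d polls, last value %r'
                               % (stall_limit, last))
        time.sleep(interval)
        current = probe()
        stalled = stalled + 1 if current == last else 0
        last = current
    return last


# Example (EXPECTED is hypothetical):
#   wait_until(count_running_osds, lambda n: n >= EXPECTED)
```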

swiftgist avatar Mar 01 '18 21:03 swiftgist

tackled with #1174

jschmid1 avatar Aug 29 '18 10:08 jschmid1