DeepSea
stage.0 does not go through in case of a lost OSD
We had a lost OSD on one server (broken disk), and because of that the OSD service is not running.
In this scenario we had to execute stage.0 because of some missing patches, and stage.0 got stuck "forever".
On the salt-master we can see this while the stage.0 orchestration hangs:
20180226221953703930:
    ----------
    Arguments:
        - ceph.processes
        |_
          ----------
          __kwarg__:
              True
          concurrent:
              False
          queue:
              False
          saltenv:
              base
    Function:
        state.sls
    Returned:
        - osd03-p.ses.intern.thomas-krenn.com
        - osd04-p.ses.intern.thomas-krenn.com
        - admin-p.ses.intern.thomas-krenn.com
        - osd02-p.ses.intern.thomas-krenn.com
    Running:
        |_
          ----------
          osd01-p.ses.intern.thomas-krenn.com:
              1110284
    StartTime:
        2018, Feb 26 22:19:53.703930
    Target:
        *
    Target-type:
        compound
    User:
        salt
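(For context: output in this shape is what Salt's job runner reports, for example via `salt-run jobs.active`. The snippet below is only a hedged sketch of gathering the same information from Python on the master, purely for illustration; it is not part of DeepSea.)

```python
# Hedged sketch: one way to list jobs still running on the master, which
# produces output in the shape shown above. Assumes it runs on the
# salt-master with read access to /etc/salt/master.
import salt.config
import salt.runner

opts = salt.config.client_config('/etc/salt/master')
runner = salt.runner.RunnerClient(opts)

# jobs.active returns a dict keyed by job ID, including the 'Running' and
# 'Returned' minion lists seen above.
for jid, job in runner.cmd('jobs.active', print_event=False).items():
    print(jid, job.get('Function'), 'still running on:', job.get('Running'))
```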
When looking at syslog on the server that hangs (osd01-p in the above example) we see this:
salt-minion[3886]: message repeated 19 times: [ [ERROR ] ERROR: At least one OSD is not running]
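The repeated message points to a polling loop on the minion that waits for all ceph-osd processes to come up. A minimal sketch of what such a loop could look like (the helper names, timeout, and interval are assumptions for illustration, not DeepSea's actual ceph.processes / ceph.wait code):

```python
# Illustrative sketch only: poll until a ceph-osd process exists for every
# expected OSD id, or a timeout expires. With a dead disk the process for
# that OSD never appears, so the loop logs the error until the timeout hits.
import logging
import subprocess
import time

log = logging.getLogger(__name__)

def osds_running(expected_ids):
    """Return True if a ceph-osd process exists for every expected OSD id."""
    out = subprocess.run(['pgrep', '-a', 'ceph-osd'],
                         capture_output=True, text=True).stdout
    running = set()
    for line in out.splitlines():
        tokens = line.split()
        if '--id' in tokens[:-1]:
            running.add(tokens[tokens.index('--id') + 1])
    return {str(i) for i in expected_ids} <= running

def wait_for_osds(expected_ids, timeout=900, interval=30):
    """Assumed timeout/interval values, for illustration only."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if osds_running(expected_ids):
            return True
        log.error('ERROR: At least one OSD is not running')
        time.sleep(interval)
    return False
```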
So the assumption is that stage.0 does not work when some OSDs in the cluster are broken.
Is this assumption correct, and can it be fixed?
That's the expected behavior and part of the precautions that have to be taken during critical situations.
- In this case (if that is expected) it should error out instead of hanging forever while waiting for all services to be up (in our case it was waiting forever on the start of a lost OSD - 45 - a physically dead / lost disk - you might remember ;-))
- Why don't we allow patching in case of a lost OSD? Keep in mind that big clusters always have faults, and while that is the case, maintenance must still be possible.
Thoughts?
Martin-Weiss [email protected] wrote on Wed, 28. Feb 13:16:
- In this case (if that is expected) it should error out instead of hanging forever while waiting for all services to be up (in our case it was waiting forever on the start of a lost OSD - 45 - a physically dead / lost disk - you might remember ;-))
It actually doesn't wait forever. It's 900 seconds on bare-metal and 120 on virtualized machines. A QoL improvement that I see is to add an initial check, without the timeout, that acts like a 'validate'.
- Why don't we allow patching in case of a lost OSD? Keep in mind that big clusters always have faults, and while that is the case, maintenance must still be possible.
We haven't had any feedback on whether people actually like the enforced precaution measures or not. If real life shows us that it's simply not realistic to always have a 'clean' cluster, we have to add flags for the user to disable the strict checks, which comes with a certain risk.
Maybe there should be a 'tolerable' amount of down/out OSDs.
Thoughts?
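A hedged sketch of how the two ideas above could combine - a fail-fast pre-check that runs once, with a 'tolerable' number of down OSDs - rather than anything that exists in DeepSea today. The ceph command and its JSON fields are real; the threshold option and the surrounding wiring are assumptions:

```python
# Sketch of a fail-fast pre-check: instead of polling for up to 900 seconds,
# query the cluster once and abort immediately if more OSDs are down than
# the operator tolerates. 'ceph osd dump --format=json' is real Ceph; the
# threshold handling is an assumption for illustration.
import json
import subprocess
import sys

def down_osd_count():
    dump = json.loads(subprocess.run(
        ['ceph', 'osd', 'dump', '--format=json'],
        capture_output=True, text=True, check=True).stdout)
    return sum(1 for osd in dump['osds'] if not osd['up'])

def validate(tolerated_down=0):
    down = down_osd_count()
    if down > tolerated_down:
        sys.exit('Aborting: {} OSD(s) down, {} tolerated. Repair or remove '
                 'them (or raise the threshold) before running the stage.'
                 .format(down, tolerated_down))

if __name__ == '__main__':
    validate(tolerated_down=int(sys.argv[1]) if len(sys.argv) > 1 else 0)
```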
On 28.02.2018 at 14:30, Joshua Schmid [email protected] wrote:
Martin-Weiss [email protected] wrote on Wed, 28. Feb 13:16:
- In this case (if that is expected) it should error out instead of hanging forever while waiting for all services to be up (in our case it was waiting forever on the start of a lost OSD - 45 - a physically dead / lost disk - you might remember ;-))
It actually doesn't wait forever. It's 900 seconds on bare-metal and 120 on virtualized machines. A QoL improvement that I see is to add an initial check, without the timeout, that acts like a 'validate'.
OK - 15 minutes is a very long time - "forever" in IT ;-))
Maybe we can surface that somehow, as hanging for so long on each host that has a problem might not be what an admin expects, and finding the reason for the hang is not trivial.
Yes - a pre-check, similar to the firewall / AppArmor checks, might help, too.
- Why don't we allow patching in case of a lost OSD? Keep in mind that big clusters always have faults, and while that is the case, maintenance must still be possible.
We haven't had any feedback on whether people actually like the enforced precaution measures or not. If real life shows us that it's simply not realistic to always have a 'clean' cluster, we have to add flags for the user to disable the strict checks, which comes with a certain risk.
Maybe there should be a 'tolerable' amount of down/out OSDs.
This is nothing we can decide automatically in software, I believe. In a large multi-datacenter cluster this is different than in a small cluster, and it might also be relevant how many replicas of an impacted pool would be affected, etc.
disengage.safety might be a way.
Martin
This is nothing we can decide automatically in software, I believe. In a large multi-datacenter cluster this is different than in a small cluster, and it might also be relevant how many replicas of an impacted pool would be affected, etc.
Right, that's what I meant. You have to get it right, manually.
OK - 15 minutes is a very long time - "forever" in IT ;-))
Maybe we can surface that somehow, as hanging for so long on each host that has a problem might not be what an admin expects, and finding the reason for the hang is not trivial.
That would no longer be necessary if we add the pre-validation for down services.
We should consider adding this as a QoL improvement.
With regards to "15 minutes is forever" - yeah, I know. Still, I have seen the worst-case BIOS + RAID BIOS + other BIOS + actual boot time come close to that for some servers during a reboot.
If this is on an OSD, I wonder if we are hitting a different timeout, though. We have the ceph.wait states, which wait 5 minutes in some cases and an hour (or several) in other cases. These are more or less continually polling (as in, once a minute).
The general fear is that making the timeout too short leaves the administrator constantly restarting the same steps. If we can make ceph.wait more intelligent (e.g. verify that it appears to be progressing), that may help. We already do a moving window when emptying an OSD: we check that the PGs keep changing at each interval. I do not know what we could use in this case.
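One hedged sketch of such a progress-aware wait: keep polling only while the PG state summary keeps changing between intervals, and give up once it stops moving. The ceph status fields are real; the policy itself is only an illustration, not the existing ceph.wait:

```python
# Illustrative sketch of a progress-aware wait: rather than a fixed overall
# timeout, keep waiting only while the cluster still appears to be making
# progress, i.e. the PG state summary changes between polls.
import json
import subprocess
import time

def pg_state_summary():
    """Snapshot of PG states, e.g. {'active+clean': 4096, 'peering': 12}."""
    status = json.loads(subprocess.run(
        ['ceph', 'status', '--format=json'],
        capture_output=True, text=True, check=True).stdout)
    return {s['state_name']: s['count']
            for s in status['pgmap'].get('pgs_by_state', [])}

def wait_while_progressing(interval=60, stalled_polls=5):
    """Poll once a minute; give up once nothing changed for several polls."""
    last, stalled = None, 0
    while stalled < stalled_polls:
        current = pg_state_summary()
        if set(current) == {'active+clean'}:
            return True                      # all PGs clean, done
        stalled = stalled + 1 if current == last else 0
        last = current
        time.sleep(interval)
    return False                             # stalled, hand back to the admin
```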
tackled with #1174