ceph-medic
checks: check for OSD suicide timeouts
OSDs can hit suicide timeouts for different reasons. It would be great if ceph-medic could highlight such events from the OSD log files.
Example messages for thread timeouts:
2015-11-18 18:22:05.040871 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had timed out after 60
2015-11-18 18:22:05.040875 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had suicide timed out after 60
There are different threads where this can happen (filestore, op, disk, ...), so the check should be very generic and match only on the phrases "had timed out after" and "had suicide timed out after".
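A minimal sketch of what such a generic match could look like, assuming the log format shown in the example messages. The names `TIMEOUT_RE` and `find_timeouts` are hypothetical and not part of ceph-medic; the function takes any iterable of lines, so a log file can be streamed through it without loading it into memory.

```python
import re

# Hypothetical pattern: matches both "had timed out after N" and
# "had suicide timed out after N", regardless of which thread is named.
TIMEOUT_RE = re.compile(
    r"'(?P<thread>[^']+)' had (?P<suicide>suicide )?timed out after (?P<seconds>\d+)"
)


def find_timeouts(lines):
    """Yield (thread, is_suicide, seconds) for every matching log line."""
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            yield (
                match.group("thread"),
                match.group("suicide") is not None,
                int(match.group("seconds")),
            )
```

Passing an open file object (for example an OSD log) to `find_timeouts` iterates line by line, and the boolean in each result distinguishes a plain thread timeout from a suicide timeout.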
Is it possible to do this without having to look at log files? Ceph logs can be tremendously large, so ideally there would be some command that could tell us this.