ceph-medic

checks: check for OSD suicide timeouts

Open haklein opened this issue 7 years ago • 2 comments

OSDs can hit suicide timeouts for different reasons. It would be great if ceph-medic could highlight such events from the OSD log files.

haklein, Sep 21 '17 14:09

Example messages for thread timeouts:

2015-11-18 18:22:05.040871 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had timed out after 60
2015-11-18 18:22:05.040875 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had suicide timed out after 60

This can happen in different threads (filestore, op, disk, etc.), so the check should be very generic and match only on the phrases "had timed out after" and "had suicide timed out after".
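A minimal sketch of what such a generic check could look like, assuming log lines follow the format of the two examples above. The names `TIMEOUT_RE`, `TimeoutEvent`, and `find_timeouts` are hypothetical and not part of ceph-medic; the pattern keys only on the two phrases, so it does not depend on which thread pool reported the timeout.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Hypothetical pattern: matches "'<thread>' had timed out after N" and
# "'<thread>' had suicide timed out after N", regardless of the thread name.
TIMEOUT_RE = re.compile(
    r"'(?P<thread>[^']+)' had (?P<suicide>suicide )?timed out after (?P<seconds>\d+)"
)


class TimeoutEvent(NamedTuple):
    thread: str    # quoted thread description, e.g. "OSD::disk_tp thread 0x..."
    suicide: bool  # True for "had suicide timed out", False for "had timed out"
    seconds: int   # the timeout value reported in the message


def find_timeouts(lines: Iterable[str]) -> Iterator[TimeoutEvent]:
    """Yield one TimeoutEvent per matching line, reading lines lazily."""
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            yield TimeoutEvent(
                thread=match.group("thread"),
                suicide=match.group("suicide") is not None,
                seconds=int(match.group("seconds")),
            )
```

Because `find_timeouts` accepts any iterable of lines, an open file object can be passed directly (for example `with open(path) as f: events = list(find_timeouts(f))`, where `path` is whatever OSD log file is being inspected), so the log is streamed line by line rather than loaded into memory.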

haklein, Sep 21 '17 14:09

Is it possible to avoid looking at log files? Ceph logs can be tremendously large. Ideally, there would be some command that could tell us this.

alfredodeza, Sep 21 '17 14:09