
MicroCeph abrupt OSD daemon shutdown

Open: UtkarshBhatthere opened this issue 1 year ago

Issue report

What happens (observed behaviour)?

Abrupt Ceph OSD daemon shutdown.

What were you expecting to happen?

Ceph services should not die.

Relevant logs, error output, etc.

Relevant logs: osd0.log

-- UtkarshBhatthere, Feb 26 '24 07:02

Here are the unabridged logs (too large to upload directly to GitHub): https://drive.google.com/file/d/1_jHzVzPzUmXgCH6YWQrMsVl5LL7qIeMO/view?usp=sharing

-- slapcat, Feb 26 '24 15:02

I realize this was a long time ago -- apologies for not looking at this sooner.

From the mon log I can e.g. see messages about slow ops and a leader election delay of around 70s at about 2024-02-18T01:33:00:

2024-02-18T01:34:03.851+0000 7fe970160640  0 log_channel(cluster) log [DBG] : mds.0 [v2:10.10.0.40:6800/3127883264,v1:10.10.0.40:6801/3127883264] up:active
2024-02-18T01:34:03.851+0000 7fe970160640  1 mon.meatwad@0(leader).mds e414 check_health: resetting beacon timeouts due to mon delay (slow election?) of 70.2781 seconds
2024-02-18T01:34:03.851+0000 7fe970160640  0 log_channel(cluster) log [DBG] : fsmap nabasny:1 {0=meatwad=up:active}
2024-02-18T01:34:05.455+0000 7fe97516a640 -1 mon.meatwad@0(leader) e2 get_health_metrics reporting 1 slow ops, oldest is log(1 entries from seq 347245 at 2024-02-18T01:33:15.790258+0000)
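
For context, the "resetting beacon timeouts due to mon delay" message above means the mon noticed it had itself been stalled (here by the ~70s election) and therefore reset the MDS beacon deadlines instead of treating the daemon as laggy. Here is a minimal sketch of that idea, with hypothetical names rather than the actual MDSMonitor code:

// Illustrative only: if the monitor's own check loop was delayed longer than
// the beacon grace period, missed beacons are the mon's fault, so reset the
// per-daemon deadlines rather than marking daemons laggy.
#include <chrono>
#include <iostream>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

struct BeaconTracker {
  std::map<std::string, Clock::time_point> last_beacon;  // last beacon per daemon
  Clock::time_point last_tick = Clock::now();            // last time this check ran

  void check_health(std::chrono::seconds beacon_grace) {
    const auto now = Clock::now();
    const auto mon_delay = now - last_tick;
    last_tick = now;
    if (mon_delay > beacon_grace) {
      std::cout << "resetting beacon timeouts due to mon delay of "
                << std::chrono::duration_cast<std::chrono::seconds>(mon_delay).count()
                << "s\n";
      for (auto& entry : last_beacon)
        entry.second = now;  // give every daemon a fresh deadline
    }
  }
};

int main() {
  BeaconTracker t;
  t.last_beacon["mds.meatwad"] = Clock::now();
  t.last_tick = Clock::now() - std::chrono::seconds(70);  // simulate a ~70s stall
  t.check_health(std::chrono::seconds(16));               // grace value is an assumption
}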

Also in the OSD logs I can see heartbeat timeouts. Subsequently the OSDs committed suicide, i.e. terminated themselves:

ceph-osd.0.log.txt:2024-02-18T01:34:11.640+0000 7f27f6cda640  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f27dba47640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:43.357+0000 7f27dd24a640  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27dd24a640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:43.357+0000 7f27dca49640  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27dca49640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:43.357+0000 7f27dc248640  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27dc248640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:57.729+0000 7f27db246640  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27db246640' had suicide timed out after 150.000000000s
...
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")

This is expected behaviour for OSDs: when an OSD detects that an operation or worker thread has made no progress within the configured "suicide timeout", it deliberately aborts so that the cluster fails over to healthy OSDs rather than limping along with a wedged daemon.
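
For what it's worth, here is a simplified sketch of the suicide-timeout mechanism (not the actual src/common/HeartbeatMap.cc; the names and the 15s grace value are assumptions, while the 150s suicide value matches the log above):

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>

using Clock = std::chrono::steady_clock;

// Each worker thread owns a handle and "touches" it whenever it makes progress.
struct HeartbeatHandle {
  std::string name;                   // e.g. "OSD::osd_op_tp thread 0x..."
  Clock::time_point last_touch = Clock::now();
  std::chrono::seconds grace{15};     // unhealthy threshold (assumed value)
  std::chrono::seconds suicide{150};  // abort threshold, matches the log
};

void touch(HeartbeatHandle& h) { h.last_touch = Clock::now(); }

// A periodic health check compares idle time against both thresholds.
bool is_healthy(const HeartbeatHandle& h) {
  const auto idle = Clock::now() - h.last_touch;
  if (idle > h.suicide) {
    // The thread has been stuck far too long: kill the whole daemon so the
    // cluster can fail over, rather than keep a wedged OSD around.
    std::cerr << "heartbeat_map is_healthy '" << h.name
              << "' hit suicide timeout\n";
    std::abort();  // stands in for ceph_abort_msg("hit suicide timeout")
  }
  return idle <= h.grace;
}

int main() {
  HeartbeatHandle h;
  h.name = "OSD::osd_op_tp thread 0x7f27dba47640";
  touch(h);  // the worker reports progress, so the check passes
  std::cout << (is_healthy(h) ? "healthy\n" : "unhealthy\n");
}

The 150s in the logs corresponds to the osd_op_thread_suicide_timeout option, so an op thread that stays stuck that long takes the whole OSD down by design.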

The election and op slowness, as well as the OSD terminations, could be down to a network hit, or alternatively to resource starvation that prevented the cluster services from communicating with each other in time.

Would it be possible that something like this happened?

-- sabaini, Aug 13 '24 17:08

Yes, that could be. Since then I've upgraded the RAM on that machine and haven't noticed any more OSD outages.

-- slapcat, Aug 13 '24 17:08

Thanks, I'll close this then -- feel free to reopen if this pops up again

-- sabaini, Aug 13 '24 20:08