microceph icon indicating copy to clipboard operation
microceph copied to clipboard

Mon out of Quorum Recovery

Open jonnypeace opened this issue 2 years ago • 9 comments

I am having a little bit trouble with a mon being out of quorum. I have a cluster of 3, 2 are still in quorum and i'm unsure how to recover.

I had a look at the ceph documentation and I was going to try removing and adding a mon, but it first requires this...

ceph-mon -i `hostname` --extract-monmap /tmp/monmap

I'm not sure what the equivalent would be for microceph? I tried ceph-mon without the dash in the middle, but the command appears to hang (following ceph documentation, you are required to systemctl stop ceph-mon.target on each node. I tried...

sudo systemctl stop snap.microceph.mon.service

But i'm not sure if this is equivalent to ceph documentation.

health status....

      cluster:
        id:     a832124c-ee14-4a2d-b6d9-2a1b8326a560
        health: HEALTH_WARN
                1/3 mons down, quorum uby2,uby1
                Degraded data redundancy: 8994/26982 objects degraded (33.333%), 16 pgs degraded, 16 pgs undersized
                12 pgs not deep-scrubbed in time
                16 pgs not scrubbed in time

      services:
        mon: 3 daemons, quorum uby2,uby1 (age 15s), out of quorum: uby3
        mgr: uby1(active, since 25h), standbys: uby2
        osd: 3 osds: 2 up (since 25h), 2 in (since 10d)

      data:
        pools:   1 pools, 16 pgs
        objects: 8.99k objects, 27 GiB
        usage:   46 GiB used, 431 GiB / 477 GiB avail
        pgs:     8994/26982 objects degraded (33.333%)
                 16 active+undersized+degraded

      io:
        client:   0 B/s rd, 6.0 KiB/s wr, 0 op/s rd, 0 op/s wr

I'm not sure if this is an issue, or if my incompetence, but couldn't find documentation for this (i might be looking in the wrong place...)

It's been like this for 24hrs at least, but I've not been checking this cluster for a couple of weeks. When I check the logs, nothing stood out to me, except the last entry being 27th September, and i couldn't find bus errors, or corruption errors.

The last 20 entries...

2023-09-27T23:14:42.441+0100 7f24051ec640  4 rocksdb: [db/compaction/compaction_job.cc:1594] [default] [JOB 12248] Compacted 1@0 + 1@6 files to L6 => 28530906 bytes
2023-09-27T23:14:42.445+0100 7f24051ec640  4 rocksdb: (Original Log Time 2023/09/27-23:14:42.448932) [db/compaction/compaction_job.cc:812] [default] compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 files[0 0 0 0 0 0 1] max score 0.00, MB/sec: 109.9 rd, 108.3 wr, level 6, files in(1, 1) out(1) MB in(0.7, 26.9) out(27.2), read-write-amplify(80.4) write-amplify(39.9) OK, records in: 24363, records dropped: 724 output_compression: NoCompression

2023-09-27T23:14:42.445+0100 7f24051ec640  4 rocksdb: (Original Log Time 2023/09/27-23:14:42.448999) EVENT_LOG_v1 {"time_micros": 1695852882448972, "job": 12248, "event": "compaction_finished", "compaction_time_micros": 263337, "compaction_time_cpu_micros": 147058, "output_level": 6, "num_output_files": 1, "total_output_size": 28530906, "num_input_records": 24363, "num_output_records": 23639, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 1]}
2023-09-27T23:14:42.445+0100 7f24051ec640  4 rocksdb: [file/delete_scheduler.cc:69] Deleted file /var/lib/ceph/mon/ceph-uby3/store.db/605896.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000
2023-09-27T23:14:42.445+0100 7f24051ec640  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1695852882449791, "job": 12248, "event": "table_file_deletion", "file_number": 605896}
2023-09-27T23:14:42.461+0100 7f24051ec640  4 rocksdb: [file/delete_scheduler.cc:69] Deleted file /var/lib/ceph/mon/ceph-uby3/store.db/605894.sst immediately, rate_bytes_per_sec 0, total_trash_size 0 max_trash_db_ratio 0.250000
2023-09-27T23:14:42.461+0100 7f24051ec640  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1695852882465134, "job": 12248, "event": "table_file_deletion", "file_number": 605894}
2023-09-27T23:14:42.461+0100 7f23fb9d9640  4 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:1615] [default] Manual compaction starting
2023-09-27T23:14:42.461+0100 7f23fb9d9640  4 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:1615] [default] Manual compaction starting
2023-09-27T23:14:42.461+0100 7f23fb9d9640  4 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:1615] [default] Manual compaction starting
2023-09-27T23:14:42.461+0100 7f23fb9d9640  4 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:1615] [default] Manual compaction starting
2023-09-27T23:14:42.461+0100 7f23fb9d9640  4 rocksdb: [db/db_impl/db_impl_compaction_flush.cc:1615] [default] Manual compaction starting
2023-09-27T23:15:33.170+0100 7f24059ed640 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
2023-09-27T23:15:33.182+0100 7f24059ed640 -1 mon.uby3@2(peon) e4 *** Got Signal Terminated ***
2023-09-27T23:15:33.182+0100 7f24059ed640  1 mon.uby3@2(peon) e4 shutdown
2023-09-27T23:15:33.410+0100 7f2406c31d40  1 rocksdb: close waiting for compaction thread to stop
2023-09-27T23:15:33.418+0100 7f2406c31d40  1 rocksdb: close compaction thread to stopped
2023-09-27T23:15:33.434+0100 7f2406c31d40  4 rocksdb: [db/db_impl/db_impl.cc:446] Shutdown: canceling all background work
2023-09-27T23:15:33.486+0100 7f2406c31d40  4 rocksdb: [db/db_impl/db_impl.cc:625] Shutdown complete

i've grep'd the last lines to see if they are unusual entries and they don't seem to be, so i'm guessing i might have rebooted the server on the 27th before going to bed, or something like this and it fell over.

Systemctl status looks like this...

× snap.microceph.daemon.service - Service for snap application microceph.daemon
     Loaded: loaded (/etc/systemd/system/snap.microceph.daemon.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sun 2023-10-08 19:29:40 BST; 2h 7min ago
    Process: 3873 ExecStart=/usr/bin/snap run microceph.daemon (code=exited, status=1/FAILURE)
   Main PID: 3873 (code=exited, status=1/FAILURE)
        CPU: 87ms

Oct 08 19:29:40 uby3 systemd[1]: snap.microceph.daemon.service: Scheduled restart job, restart counter is at 6.
Oct 08 19:29:40 uby3 systemd[1]: Stopped Service for snap application microceph.daemon.
Oct 08 19:29:40 uby3 systemd[1]: snap.microceph.daemon.service: Start request repeated too quickly.
Oct 08 19:29:40 uby3 systemd[1]: snap.microceph.daemon.service: Failed with result 'exit-code'.
Oct 08 19:29:40 uby3 systemd[1]: Failed to start Service for snap application microceph.daemon.

This is the mon_status... I'm not sure if i should be concerned, but i am unable to run this command on the other ~~2 nodes~~ node without a socket warning, no file present...

admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

mon_status....

{
    "name": "uby1",
    "rank": 1,
    "state": "peon",
    "election_epoch": 690,
    "quorum": [
        0,
        1
    ],
    "quorum_age": 787,
    "features": {
        "required_con": "2449958755906961412",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging",
            "quincy"
        ],
        "quorum_con": "4540138320759226367",
        "quorum_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging",
            "quincy"
        ]
    },
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 4,
        "fsid": "a832124c-ee14-4a2d-b6d9-2a1b8326a560",
        "modified": "2023-01-06T17:08:23.097085Z",
        "created": "2023-01-06T17:06:46.508041Z",
        "min_mon_release": 17,
        "min_mon_release_name": "quincy",
        "election_strategy": 1,
        "disallowed_leaders: ": "",
        "stretch_mode": false,
        "tiebreaker_mon": "",
        "removed_ranks: ": "",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus",
                "octopus",
                "pacific",
                "elector-pinging",
                "quincy"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "uby2",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.101.42:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.101.42:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.101.42:6789/0",
                "public_addr": "10.10.101.42:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 1,
                "name": "uby1",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.101.41:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.101.41:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.101.41:6789/0",
                "public_addr": "10.10.101.41:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 2,
                "name": "uby3",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.101.43:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.101.43:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.101.43:6789/0",
                "public_addr": "10.10.101.43:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 1
            }
        ],
        "osd": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 1
            }
        ],
        "client": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 3
            }
        ],
        "mgr": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 1
            }
        ]
    },
    "stretch_mode": false
}

Update: just realised my second node is responding to mon_status, apologies...

{
    "name": "uby2",
    "rank": 0,
    "state": "leader",
    "election_epoch": 690,
    "quorum": [
        0,
        1
    ],
    "quorum_age": 1439,
    "features": {
        "required_con": "2449958755906961412",
        "required_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging",
            "quincy"
        ],
        "quorum_con": "4540138320759226367",
        "quorum_mon": [
            "kraken",
            "luminous",
            "mimic",
            "osdmap-prune",
            "nautilus",
            "octopus",
            "pacific",
            "elector-pinging",
            "quincy"
        ]
    },
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 4,
        "fsid": "a832124c-ee14-4a2d-b6d9-2a1b8326a560",
        "modified": "2023-01-06T17:08:23.097085Z",
        "created": "2023-01-06T17:06:46.508041Z",
        "min_mon_release": 17,
        "min_mon_release_name": "quincy",
        "election_strategy": 1,
        "disallowed_leaders: ": "",
        "stretch_mode": false,
        "tiebreaker_mon": "",
        "removed_ranks: ": "",
        "features": {
            "persistent": [
                "kraken",
                "luminous",
                "mimic",
                "osdmap-prune",
                "nautilus",
                "octopus",
                "pacific",
                "elector-pinging",
                "quincy"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "uby2",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.101.42:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.101.42:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.101.42:6789/0",
                "public_addr": "10.10.101.42:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 1,
                "name": "uby1",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.101.41:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.101.41:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.101.41:6789/0",
                "public_addr": "10.10.101.41:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            },
            {
                "rank": 2,
                "name": "uby3",
                "public_addrs": {
                    "addrvec": [
                        {
                            "type": "v2",
                            "addr": "10.10.101.43:3300",
                            "nonce": 0
                        },
                        {
                            "type": "v1",
                            "addr": "10.10.101.43:6789",
                            "nonce": 0
                        }
                    ]
                },
                "addr": "10.10.101.43:6789/0",
                "public_addr": "10.10.101.43:6789/0",
                "priority": 0,
                "weight": 0,
                "crush_location": "{}"
            }
        ]
    },
    "feature_map": {
        "mon": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 1
            }
        ],
        "mds": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 2
            }
        ],
        "osd": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 1
            }
        ],
        "client": [
            {
                "features": "0x2f018fb87aa4aafe",
                "release": "luminous",
                "num": 2
            },
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 1
            }
        ],
        "mgr": [
            {
                "features": "0x3f01cfbf7ffdffff",
                "release": "luminous",
                "num": 1
            }
        ]
    },
    "stretch_mode": false
}

jonnypeace avatar Oct 08 '23 20:10 jonnypeace

Thanks for the detailed issue report @jonnypeace, I will follow through this and update here.

UtkarshBhatthere avatar Oct 16 '23 11:10 UtkarshBhatthere

Thanks, I'm no longer using the cluster. I tried repairing it, but seemed to make things worse the more I did. This cascaded into the LXD cluster itself and decided it wasn't worth more of my time to repair. I just transferred all the containers to other lxd machines and moved on. I should still have access to logs with the node that initially fell over if needed.

jonnypeace avatar Oct 16 '23 13:10 jonnypeace

I have the exact same issue. The 3 nodes cluster was working fine for 2 weeks, then one mon left the quorum.

Checking syslog, I found the following logs for each attempt to start mon:

Dec 12 16:01:50 ns3209908 systemd[1]: Started Service for snap application microceph.mon. Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: warning: unable to create /var/snap/microceph/798/run No such file or directory Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error' Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: what(): filesystem error: cannot set permissions: No such file or directory [/var/snap/microceph/798/run] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: *** Caught signal (Aborted) ** Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: in thread 7fc1d24e8e80 thread_name:ceph-mon Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fc1d2c37520] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 2: pthread_kill() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 3: raise() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 4: abort() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7fc1d2fc8b9e] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7fc1d2fd420c] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7fc1d2fd4277] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8) [0x7fc1d2fd44d8] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa6ff4) [0x7fc1d2fccff4] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 10: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0x12ac) [0x55713d4edfec] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 11: main() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc1d2c1ed90] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 13: __libc_start_main() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 14: _start() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 2023-12-12T16:01:50.116+0100 7fc1d24e8e80 -1 *** Caught signal (Aborted) ** Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: in thread 7fc1d24e8e80 thread_name:ceph-mon Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fc1d2c37520] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 2: pthread_kill() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 3: raise() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 4: abort() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7fc1d2fc8b9e] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7fc1d2fc8b9e] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7fc1d2fd420c] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7fc1d2fd4277] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8) [0x7fc1d2fd44d8] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa6ff4) [0x7fc1d2fccff4] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 10: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0x12ac) [0x55713d4edfec] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 11: main() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc1d2c1ed90] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 13: __libc_start_main() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 14: _start() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this. Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: -26> 2023-12-12T16:01:50.116+0100 7fc1d24e8e80 -1 *** Caught signal (Aborted) ** Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: in thread 7fc1d24e8e80 thread_name:ceph-mon Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fc1d2c37520] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 2: pthread_kill() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 3: raise() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 4: abort() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7fc1d2fc8b9e] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7fc1d2fd420c] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7fc1d2fd4277] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8) [0x7fc1d2fd44d8] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa6ff4) [0x7fc1d2fccff4] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 10: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0x12ac) [0x55713d4edfec] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 11: main() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc1d2c1ed90] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 13: __libc_start_main() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 14: _start() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this. Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: -26> 2023-12-12T16:01:50.116+0100 7fc1d24e8e80 -1 *** Caught signal (Aborted) ** Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: in thread 7fc1d24e8e80 thread_name:ceph-mon Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fc1d2c37520] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 2: pthread_kill() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 3: raise() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 4: abort() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7fc1d2fc8b9e] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7fc1d2fd420c] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7fc1d2fd4277] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8) [0x7fc1d2fd44d8] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa6ff4) [0x7fc1d2fccff4] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 10: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0x12ac) [0x55713d4edfec] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 11: main() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc1d2c1ed90] Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 13: __libc_start_main() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: 14: _start() Dec 12 16:01:50 ns3209908 microceph.mon[3000517]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this. Dec 12 16:01:50 ns3209908 systemd[1]: snap.microceph.mon.service: Main process exited, code=dumped, status=6/ABRT Dec 12 16:01:50 ns3209908 systemd[1]: snap.microceph.mon.service: Failed with result 'core-dump'. Dec 12 16:01:50 ns3209908 systemd[1]: snap.microceph.mon.service: Scheduled restart job, restart counter is at 2. Dec 12 16:01:50 ns3209908 systemd[1]: Stopped Service for snap application microceph.mon. Dec 12 16:01:50 ns3209908 systemd[1]: Started Service for snap application microceph.mon. Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: warning: unable to create /var/snap/microceph/798/runNo such file or directory Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error' Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: what(): filesystem error: cannot set permissions: No such file or directory [/var/snap/microceph/798/run] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: *** Caught signal (Aborted) ** Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: in thread 7fc1f9ae3e80 thread_name:ceph-mon Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fc1fa232520] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 2: pthread_kill() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 3: raise() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 4: abort() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7fc1fa5c3b9e] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7fc1fa5cf20c] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7fc1fa5cf277] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8) [0x7fc1fa5cf4d8] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa6ff4) [0x7fc1fa5c7ff4] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 10: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0x12ac) [0x555d0df96fec] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 11: main() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc1fa219d90] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 13: __libc_start_main() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 14: _start() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 2023-12-12T16:01:50.616+0100 7fc1f9ae3e80 -1 *** Caught signal (Aborted) ** Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: in thread 7fc1f9ae3e80 thread_name:ceph-mon Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fc1fa232520] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 2: pthread_kill() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 3: raise() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 4: abort() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7fc1fa5c3b9e] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7fc1fa5cf20c] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7fc1fa5cf277] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8) [0x7fc1fa5cf4d8] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa6ff4) [0x7fc1fa5c7ff4] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 10: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0x12ac) [0x555d0df96fec] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 11: main() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc1fa219d90] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 13: __libc_start_main() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 14: _start() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this. Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: -26> 2023-12-12T16:01:50.616+0100 7fc1f9ae3e80 -1 *** Caught signal (Aborted) ** Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: in thread 7fc1f9ae3e80 thread_name:ceph-mon Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fc1fa232520] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 2: pthread_kill() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 3: raise() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 4: abort() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7fc1fa5c3b9e] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7fc1fa5cf20c] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7fc1fa5cf277] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8) [0x7fc1fa5cf4d8] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa6ff4) [0x7fc1fa5c7ff4] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 10: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0x12ac) [0x555d0df96fec] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 11: main() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc1fa219d90] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 13: __libc_start_main() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 14: _start() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this. Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: -26> 2023-12-12T16:01:50.616+0100 7fc1f9ae3e80 -1 *** Caught signal (Aborted) ** Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: in thread 7fc1f9ae3e80 thread_name:ceph-mon Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fc1fa232520] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 2: pthread_kill() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 3: raise() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 4: abort() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7fc1fa5c3b9e] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7fc1fa5cf20c] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7fc1fa5cf277] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8) [0x7fc1fa5cf4d8] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa6ff4) [0x7fc1fa5c7ff4] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 10: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0x12ac) [0x555d0df96fec] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 11: main() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc1fa219d90] Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 13: __libc_start_main() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: 14: _start() Dec 12 16:01:50 ns3209908 microceph.mon[3000580]: NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this. Dec 12 16:01:50 ns3209908 systemd[1]: snap.microceph.mon.service: Main process exited, code=dumped, status=6/ABRT Dec 12 16:01:50 ns3209908 systemd[1]: snap.microceph.mon.service: Failed with result 'core-dump'. Dec 12 16:01:50 ns3209908 systemd[1]: snap.microceph.mon.service: Scheduled restart job, restart counter is at 3. Dec 12 16:01:50 ns3209908 systemd[1]: Stopped Service for snap application microceph.mon. Dec 12 16:01:50 ns3209908 systemd[1]: Started Service for snap application microceph.mon. Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: warning: unable to create /var/snap/microceph/798/run No such file or directory Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error' Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: what(): filesystem error: cannot set permissions: No such file or directory [/var/snap/microceph/798/run] Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: *** Caught signal (Aborted) ** Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: in thread 7f0cb5df8e80 thread_name:ceph-mon Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f0cb6547520] Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: 2: pthread_kill() Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: 3: raise() Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: 4: abort() Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7f0cb68d8b9e] Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f0cb68e420c] Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f0cb68e4277] Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8) [0x7f0cb68e44d8] Dec 12 16:01:51 ns3209908 microceph.mon[3000646]: 9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa6ff4) [0x7f0cb68dcff4] ...

It turns out there was no /var/snap/microceph/798 on the server, only 822 and 827. I fixed the issue by creating the missing dir with "mkdir -p /var/snap/microceph/798/run" before issuing "snap restart microceph". The cluster gracefully recovered.

The root cause seems to be that one node upgraded the microceph snap to Reef while the two in quorum were in Quincy. There was possibly an issue during the upgrade.

MichaelPetre avatar Dec 12 '23 16:12 MichaelPetre

@MichaelPetre thank you very much for sharing the problem and the workaround. 1 question: Are you consuming MicroCeph via juju ? or directly via the snap ? refreshing the snaps is a manual process and is expected to be run for all the MicroCeph nodes.

UtkarshBhatthere avatar Dec 13 '23 10:12 UtkarshBhatthere

@UtkarshBhatthere I'm consuming it via the snap. I have the nagging suspicion that an idiot (read: me) forgot to type "snap refresh --hold microceph" on that node. I manually upgraded the whole cluster to Reef, made sure the microceph snap was held on all nodes and it's working just fine now.

MichaelPetre avatar Dec 14 '23 18:12 MichaelPetre

I'll add my experience this morning, refreshing the microceph snap from 707 to 793. I encountered the error because one of the nodes had its "run dir" set to "/var/snap/microceph/550/run", which had been the previous rev. I believe this is what the OP encountered as well. The issue will not crop up by default until the second refresh, since snap keeps one previous rev by default. My OSD was using "550" as the path, and thus survived the refresh from 550->707, but not 707->793.

Jan 11 18:04:09 g3-3 systemd[1]: Reloading.
Jan 11 18:04:09 g3-3 systemd[1]: Mounting Mount unit for microceph, revision 793...
Jan 11 18:04:09 g3-3 kernel: [  474.926820] loop12: detected capacity change from 0 to 187536
Jan 11 18:04:09 g3-3 systemd[1]: Mounted Mount unit for microceph, revision 793.
Jan 11 18:04:10 g3-3 systemd[1]: Stopping Service for snap application microceph.osd...
Jan 11 18:04:10 g3-3 kernel: [  475.168559] libceph: osd2 (1)10.1.2.37:6801 socket closed (con state OPEN)
Jan 11 18:04:10 g3-3 kernel: [  475.198689] libceph: osd2 (1)10.1.2.37:6801 socket closed (con state V1_BANNER)
Jan 11 18:04:10 g3-3 systemd[1]: snap.microceph.osd.service: Deactivated successfully.
Jan 11 18:04:10 g3-3 systemd[1]: Stopped Service for snap application microceph.osd.
Jan 11 18:04:10 g3-3 systemd[1]: snap.microceph.osd.service: Consumed 46.513s CPU time.
Jan 11 18:04:10 g3-3 snapd[816]: services.go:1103: RemoveSnapServices - disabling snap.microceph.mgr.service
Jan 11 18:04:10 g3-3 snapd[816]: services.go:1103: RemoveSnapServices - disabling snap.microceph.daemon.service
Jan 11 18:04:10 g3-3 snapd[816]: services.go:1103: RemoveSnapServices - disabling snap.microceph.mon.service
Jan 11 18:04:10 g3-3 snapd[816]: services.go:1103: RemoveSnapServices - disabling snap.microceph.osd.service
Jan 11 18:04:10 g3-3 snapd[816]: services.go:1103: RemoveSnapServices - disabling snap.microceph.rgw.service
Jan 11 18:04:10 g3-3 snapd[816]: services.go:1103: RemoveSnapServices - disabling snap.microceph.mds.service
Jan 11 18:04:10 g3-3 systemd[1]: Reloading.
Jan 11 18:04:10 g3-3 kernel: [  475.447001] libceph: osd2 (1)10.1.2.37:6801 socket error on write
Jan 11 18:04:10 g3-3 kernel: [  475.950782] libceph: osd2 down
Jan 11 18:04:12 g3-3 kernel: [  477.524567] audit: type=1400 audit(1704996252.501:193): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/snapd/20671/usr/lib/snapd/snap-confine" pid>
Jan 11 18:04:12 g3-3 kernel: [  477.542998] audit: type=1400 audit(1704996252.521:194): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/snapd/20671/usr/lib/snapd/snap-confine//mou>
Jan 11 18:04:12 g3-3 kernel: [  477.596306] audit: type=1400 audit(1704996252.573:195): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap-update-ns.microceph" pid=12064 comm="apparmo>
Jan 11 18:04:12 g3-3 kernel: [  477.708150] audit: type=1400 audit(1704996252.685:196): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.ceph" pid=12065 comm="apparmor_par>
Jan 11 18:04:12 g3-3 kernel: [  477.760814] audit: type=1400 audit(1704996252.737:197): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.daemon" pid=12066 comm="apparmor_p>
Jan 11 18:04:12 g3-3 kernel: [  477.822346] audit: type=1400 audit(1704996252.797:198): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.hook.install" pid=12067 comm="appa>
Jan 11 18:04:12 g3-3 kernel: [  477.874606] audit: type=1400 audit(1704996252.849:199): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.hook.post-refresh" pid=12068 comm=>
Jan 11 18:04:12 g3-3 kernel: [  477.968984] audit: type=1400 audit(1704996252.945:200): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.mds" pid=12069 comm="apparmor_pars>
Jan 11 18:04:13 g3-3 kernel: [  478.024280] audit: type=1400 audit(1704996253.001:201): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.mgr" pid=12070 comm="apparmor_pars>
Jan 11 18:04:13 g3-3 kernel: [  478.128320] audit: type=1400 audit(1704996253.105:202): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.microceph" pid=12073 comm="apparmo>
Jan 11 18:04:13 g3-3 systemd[1]: Reloading.
Jan 11 18:04:14 g3-3 systemd[1]: Started snap.microceph.hook.post-refresh-a323fa72-f97d-4d57-8334-e6c1e9ff12f0.scope.
Jan 11 18:04:14 g3-3 systemd[1]: snap.microceph.hook.post-refresh-a323fa72-f97d-4d57-8334-e6c1e9ff12f0.scope: Deactivated successfully.
Jan 11 18:04:14 g3-3 systemd[1]: Reloading.
Jan 11 18:04:14 g3-3 systemd[1]: Started Service for snap application microceph.daemon.
Jan 11 18:04:14 g3-3 systemd[1]: Started Service for snap application microceph.osd.
Jan 11 18:04:14 g3-3 systemd[1]: snap-microceph-550.mount: Deactivated successfully.
Jan 11 18:04:14 g3-3 systemd[1]: Reloading.
Jan 11 18:04:15 g3-3 snapd[816]: storehelpers.go:791: cannot refresh snap "microceph": snap has no updates available
Jan 11 18:04:17 g3-3 microceph.daemon[12180]: time="2024-01-11T18:04:17Z" level=info msg="Daemon stopped"
Jan 11 18:04:17 g3-3 microceph.daemon[12180]: Error: Unable to start daemon: Daemon failed to start: Failed to initialize trust store: Failed to parse local record "". Found empty certificate
Jan 11 18:04:17 g3-3 systemd[1]: snap.microceph.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jan 11 18:04:17 g3-3 systemd[1]: snap.microceph.daemon.service: Failed with result 'exit-code'.
Jan 11 18:04:17 g3-3 systemd[1]: snap.microceph.daemon.service: Consumed 1.522s CPU time.
Jan 11 18:04:17 g3-3 systemd[1]: snap.microceph.daemon.service: Scheduled restart job, restart counter is at 1.
Jan 11 18:04:17 g3-3 systemd[1]: Stopped Service for snap application microceph.daemon.
Jan 11 18:04:17 g3-3 systemd[1]: snap.microceph.daemon.service: Consumed 1.522s CPU time.
Jan 11 18:04:17 g3-3 systemd[1]: Started Service for snap application microceph.daemon.
Jan 11 18:04:17 g3-3 kernel: [  482.949193] kauditd_printk_skb: 17 callbacks suppressed
Jan 11 18:04:17 g3-3 kernel: [  482.949195] audit: type=1400 audit(1704996257.925:220): apparmor="DENIED" operation="capable" profile="/snap/snapd/20671/usr/lib/snapd/snap-confine" pid=12639 comm="snap-confine">
Jan 11 18:04:17 g3-3 microceph.osd[12234]: warning: unable to create /var/snap/microceph/550/runNo such file or directory
Jan 11 18:04:17 g3-3 microceph.daemon[12639]: time="2024-01-11T18:04:17Z" level=info msg="Daemon stopped"
Jan 11 18:04:17 g3-3 microceph.daemon[12639]: Error: Unable to start daemon: Daemon failed to start: Failed to initialize trust store: Failed to parse local record "". Found empty certificate
Jan 11 18:04:17 g3-3 microceph.osd[12234]: terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
Jan 11 18:04:17 g3-3 microceph.osd[12234]:   what():  filesystem error: cannot set permissions: No such file or directory [/var/snap/microceph/550/run]
Jan 11 18:04:17 g3-3 microceph.osd[12234]: *** Caught signal (Aborted) **
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  in thread 7f6baddb6800 thread_name:ceph-osd
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f6bae021520]
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  2: pthread_kill()
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  3: raise()
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  4: abort()
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e) [0x7f6bae3b0b9e]
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f6bae3bc20c]
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f6bae3bc277]
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8) [0x7f6bae3bc4d8]
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa6ff4) [0x7f6bae3b4ff4]
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  10: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, st>
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  11: main()
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  12: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6bae008d90]
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  13: __libc_start_main()
Jan 11 18:04:17 g3-3 microceph.osd[12234]:  14: _start()

The issue appears to be that "run dir" was set to a specific rev, rather than "/var/snap/microceph/current/run", though maybe there are good reasons not to use the "current" path? Please advise if that is something I will need to change manually, as otherwise I expected the snap refresh process to update whatever it needed for a proper transition.

While holding snaps to maintain stability in a cluster is wise, I don't think that contributed to the issue. I had the snaps held in a cohort and was manually upgrading.

tjwsl avatar Jan 11 '24 18:01 tjwsl

I was just bitten by this again today while refreshing from rev 862 to 948. This time, /var/snap/microceph/current/conf/ceph.conf still had "run dir = /var/snap/microceph/793/run" from the previous episode on all four cluster nodes. I suppose the issue went undetected from 793 to 862 because the 793 directory was preserved at the time.

Should that value be referencing specific revs, or should it be set to "current" to follow the symlink?

tjwsl avatar Mar 13 '24 18:03 tjwsl

Thanks @tjwsl , it looks like ceph.conf didn't get regenerated as it should have been. Would you be able to restart microceph services sudo snap restart microceph and send log entries pertaining to microceph, something like fgrep microceph /var/log/syslog? Even if you can't restart microceph the log entries would be interesting. Thanks again.

sabaini avatar Mar 13 '24 20:03 sabaini

I ran into this issue today because I was intentionally attempting to change a mon IP. By using the snap shell, I was able to get and update the mon map, but was not able to re-inject it:

root@mceph-1:~# snap run --shell microceph.mon
root@mceph-1:/root# ceph mon getmap -o /tmp/monmap
got monmap epoch 4
root@mceph-1:/root# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 4
fsid 1ae29786-95a3-418e-87f1-c50d67bf7701
last_changed 2024-07-07T16:27:34.017100+0000
created 2024-07-07T16:26:17.976436+0000
min_mon_release 17 (quincy)
election_strategy: 1
0: [v2:10.222.55.143:3300/0,v1:10.222.55.143:6789/0] mon.mceph-1
1: [v2:10.222.55.183:3300/0,v1:10.222.55.183:6789/0] mon.mceph-2
2: [v2:10.222.55.48:3300/0,v1:10.222.55.48:6789/0] mon.mceph-3
root@mceph-1:/root# monmaptool --rm mceph-3 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: removing mceph-3
monmaptool: writing epoch 4 to /tmp/monmap (2 monitors)
root@mceph-1:/root# monmaptool --addv mceph-3 [v2:10.10.0.50:3300/0,v1:10.10.0.50:6789/0] /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: writing epoch 4 to /tmp/monmap (3 monitors)
root@mceph-1:/root# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 4
fsid 1ae29786-95a3-418e-87f1-c50d67bf7701
last_changed 2024-07-07T16:27:34.017100+0000
created 2024-07-07T16:26:17.976436+0000
min_mon_release 17 (quincy)
election_strategy: 1
0: [v2:10.222.55.143:3300/0,v1:10.222.55.143:6789/0] mon.mceph-1
1: [v2:10.222.55.183:3300/0,v1:10.222.55.183:6789/0] mon.mceph-2
2: [v2:10.10.0.50:3300/0,v1:10.10.0.50:6789/0] mon.mceph-3
root@mceph-1:/root# ceph-mon -i mceph-1 --inject-monmap /tmp/monmap
2024-07-07T16:37:16.559+0000 7fdee6a06d40 -1 rocksdb: IO error: While lock file: /var/lib/ceph/mon/ceph-mceph-1/store.db/LOCK: Resource temporarily unavailable
2024-07-07T16:37:16.559+0000 7fdee6a06d40 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-mceph-1': (22) Invalid argument

It would be nice if we could find a way to inject it so that manual monmap edits are possible with microceph.

As a workaround, I used the microceph tooling to remove and readd the node. This can be a huge pain though if the node has OSDs with any significant amount of data. Also, if you have only 3 mons, you'll need to deploy a temporary 4th mon while removing and readding the problem node so that you always have a minimum of 3.

slapcat avatar Jul 07 '24 16:07 slapcat