Nico Schottelius
I now waited for the 3 prepare jobs to crash (all ending in Error state) and I collected the last lines:

server119:
```
[2024-11-19 14:36:52,488][ceph_volume.process][INFO ] stdout DEVLINKS=/dev/disk/by-diskseq/31 /dev/disk/by-id/scsi-36001e6750ff0b0002ecdde9801934a36 /dev/disk/by-id/wwn-0x6001e6750ff0b0002ecdde9801934a36...
```
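For reference, roughly how I collect those last lines; a sketch that assumes the default `rook-ceph` namespace and the usual `app=rook-ceph-osd-prepare` label on the prepare pods:

```
# list the prepare pods, then dump the tail of every one that ended up Failed
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare

for pod in $(kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare \
        --field-selector=status.phase=Failed -o name); do
    echo "=== $pod ==="
    kubectl -n rook-ceph logs --tail=50 "$pod"
done
```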
The operator itself seems to aggregate the logs, so we can see them nicely in sequence:

```
2024-11-19 14:31:32.114982 I | op-osd: OSD orchestration status for node server116 is...
```
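In case it helps others, a sketch of how to follow that aggregated view, assuming the operator runs as the `rook-ceph-operator` deployment in the `rook-ceph` namespace:

```
# follow the operator log and keep only the OSD orchestration lines
kubectl -n rook-ceph logs deploy/rook-ceph-operator -f | grep 'op-osd'
```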
Just found https://github.com/rook/rook/issues/5835 with a similar issue, but no follow-up.
The way I solve it at the moment (see the sketch after this list):

- cordon all nodes besides the one with new disks
- wait for the deployment of the osd to fail (log below)...
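A minimal sketch of the cordon step; the node names are only examples, the point is that everything except the host with the new disks stays schedulable:

```
# keep only the host with the new disks schedulable
for node in server116 server117 server118; do
    kubectl cordon "$node"
done

# ... let the single prepare job run, collect its log when it fails ...

# afterwards, make the other nodes schedulable again
for node in server116 server117 server118; do
    kubectl uncordon "$node"
done
```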
I found osd_id=3 on at least 3 OSDs, probably more. The following outputs are all from different hosts:

```
"e8ba2b6a-9f9e-4892-9412-34b79e5d971a": {
    "ceph_fsid": "bd3061a0-ecf3-4af6-9017-51b63c90b526",
    "device": "/dev/sdf",
    "osd_id": 3,
    "osd_uuid": "e8ba2b6a-9f9e-4892-9412-34b79e5d971a",
    "type":...
```
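The output above looks like `ceph-volume raw list`; assuming that is where it comes from, this is roughly how to compare the osd_id across hosts (hostnames are examples, and it needs root plus ceph-volume on each host):

```
# print device -> osd_id per host so duplicate ids stand out
for host in server116 server117 server118 server119; do
    echo "=== $host ==="
    ssh "$host" ceph-volume raw list 2>/dev/null \
        | jq -r 'to_entries[] | "\(.value.device) osd_id=\(.value.osd_id)"'
done
```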
To verify that a node has an incorrect osd_id set, I also did the following (see the sketch after this list):

- lsof on the device -> empty
- then overwrite it
- restart the operator
- wait for the prepare...
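Roughly what that sequence looks like for one device; the device path is an example, and I'm assuming the operator runs as the `rook-ceph-operator` deployment:

```
DEV=/dev/sdf

# 1. make sure nothing is holding the device open (expect empty output)
lsof "$DEV"

# 2. overwrite it (e.g. zero the start of the disk with dd)

# 3. restart the operator so it schedules the prepare jobs again
kubectl -n rook-ceph rollout restart deploy/rook-ceph-operator

# 4. wait for the new prepare pod and watch what it does
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare -w
```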
Coming back to this after we cleaned up the whole cluster: as far as I can see, rook's way of creating OSDs is prone to race conditions if multiple prepare...
@travisn Re purging: we kept the disk in, but ensured that it was not being touched (using `lsof`), then ran `dd ...` with 1 GiB of zeros over it. At the...
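Concretely, a sketch of that check plus wipe; `/dev/sdf` is only an example device, and this is obviously destructive, so triple-check the device name:

```
lsof /dev/sdf                                            # should print nothing
dd if=/dev/zero of=/dev/sdf bs=1M count=1024 conv=fsync  # 1 GiB of zeros
```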
@travisn But maybe to add: in the beginning, when this situation started, we added 9 new, fresh SSDs to 4 different hosts. We just noticed it failing with the error...
From my perspective, coming from native ceph towards rook: before, the operation was very manual; adding an OSD was a job done on a single host, and we have script...
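For comparison, a minimal sketch of that kind of single-host, manual workflow with plain ceph-volume (the device name is just an example, not exactly what our script does):

```
# on the host that has the new disk
ceph-volume lvm create --bluestore --data /dev/sdf

# then verify the new OSD shows up in the tree
ceph osd tree
```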