Cluster unavailable after node reboot, symlink already exists
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: I'm using Rook Ceph with specific devices, identified by IDs:
helm_cephrook_nodes_devices:
  - name: "vm-kube-slave-1"
    devices:
      - name: "/dev/disk/by-id/scsi-36000c29d381154d5114acf6c54b09ab5"
[.......]
The Linux disk letter (sdX) can change after a reboot, and this should not break the application.
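For context, a by-id path is just a persistent symlink maintained by udev that points at whatever kernel name (sdX) the disk currently has; checking it on the node might look like this (output is illustrative):

```console
$ readlink -f /dev/disk/by-id/scsi-36000c29d381154d5114acf6c54b09ab5
/dev/sdg
```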
When the OSD starts, the activate init container detects the correct new disk, but a symlink to the old one is already present:
found device: /dev/sdg
+ DEVICE=/dev/sdg
+ [[ -z /dev/sdg ]]
+ ceph-volume raw activate --device /dev/sdg --no-systemd --no-tmpfs
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-3
Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --path /var/lib/ceph/osd/ceph-3 --no-mon-config --dev /dev/sdg
Running command: /usr/bin/chown -R ceph:ceph /dev/sdg
Running command: /usr/bin/ln -s /dev/sdg /var/lib/ceph/osd/ceph-3/block
stderr: ln: failed to create symbolic link '/var/lib/ceph/osd/ceph-3/block': File exists
Traceback (most recent call last):
File "/usr/sbin/ceph-volume", line 11, in <module>
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
self.main(self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py", line 166, in main
systemd=not self.args.no_systemd)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py", line 88, in activate
systemd=systemd)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py", line 48, in activate_bluestore
prepare_utils.link_block(meta['device'], osd_id)
File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 371, in link_block
_link_device(block_device, 'block', osd_id)
File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 339, in _link_device
process.run(command)
File "/usr/lib/python3.6/site-packages/ceph_volume/process.py", line 147, in run
raise RuntimeError(msg)
RuntimeError: command returned non-zero exit status: 1
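For what it's worth, a quick way to see the mismatch from the failing pod might look like this (the by-id path is the one from the CR above; the stale target /dev/sdc is only an example):

```console
# which device does the existing block symlink point to? (stale name is an example)
$ readlink /var/lib/ceph/osd/ceph-3/block
/dev/sdc
# which device does the by-id path from the CR resolve to now?
$ readlink -f /dev/disk/by-id/scsi-36000c29d381154d5114acf6c54b09ab5
/dev/sdg
```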
Expected behavior: Rook Ceph should detect the correct disk when the node reboots; even if the sdX letter changes, the symlink should be recreated.
How to reproduce it (minimal and precise):
File(s) to submit:
- Cluster CR (custom resource), typically called cluster.yaml, if necessary
Logs to submit:
- Operator's logs, if necessary
- Crashing pod(s) logs, if necessary
To get logs, use kubectl -n <namespace> logs <pod name>
When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read GitHub documentation if you need help.
Cluster Status to submit:
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 21 pgs inactive; 570 slow ops, oldest one blocked for 125431 sec, daemons [osd.1,osd.2,osd.4] have slow ops.
sh-4.4$ ceph status
  cluster:
    id:     ecf8035e-5899-4327-9a70-b86daac1f642
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 21 pgs inactive
            570 slow ops, oldest one blocked for 125447 sec, daemons [osd.1,osd.2,osd.4] have slow ops.
  services:
    mon: 1 daemons, quorum a (age 3d)
    mgr: a(active, since 114m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 3 up (since 66m), 3 in (since 8h)
  data:
    volumes: 1/1 healthy
    pools:   3 pools, 49 pgs
    objects: 186 objects, 45 MiB
    usage:   347 MiB used, 150 GiB / 150 GiB avail
    pgs:     42.857% pgs unknown
             28 active+clean
             21 unknown
Environment:
- OS (e.g. from /etc/os-release): NAME="Red Hat Enterprise Linux" VERSION="8.6 (Ootpa)"
- Kernel (e.g. uname -a): Linux vm-kube-slave-6 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Mon Jul 18 11:14:02 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration:
- Rook version (use rook version inside of a Rook Pod): 1.9.7
- Storage backend version (e.g. for ceph do ceph -v): filesystem
- Kubernetes version (use kubectl version): 1.23
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): RKE
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
The drive letter may change due to disk replacement or re-plugging, etc., but the path id of the same disk will not change. Is it possible to use the path id instead of the drive letter when executing ceph commands? @satoru-takeuchi
Or maybe force the symlink, or remove the old one, if that is not possible in the ceph command?
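To illustrate the "force the symlink" idea, replacing a stale link could look like the following; this is only a sketch of the suggestion, not what ceph-volume currently does (the device and OSD path are taken from the activate log above):

```console
# -f replaces an existing destination, -n avoids following it if it is a symlink to a directory
ln -sfn /dev/sdg /var/lib/ceph/osd/ceph-3/block
```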
@microyahoo Although I don't recall the reason now, we should use the kernel name here. I'll investigate how to resolve/mitigate your issue.
@lerminou Thank you for your hint. I'll check whether your suggestion works. It might cause a kind of race.
I'm still investigating this issue. This problem might be in ceph...
In addition to finding the root cause, I'm trying to find a workaround.
Sorry for the delay, I didn't have enough time to work on this issue.
You can resolve this problem after encountering it as follows.
- Stop the operator pod by kubectl scale deploy rook-ceph-operator --replicas=0
- Stop the OSD pod by kubectl scale deploy rook-ceph-osd-<osd ID> --replicas=0
- Delete the symlink to the device file corresponding to the problematic OSD, in your case /var/lib/rook/rook-ceph/<osd id>/block
- Restart the OSD pod by kubectl scale deploy rook-ceph-osd-<osd ID> --replicas=1
- Restart the operator pod by kubectl scale deploy rook-ceph-operator --replicas=1
Then the new OSD pod will create the correct symlink. (The same steps are collected as a script sketch below.)
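A minimal script form of the workaround, assuming the placeholders are filled in for your cluster and the host path matches the one given in the steps:

```bash
NS="<namespace>"      # placeholder, e.g. the namespace Rook runs in
OSD_ID="<osd ID>"     # placeholder, the ID of the problematic OSD

kubectl -n "$NS" scale deploy rook-ceph-operator --replicas=0
kubectl -n "$NS" scale deploy "rook-ceph-osd-${OSD_ID}" --replicas=0

# run on the affected node: remove the stale symlink
rm "/var/lib/rook/rook-ceph/${OSD_ID}/block"

kubectl -n "$NS" scale deploy "rook-ceph-osd-${OSD_ID}" --replicas=1
kubectl -n "$NS" scale deploy rook-ceph-operator --replicas=1
```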
Hi @satoru-takeuchi, yes this is my current workaround, but the cluster is unavailable during the detection/fix window.
"Yes this is my current workaround,"
Great.
"but the cluster is unavailable during the detection/fix window"
Of course, I'm trying to create a PR to fix this problem.
The logic in which this bug exists is a bit complicated. Please wait for a while.
This problem was introduced by my commit.
@satoru-takeuchi Do you have more thoughts about how common this issue might be? Since your commit was a while ago, perhaps it is not a common case?
@travisn
I guess that it's not so common in small clusters and the probability gets higher in large clusters. This problem seems to happen if and only if the target of /var/lib/ceph/ceph-<n>/block
is a nonexistent block device file.
Here is an example with two scratch devices, B and C, bound to device files "sdb" and "sdc":
- Create an OSD on top of device C. Here "sdc" is specified in the CephCluster CR and ".../block" points to "sdc".
- Device B becomes unavailable for some reason (e.g. device failure or the disk is unplugged). The probability of this step depends on the scale of each cluster.
- A device name change happens because the number of devices was reduced. Device C is now bound to "sdb", but ".../block" still points to the missing "sdc".
I verified this problem actually happened in my test environment. In addition, I verified that this problem did not happen when the device names merely flipped (e.g. device B bound to "sdc" and device C bound to "sdb").
The key factor is that the number of devices is reduced and one of the ".../block" files becomes a dangling symlink (sketched below).
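As an illustration of the resulting state on the node, with hypothetical device names and the OSD directory from the log earlier in this issue (not taken from the actual cluster):

```console
# before the reboot: device C is sdc and the OSD's symlink points at it
$ readlink /var/lib/ceph/osd/ceph-3/block
/dev/sdc
# after device B disappears and the node reboots, device C comes back as sdb
$ ls /dev/sd?
/dev/sda  /dev/sdb
# /dev/sdc no longer exists, so the symlink is now dangling
$ test -e /var/lib/ceph/osd/ceph-3/block || echo dangling
dangling
```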
Although this problem might also exist for OSD on PVC, I haven't confirmed it yet.
My next actions are:
- Read the OSD-on-device code carefully and fix the problem, focusing on OSD on device.
- Look at the OSD-on-PVC case.
- Confirm whether this problem is specific to Rook.
- If not, submit an issue to Ceph.
Does my plan make sense?
Thanks for the explanation, sounds like a good plan. When ceph-volume creates the OSD, I thought ceph would start using a symlink with the path name instead of the original device name. I am forgetting the details, but my memory doesn't match what you are describing, so I don't trust my memory.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
I just hit this as well -- and I've seen it a few times in the past, just didn't find a solution or have time to try to track it down. thanks for the efforts to fix it!
@satoru-takeuchi How is the investigation on this issue? Thanks!
@travisn I'm testing #11567, which resolves this issue. There are several remaining tests; I'll finish this today.
It has taken a long time due to my lack of spare time and the many test cases.
Thanks a lot for the fix, I'm just waiting for the next release :)
v1.10.12 is out with this fix!