cryptsetup failing in 1.9.4, successful in 1.8.3
Bug Report
Description
After upgrade from 1.8.3 to 1.9.4, we noticed that encrypted PVs couldn't be mounted anymore on the Talos node that was upgraded. Other nodes (still on 1.8.3) were still capable of mounting the same PVs.
I booted the upgraded node back into 1.8.3, and mounting the volumes worked again. After booting into 1.9.4 once more, the issue returned, so we decided to roll back to 1.8.3 for the time being. I couldn't relate the failure to anything in the release notes, hence this issue. Since nothing changed in our Rook and Ceph versions, and both are at their latest releases, I'd expect this to be a Talos issue.
Configuration
Encrypted storageclass using:
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
...
parameters:
  ...
  encrypted: "true"
  encryptionKMSID: example-secret-name
  ...
Logs
Pod event logs:
MountVolume.MountDevice failed for volume "pvc-ba495824-4fd2-470e-a13b-c389d5cb0a60" : rpc error: code = Internal desc = an error (exit status 1) occurred while running cryptsetup args: [luksOpen /dev/rbd4 luks-rbd-0001-0009-rook-ceph-0000000000000008-e1890dfc-b7e1-4692-a46d-7dc871010d67 --disable-keyring -d /dev/stdin]
csi-rbdplugin logs:
2025-03-05T09:42:11.843928203Z W0305 09:42:11.843768 247407 rbd_attach.go:238] nbd modprobe failed (an error (exit status 1) occurred while running modprobe args: [nbd]): "modprobe: ERROR: could not insert 'nbd': Operation not permitted\n"
2025-03-05T09:42:48.289039584Z E0305 09:42:48.288918 247407 crypto.go:283] ID: 14 Req-ID: 0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 failed to open device "/dev/rbd5" (an error (exit status 1) occurred while running cryptsetup args: [luksOpen /dev/rbd5 luks-rbd-0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 --disable-keyring -d /dev/stdin]): Failed to open key file.
2025-03-05T09:42:48.289077695Z E0305 09:42:48.288962 247407 encryption.go:300] ID: 14 Req-ID: 0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 failed to open device replicatedpool/csi-vol-b7d268cc-e8bf-450e-8765-f4e3fc125954: an error (exit status 1) occurred while running cryptsetup args: [luksOpen /dev/rbd5 luks-rbd-0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 --disable-keyring -d /dev/stdin]
2025-03-05T09:42:48.383675741Z E0305 09:42:48.383560 247407 utils.go:271] ID: 14 Req-ID: 0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 GRPC error: rpc error: code = Internal desc = an error (exit status 1) occurred while running cryptsetup args: [luksOpen /dev/rbd5 luks-rbd-0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 --disable-keyring -d /dev/stdin]
Environment
- Talos version: 1.9.4 / 1.8.3
- Kubernetes version: 1.31.3
- Platform: Rook 1.16.4 / Ceph v19.2.1-20250202
This should have nothing to do with Talos in general, unless you can pin it to Talos, since it's your CSI.
I have no idea whether Ceph uses cryptsetup from the host or not. cryptsetup certainly got updated in Talos 1.9, but if that fails, it would be a bug on the Ceph side.
What are the cryptsetup versions used in 1.8.3 and 1.9.4? I can try creating an issue at Ceph side, but I'm not that read into the internals of Talos nor Ceph so I don't know exactly what to ask.
As I said, the issue clearly surfaced after updating Talos, so I'd say it is (at least partially) a Talos issue. If Talos doesn't work with the most recent versions of Ceph and/or Rook, I'd at least expect that to be documented in the Talos release notes or docs.
Whether something works or doesn't work is a very controversial topic. We can't test all possible software with Talos.
Unless you submitted a test to verify your Ceph setup with Talos, it's not tested, and unless you tested beta versions, we can't really detect such regressions in time.
Talos does have Ceph/Rook integration tests, and they work fully, including encryption. So your issue is either not related to Talos, or includes other flows.
You can find versions via git:
https://github.com/siderolabs/pkgs/blob/release-1.9/Pkgfile#L20
https://github.com/siderolabs/pkgs/blob/release-1.8/Pkgfile#L20
And in fact they haven't changed.
And also you need to verify whether Ceph is using host cryptsetup, or its own bundled version. If it uses its own version, it's totally not a Talos issue directly.
As mentioned in issue #10469, it appears there is a problem with cryptsetup in Talos 1.9.4 (or possibly an earlier version). Longhorn volume encryption does not work in 1.9.4, although it functions correctly in 1.8.4.
I have confirmed that this issue has persisted since at least Talos 1.9.1. In v1.9.0, Longhorn volume creation (not encryption) also failed—likely for a different reason. I’m not sure whether the encryption failure is directly related to cryptsetup/dm-crypt or to another component in Talos Linux, but until it is resolved, I have no choice but to remain on v1.8.4.
- Longhorn relies on host-based cryptsetup, as stated in its documentation (https://longhorn.io/docs/1.8.1/advanced-resources/security/volume-encryption/).
- The error below was encountered while attempting to encrypt a volume. The entire Longhorn (v1.8.1, latest) configuration remained identical across the different Talos versions (1.8.4, 1.9.0-1.9.4) I tested, and the encryption key was stored in a Kubernetes secret as specified in the official Longhorn documentation. The only variable that appears to affect whether encryption worked is the underlying Talos version.
MountVolume.MountDevice failed for volume "pvc-5229e867-2822-4073-b8f7-0c87b9e7978f" : rpc error: code = Internal desc = failed to encrypt device /dev/longhorn/pvc-5229e867-2822-4073-b8f7-0c87b9e7978f with LUKS: failed to execute: /usr/bin/nsenter [nsenter --mount=/host/proc/7780/ns/mnt --ipc=/host/proc/7780/ns/ipc cryptsetup -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/longhorn/pvc-5229e867-2822-4073-b8f7-0c87b9e7978f -d /dev/stdin], output , stderr Failed to open key file. : exit status 1"
I too have come across this issue. Would it be possible to add cryptsetup as an extension to talos to allow it to be used by other CSI managers?
cryptsetup is already included in base Talos.
Thank you for responding @smira, but as multiple users have now mentioned that they've come across this issue, do you still stand by your earlier comment that this issue is not at all related to (or caused by) Talos?
As it was explained above, there is no change to cryptsetup version itself, moreover Ceph encryption still works.
There might be some issue, but it's not clear what it is so far, someone needs to debug and figure out what exactly.
Hi smira, The identical problem is being observed on my system, which runs Talos v1.9.3 and Longhorn, configured as follows.
Environment:
- LongHorn: v1.8.0
- TalosOS: v1.9.3
Extension loaded:
- siderolabs/iscsi-tools
- siderolabs/qemu-guest-agent
- siderolabs/util-linux-tools
Storageclass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-crypto-apps
parameters:
  csi.storage.k8s.io/node-publish-secret-name: longhorn-crypto
  csi.storage.k8s.io/node-publish-secret-namespace: apps
  csi.storage.k8s.io/node-stage-secret-name: longhorn-crypto
  csi.storage.k8s.io/node-stage-secret-namespace: apps
  csi.storage.k8s.io/provisioner-secret-name: longhorn-crypto
  csi.storage.k8s.io/provisioner-secret-namespace: apps
  encrypted: "true"
  fromBackup: ""
  numberOfReplicas: "2"
  staleReplicaTimeout: "2880"
allowVolumeExpansion: true
provisioner: driver.longhorn.io
reclaimPolicy: Retain
volumeBindingMode: Immediate
Secret:
apiVersion: v1
data:
  CRYPTO_KEY_PROVIDER: c2VjcmV0
  CRYPTO_KEY_VALUE: <censored>
kind: Secret
metadata:
  name: longhorn-crypto
  namespace: apps
type: Opaque
When attempting to provision a new volume via that storage class, the output of the kubectl describe command is as follows:
MountVolume.MountDevice failed for volume "pvc-ec8f2ea3-4bb7-445d-bf99-fd8a0a76ddf4" : rpc error: code = Internal desc = failed to encrypt device /dev/longhorn/pvc-ec8f2ea3-4bb7-445d-bf99-fd8a0a76ddf4 with LUKS: failed to execute: /usr/bin/nsenter [nsenter --mount=/host/proc/247942/ns/mnt --ipc=/host/proc/247942/ns/ipc cryptsetup -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/longhorn/pvc-ec8f2ea3-4bb7-445d-bf99-fd8a0a76ddf4 -d /dev/stdin], output , stderr Failed to open key file.
Should further information be necessary, including potentially from system logs, please do not hesitate to inform me. Thank you
I noticed Talos 1.9.5 was released today. After performing a clean install, I tested the same Longhorn configuration as in my previous tests and confirmed that the encryption failure still persists in v1.9.5. I can retest if a better method is suggested to investigate other potential causes.
I don't quite know what the issue is - from the looks of it, it seems to be a bug in Longhorn.
Like --mount=/host/proc/247942/ns/mnt - what is the PID? It's not PID1, so not sure which mount namespace it even tries to enter? Does it execute host cryptsetup or its own?
The message says 'Failed to open key file', while the key file should be -d /dev/stdin ? The docs on cryptsetup say it should be -d -.
I can easily reproduce this from a privileged pod:
/ # nsenter -t 1 -m cryptsetup -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/loop1 -d /dev/stdin
Failed to open key file.
/ # nsenter -t 1 -m cryptsetup -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/loop1 -d -
Enter passphrase for /fff:
/ #
You can see that -d /dev/stdin doesn't work at all.
See cryptsetup man, search for --key-file.
So there's not much we can do on Talos side.
@smira Thank you for your investigation! I have just submitted a bug report to Longhorn. Hopefully, the root cause will be identified soon.
@smira I received the following feedback from c3y1huang of the Longhorn team (https://github.com/longhorn/longhorn/issues/10584).
Thank you for raising this. Did you first notice the issue with Talos v1.9.0?
Replicate the failure manually by running from a privileged pod. Also, note that replacing "-d /dev/stdin" with "-d -" results in a successful execution.
Longhorn uses -d /dev/stdin because -d - requires interactive input. If you run the following command directly for testing, it should fail due to the missing piped input into /dev/stdin:
nsenter -t 1 -m cryptsetup -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/xvdh -d /dev/stdin
The complete execution is equivalent to:
echo "passphrase" | nsenter -t 1 -m cryptsetup -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/xvdh -d /dev/stdin
https://github.com/longhorn/go-common-libs/blob/72871a09bee01932711d953bc0d48a09fa3ddb1e/exec/exec.go#L96-L105
This logic is not specific to Talos; Longhorn uses it for all operating systems. And I don't know why the same Longhorn version works on Talos 1.8.x but not 1.9.x, as IIRC we haven't tested Talos 1.9.x yet. https://longhorn.io/docs/1.8.1/best-practices/#operating-system
While I understand it is the Longhorn team's responsibility to verify compatibility with Talos 1.9.x, I wanted to pass this information along to see if there might still be anything on the Talos side related to this issue.
Thank you for your time again.
Longhorn uses -d /dev/stdin because -d - requires interactive input. If you run the following command directly for testing, it should fail due to the missing piped input into /dev/stdin:
This is not true. The proper way is to follow cryptsetup documentation:
/ # echo "passphrase" | nsenter -t 1 -m cryptsetup -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/loop1 -d /dev/stdin
Failed to open key file.
/ # echo -n "passphrase" | nsenter -t 1 -m cryptsetup -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/loop1 -d -
/ # (successfully encrypted)
The -d /dev/stdin simply doesn't work.
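A side note on the working command above: it pipes the passphrase with echo -n. As far as I understand the cryptsetup man page, a key read via --key-file=- is taken verbatim, so a trailing newline would become part of the key. The sketch below only counts the bytes that each variant would put on stdin. It does not call cryptsetup, and "passphrase" is a placeholder value.

```shell
# Count the bytes each command would feed to cryptsetup on stdin.
# "passphrase" is a placeholder, not a real key.
with_newline=$(echo "passphrase" | wc -c | tr -d ' ')
without_newline=$(printf '%s' "passphrase" | wc -c | tr -d ' ')

echo "echo:        ${with_newline} bytes"     # passphrase plus trailing newline
echo "printf '%s': ${without_newline} bytes"  # passphrase only
```

The first variant yields 11 bytes and the second 10, so a volume formatted with one form would presumably not open with the other when the key is read as a key file.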
@smira It appears the Longhorn team tested your suggestion (https://github.com/longhorn/longhorn/issues/10584), and now plans to incorporate the solution in their upcoming releases (1.7.4 and 1.8.2) for compatibility with Talos 1.9.x. Thank you again for your help!
Great that the problem is solved now in Longhorn, should a similar issue be created at Rook/Ceph side?
should a similar issue be created at Rook/Ceph side?
The relevant repo is https://github.com/ceph/ceph-csi
https://github.com/ceph/ceph-csi/blob/49d094e3d5abf8c5c4c64c2796ace44e16fd382c/internal/util/cryptsetup/cryptsetup.go#L81 https://github.com/ceph/ceph-csi/blob/49d094e3d5abf8c5c4c64c2796ace44e16fd382c/internal/util/cryptsetup/cryptsetup.go#L95
If I don't forget, I'll create an issue there next week.
I'm running into this issue as well, but using encrypted ceph volumes using ceph-csi. @moo-im-a-cow did you already have a chance to create an issue in https://github.com/ceph/ceph-csi ? 🙏
@smira just to better understand the issue: if nothing changed between Talos 1.8 and 1.9, any idea on why the encrypted volumes do work on 1.8 but not on 1.9?
I do not know, but at the same time Ceph with encryption works in our integration tests without issues 🤷
I guess this is a very long story, but it also stems from questionable CSI behavior: shelling out to the host's cryptsetup instead of using the CSI's own bundled version, which was tested to work properly.
I tried creating the issue at Ceph-side; see https://github.com/ceph/ceph-csi/issues/5306
PR was merged into ceph-csi devel branch and will be included in the next release (latest release is currently 3.14.0, so probably 3.14.1 or 3.15.0). 🥳
For other Rook users, like me, that means:
- Wait for the Ceph CSI release
- Wait until Rook includes the new Ceph CSI release (Rook 1.17.0/1.17.1 contain 3.14.0)
- Upgrade to that Rook release
- Finally, upgrade to a Talos release newer than 1.8.x
Just in case it helps someone else, I did a (rather ugly) workaround for this issue with Longhorn. I applied a patch to Longhorn's Helm release (which we manage via FluxCD) so that an initContainer patches longhorn-csi-plugin to replace -d /dev/stdin with -d -. The patch contains a checksum guard to ensure it only patches the known version.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: longhorn-release
  [...]
spec:
  postRenderers:
    - kustomize:
        patches:
          - target:
              group: apps
              version: v1
              kind: DaemonSet
              name: longhorn-csi-plugin
            patch: |-
              - op: add
                path: /spec/template/spec/volumes/-
                value:
                  - name: patched-nsmounter
                    emptyDir: {}
              - op: add
                path: /spec/template/spec/containers/name=longhorn-csi-plugin/volumeMounts/-
                value:
                  - name: patched-nsmounter
                    mountPath: /usr/local/sbin/nsmounter
                    subPath: nsmounter
              - op: add
                path: /spec/template/spec/initContainers/-
                value:
                  - name: patch-nsmounter
                    image: "longhornio/longhorn-manager:v1.8.1"
                    command: ["/bin/sh","-c"]
                    volumeMounts:
                      - name: patched-nsmounter
                        mountPath: /mnt/nsmounter
                    args:
                      - |
                        # copy original script into our shared volume
                        cp /usr/local/sbin/nsmounter /mnt/nsmounter/nsmounter
                        # only patch if the checksum matches
                        if [ "$(sha256sum /usr/local/sbin/nsmounter | cut -d' ' -f1)" = "dc2cc05398fcbe34768f2753f446ab1a3480d001e2022c4581d3201515835ffb" ]; then
                          sed -i '$i\
                        rewrite_args=()\
                        while [[ $# -gt 0 ]]; do\
                          if [[ $1 == "-d" && $2 == "/dev/stdin" ]]; then\
                            rewrite_args+=("-d" "-")\
                            shift 2\
                          else\
                            rewrite_args+=("$1")\
                            shift\
                          fi\
                        done\
                        set -- "${rewrite_args[@]}"' /mnt/nsmounter/nsmounter
                        fi
                        chmod +x /mnt/nsmounter/nsmounter
  chart:
    spec:
      chart: longhorn
      reconcileStrategy: ChartVersion
      sourceRef:
        kind: HelmRepository
        name: longhorn-repo
      version: v1.8.1
  [...]
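The argument rewrite that this initContainer injects can also be tried in isolation. Below is a minimal POSIX sh sketch of the same idea. It is not Longhorn's nsmounter script or the exact snippet from the patch: the function name rewrite_key_file_args is hypothetical, and it prints the rewritten arguments one per line instead of executing them.

```shell
# Hypothetical helper: replace every "-d /dev/stdin" pair with "-d -".
# Prints the resulting arguments one per line instead of executing them.
rewrite_key_file_args() {
  remaining=$#
  while [ "$remaining" -gt 0 ]; do
    # Only match when at least two unprocessed arguments are left,
    # so a trailing "-d" is never paired with an already-rewritten one.
    if [ "$remaining" -ge 2 ] && [ "$1" = "-d" ] && [ "$2" = "/dev/stdin" ]; then
      set -- "$@" "-d" "-"
      shift 2
      remaining=$((remaining - 2))
    else
      # Rotate the unchanged argument to the end of the list.
      set -- "$@" "$1"
      shift
      remaining=$((remaining - 1))
    fi
  done
  [ "$#" -gt 0 ] && printf '%s\n' "$@"
  return 0
}

# Example invocation with a shortened cryptsetup command line.
rewrite_key_file_args cryptsetup -q luksFormat /dev/loop1 -d /dev/stdin
```

Rotating the positional parameters avoids bash arrays, so the sketch should also run under a plain /bin/sh. All other arguments pass through unchanged and in order.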
1.9.0 of Longhorn fixes this issue, we can likely close it.
This issue is for Rook/Ceph, not for Longhorn. I'm glad that Longhorn users have a solution earlier 😆 but I don't quite agree that the issue has been completely resolved yet. And I still can't really understand why this changed between Talos 1.8 and 1.9. I can imagine, for future changes, it'd be useful to find out what caused this to break, even though the current behavior of both Ceph and Longhorn was not 100% according to specs.
Apologies, I'm tracking a few issues around this and misremembered the focus of this one.
Status update: Ceph has backported it into patch release Ceph CSI 3.14.2. Rook has backported it to 1.17, but no release has been made yet (will be 1.17.7).
After that, I can test if the new releases make it possible to upgrade to Talos 1.9 and 1.10 for us.
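For anyone scripting this check before a Talos upgrade, here is a small sketch that compares a Ceph CSI version against 3.14.2, the patch release named above. The function name version_ge is hypothetical, and it assumes plain MAJOR.MINOR.PATCH versions with an optional leading v. How you obtain the running version from your cluster is left out.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Assumes numeric MAJOR.MINOR.PATCH versions, optionally prefixed with "v".
version_ge() {
  IFS=. read -r a1 a2 a3 <<EOF
${1#v}
EOF
  IFS=. read -r b1 b2 b3 <<EOF
${2#v}
EOF
  [ "${a1:-0}" -gt "${b1:-0}" ] && return 0
  [ "${a1:-0}" -lt "${b1:-0}" ] && return 1
  [ "${a2:-0}" -gt "${b2:-0}" ] && return 0
  [ "${a2:-0}" -lt "${b2:-0}" ] && return 1
  [ "${a3:-0}" -ge "${b3:-0}" ]
}

# Example: 3.14.0 predates the fix, 3.14.2 contains it.
if version_ge "v3.14.2" "3.14.2"; then
  echo "ceph-csi version includes the cryptsetup fix"
fi
```

Versions with suffixes such as release candidates are not handled and would need extra parsing.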
I tested it by pointing directly to the latest CephCSI image and could upgrade our clusters to 1.9.5 and then 1.10.5 without any issues. 🙌