cryptsetup failing in 1.9.4, successful in 1.8.3

Open bartlaarhoven opened this issue 9 months ago • 28 comments

Bug Report

Description

After upgrading from 1.8.3 to 1.9.4, we noticed that encrypted PVs could no longer be mounted on the upgraded Talos node. Other nodes (still on 1.8.3) were still able to mount the same PVs.

I booted the upgraded node back into 1.8.3, and mounting the volumes worked again. After booting into 1.9.4 once more, the issue returned, so we decided to roll back to 1.8.3 for the time being. I couldn't relate the failure to anything in the release notes, hence this issue. Since nothing changed in our Rook and Ceph versions, and both are at their latest releases, I'd expect this to be a Talos issue.

Configuration

Encrypted storageclass using:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
...
parameters:
...
    encrypted: "true"
    encryptionKMSID: example-secret-name
...

Logs

Pod event logs:

MountVolume.MountDevice failed for volume "pvc-ba495824-4fd2-470e-a13b-c389d5cb0a60" : rpc error: code = Internal desc = an error (exit status 1) occurred while running cryptsetup args: [luksOpen /dev/rbd4 luks-rbd-0001-0009-rook-ceph-0000000000000008-e1890dfc-b7e1-4692-a46d-7dc871010d67 --disable-keyring -d /dev/stdin]

csi-rbdplugin logs:

2025-03-05T09:42:11.843928203Z W0305 09:42:11.843768  247407 rbd_attach.go:238] nbd modprobe failed (an error (exit status 1) occurred while running modprobe args: [nbd]): "modprobe: ERROR: could not insert 'nbd': Operation not permitted\n"
2025-03-05T09:42:48.289039584Z E0305 09:42:48.288918  247407 crypto.go:283] ID: 14 Req-ID: 0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 failed to open device "/dev/rbd5" (an error (exit status 1) occurred while running cryptsetup args: [luksOpen /dev/rbd5 luks-rbd-0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 --disable-keyring -d /dev/stdin]): Failed to open key file.
2025-03-05T09:42:48.289077695Z E0305 09:42:48.288962  247407 encryption.go:300] ID: 14 Req-ID: 0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 failed to open device replicatedpool/csi-vol-b7d268cc-e8bf-450e-8765-f4e3fc125954: an error (exit status 1) occurred while running cryptsetup args: [luksOpen /dev/rbd5 luks-rbd-0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 --disable-keyring -d /dev/stdin]
2025-03-05T09:42:48.383675741Z E0305 09:42:48.383560  247407 utils.go:271] ID: 14 Req-ID: 0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 GRPC error: rpc error: code = Internal desc = an error (exit status 1) occurred while running cryptsetup args: [luksOpen /dev/rbd5 luks-rbd-0001-0009-rook-ceph-0000000000000008-b7d268cc-e8bf-450e-8765-f4e3fc125954 --disable-keyring -d /dev/stdin]

Environment

  • Talos version: 1.9.4 / 1.8.3
  • Kubernetes version: 1.31.3
  • Platform: Rook 1.16.4 / Ceph v19.2.1-20250202

bartlaarhoven avatar Mar 05 '25 10:03 bartlaarhoven

This should have nothing to do with Talos in general, unless you can pin it down to Talos, since it's your CSI that is failing.

I have no idea whether Ceph uses cryptsetup from the host or not. It certainly got updated in Talos 1.9, but if that fails, it would be a bug on the Ceph side.

smira avatar Mar 05 '25 11:03 smira

What are the cryptsetup versions used in 1.8.3 and 1.9.4? I can try creating an issue on the Ceph side, but I'm not that familiar with the internals of Talos or Ceph, so I don't know exactly what to ask.

As I said, the issue clearly surfaced after updating Talos, so I'd say it is (at least partially) a Talos issue. When Talos doesn't work with the most recent versions of Ceph and/or Rook, I'd at least expect that to be documented in the Talos release notes or docs.

bartlaarhoven avatar Mar 05 '25 11:03 bartlaarhoven

Whether something works or doesn't work is a very controversial topic. We can't test all possible software with Talos.

Unless you submitted a test to verify your Ceph setup with Talos, it's not tested, and unless you tested beta versions, we can't really detect such regressions in time.

Talos does have Ceph/Rook integration tests, and they work fully, including encryption. So your issue is either not related to Talos, or includes other flows.

You can find versions via git:

https://github.com/siderolabs/pkgs/blob/release-1.9/Pkgfile#L20

https://github.com/siderolabs/pkgs/blob/release-1.8/Pkgfile#L20

And in fact they haven't changed.

smira avatar Mar 05 '25 12:03 smira

And also you need to verify whether Ceph is using host cryptsetup, or its own bundled version. If it uses its own version, it's totally not a Talos issue directly.

smira avatar Mar 05 '25 12:03 smira

As mentioned in issue #10469, it appears there is a problem with cryptsetup in Talos 1.9.4 (or possibly an earlier version). Longhorn volume encryption does not work in 1.9.4, although it functions correctly in 1.8.4.

shelumiel avatar Mar 08 '25 07:03 shelumiel

As mentioned in issue #10469, it appears there is a problem with cryptsetup in Talos 1.9.4 (or possibly an earlier version). Longhorn volume encryption does not work in 1.9.4, although it functions correctly in 1.8.4.

I have confirmed that this issue has persisted since at least Talos 1.9.1. In v1.9.0, Longhorn volume creation (not encryption) also failed—likely for a different reason. I’m not sure whether the encryption failure is directly related to cryptsetup/dm-crypt or to another component in Talos Linux, but until it is resolved, I have no choice but to remain on v1.8.4.

  • Longhorn relies on host-based cryptsetup, as stated in its documentation (https://longhorn.io/docs/1.8.1/advanced-resources/security/volume-encryption/).

  • The error below was encountered while attempting to encrypt a volume. The entire Longhorn (v1.8.1, latest) configuration remained identical across the different Talos versions (1.8.4, 1.9.0-1.9.4) I tested, and the encryption key was stored in a Kubernetes secret as specified in the official Longhorn documentation. The only variable that appears to affect whether encryption worked is the underlying Talos version.

MountVolume.MountDevice failed for volume "pvc-5229e867-2822-4073-b8f7-0c87b9e7978f" : rpc error: code = Internal desc = failed to encrypt device /dev/longhorn/pvc-5229e867-2822-4073-b8f7-0c87b9e7978f with LUKS: failed to execute: /usr/bin/nsenter [nsenter --mount=/host/proc/7780/ns/mnt --ipc=/host/proc/7780/ns/ipc cryptsetup -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/longhorn/pvc-5229e867-2822-4073-b8f7-0c87b9e7978f -d /dev/stdin], output , stderr Failed to open key file. : exit status 1"

shelumiel avatar Mar 08 '25 17:03 shelumiel

I too have come across this issue. Would it be possible to add cryptsetup as an extension to Talos so that other CSI drivers can use it?

lededje avatar Mar 12 '25 09:03 lededje

cryptsetup is already included in base Talos.

smira avatar Mar 12 '25 10:03 smira

Thank you for responding @smira, but as multiple users have now mentioned that they've come across this issue, do you still stand by your earlier comment that this issue is not at all related to (or caused by) Talos?

bartlaarhoven avatar Mar 12 '25 10:03 bartlaarhoven

As explained above, the cryptsetup version itself has not changed, and Ceph encryption still works.

There might be some issue, but it's not clear so far what it is. Someone needs to debug this and figure out exactly what is going on.

smira avatar Mar 12 '25 10:03 smira

Hi smira, I'm seeing the same problem on my system, which runs Talos v1.9.3 and Longhorn, configured as follows.

Environment:

  • Longhorn: v1.8.0
  • Talos: v1.9.3

Extension loaded:

  • siderolabs/iscsi-tools
  • siderolabs/qemu-guest-agent
  • siderolabs/util-linux-tools

Storageclass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-crypto-apps
parameters:
  csi.storage.k8s.io/node-publish-secret-name: longhorn-crypto
  csi.storage.k8s.io/node-publish-secret-namespace: apps
  csi.storage.k8s.io/node-stage-secret-name: longhorn-crypto
  csi.storage.k8s.io/node-stage-secret-namespace: apps
  csi.storage.k8s.io/provisioner-secret-name: longhorn-crypto
  csi.storage.k8s.io/provisioner-secret-namespace: apps
  encrypted: "true"
  fromBackup: ""
  numberOfReplicas: "2"
  staleReplicaTimeout: "2880"
allowVolumeExpansion: true
provisioner: driver.longhorn.io
reclaimPolicy: Retain
volumeBindingMode: Immediate

Secret:

apiVersion: v1
data:
  CRYPTO_KEY_PROVIDER: c2VjcmV0
  CRYPTO_KEY_VALUE: <censored>
kind: Secret
metadata:
  name: longhorn-crypto
  namespace: apps
type: Opaque

When attempting to provision a new volume via that storage class, the output of the kubectl describe command is as follows:

MountVolume.MountDevice failed for volume "pvc-ec8f2ea3-4bb7-445d-bf99-fd8a0a76ddf4" : rpc error: code = Internal desc = failed to encrypt device /dev/longhorn/pvc-ec8f2ea3-4bb7-445d-bf99-fd8a0a76ddf4 with LUKS: failed to execute: /usr/bin/nsenter [nsenter --mount=/host/proc/247942/ns/mnt --ipc=/host/proc/247942/ns/ipc cryptsetup -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/longhorn/pvc-ec8f2ea3-4bb7-445d-bf99-fd8a0a76ddf4 -d /dev/stdin], output , stderr Failed to open key file.

If you need further information, such as system logs, please let me know. Thank you.

asbarbati avatar Mar 12 '25 23:03 asbarbati

I noticed Talos 1.9.5 was released today. After performing a clean install, I tested the same Longhorn configuration as in my previous tests and confirmed that the encryption failure still persists in v1.9.5. I can retest if a better method is suggested to investigate other potential causes.

shelumiel avatar Mar 13 '25 04:03 shelumiel

I don't quite know what the issue is - from the looks of it, it seems to be a bug in Longhorn.

Like --mount=/host/proc/247942/ns/mnt - what is the PID? It's not PID1, so not sure which mount namespace it even tries to enter? Does it execute host cryptsetup or its own?

The message says 'Failed to open key file', while the key file should be -d /dev/stdin ? The docs on cryptsetup say it should be -d -.

I can easily reproduce this from a privileged pod:

/ # nsenter -t 1 -m cryptsetup  -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/loop1 -d /dev/stdin
Failed to open key file.
/ # nsenter -t 1 -m cryptsetup  -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/loop1 -d -
Enter passphrase for /fff:
/ #

You can see that -d /dev/stdin doesn't work at all.

See the cryptsetup man page and search for --key-file.

So there's not much we can do on the Talos side.

smira avatar Mar 13 '25 10:03 smira
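
The two key-file forms in the reproduction above differ in how the input is reached: "-" reads from the file descriptor that is already open, while a path such as /dev/stdin has to be opened again by name in whatever mount namespace the process is running in. The thread does not establish why the path form stopped working on Talos 1.9; the following is only a minimal sketch of that general difference. read_key is a hypothetical helper, not part of cryptsetup, Longhorn, or Talos.

```shell
#!/bin/sh
# read_key KEYFILE
# Hypothetical helper that mimics how a tool might treat a key-file argument.
read_key() {
    if [ "$1" = "-" ]; then
        # Read from the already-open file descriptor 0; no path lookup is needed.
        cat
    else
        # Re-open the key file by path. This only works if the path resolves
        # and is readable in the current mount namespace.
        cat "$1" 2>/dev/null || {
            echo "Failed to open key file." >&2
            return 1
        }
    fi
}

# The "-" form works whenever something is piped in.
printf '%s' "passphrase" | read_key -
```

Calling read_key with a path that cannot be opened returns a non-zero status even when data is piped in, which is the same shape of failure as the cryptsetup error in the logs.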

@smira Thank you for your investigation! I have just submitted a bug report to Longhorn. Hopefully, the root cause will be identified soon.

shelumiel avatar Mar 13 '25 13:03 shelumiel

@smira I received the following feedback from c3y1huang of the Longhorn team (https://github.com/longhorn/longhorn/issues/10584).

Thank you for raising this. Did you first notice the issue with Talos v1.9.0?

Replicate the failure manually by running from a privileged pod. Also, note that replacing "-d /dev/stdin" with "-d -" results in a successful execution.

Longhorn uses -d /dev/stdin because -d - requires interactive input. If you run the following command directly for testing, it should fail due to the missing piped input into /dev/stdin:

nsenter -t 1 -m cryptsetup  -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/xvdh -d /dev/stdin

The complete execution is equivalent to:

echo "passphrase" | nsenter -t 1 -m cryptsetup  -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/xvdh -d /dev/stdin

https://github.com/longhorn/go-common-libs/blob/72871a09bee01932711d953bc0d48a09fa3ddb1e/exec/exec.go#L96-L105

This logic is not specific to Talos; Longhorn uses it for all operating systems. And I don't know why the same Longhorn version works on Talos 1.8.x but not 1.9.x, as IIRC we haven't tested Talos 1.9.x yet. https://longhorn.io/docs/1.8.1/best-practices/#operating-system

While I understand it is the Longhorn team's responsibility to verify compatibility with Talos 1.9.x, I wanted to pass this information along to see if there might still be anything on the Talos side related to this issue.

Thank you for your time again.

shelumiel avatar Mar 14 '25 01:03 shelumiel

Longhorn uses -d /dev/stdin because -d - requires interactive input. If you run the following command directly for testing, it should fail due to the missing piped input into /dev/stdin:

This is not true. The proper way is to follow cryptsetup documentation:

/ # echo "passphrase" | nsenter -t 1 -m cryptsetup  -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/loop1  -d /dev/stdin
Failed to open key file.
/ # echo -n "passphrase" | nsenter -t 1 -m cryptsetup  -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/loop1  -d -
/ # (successfully encrypted)

The -d /dev/stdin simply doesn't work.

smira avatar Mar 14 '25 09:03 smira
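
Note that the two commands above also differ in echo versus echo -n. As far as the cryptsetup man page describes it, a passphrase supplied through --key-file (including "-") is used as-is, so a trailing newline would become part of the key. The sketch below only shows the byte-count difference between the two ways of producing the input. key_bytes is a hypothetical helper, and printf '%s' is used in place of echo -n because echo -n is not portable across shells.

```shell
#!/bin/sh
# key_bytes: print how many bytes arrive on stdin.
# (tr strips the padding that some wc implementations add.)
key_bytes() {
    wc -c | tr -d ' '
}

echo "passphrase" | key_bytes          # prints 11: ten characters plus a newline
printf '%s' "passphrase" | key_bytes   # prints 10: no trailing newline
```

A volume formatted with one form and opened with the other would therefore see two different keys, so it is worth keeping the producer consistent when testing by hand.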

@smira It appears the Longhorn team tested your suggestion (https://github.com/longhorn/longhorn/issues/10584), and now plans to incorporate the solution in their upcoming releases (1.7.4 and 1.8.2) for compatibility with Talos 1.9.x. Thank you again for your help!

shelumiel avatar Mar 19 '25 03:03 shelumiel

Great that the problem is now solved in Longhorn. Should a similar issue be created on the Rook/Ceph side?

bartlaarhoven avatar Mar 19 '25 09:03 bartlaarhoven

should a similar issue be created at Rook/Ceph side?

The relevant repo is https://github.com/ceph/ceph-csi

https://github.com/ceph/ceph-csi/blob/49d094e3d5abf8c5c4c64c2796ace44e16fd382c/internal/util/cryptsetup/cryptsetup.go#L81 https://github.com/ceph/ceph-csi/blob/49d094e3d5abf8c5c4c64c2796ace44e16fd382c/internal/util/cryptsetup/cryptsetup.go#L95

If I don't forget, I'll create an issue there within the next week.

moo-im-a-cow avatar Apr 09 '25 10:04 moo-im-a-cow

If I don't forget, I'll create an issue there within the next week.

I'm running into this issue as well, but using encrypted ceph volumes using ceph-csi. @moo-im-a-cow did you already have a chance to create an issue in https://github.com/ceph/ceph-csi ? 🙏

dedene avatar Apr 23 '25 08:04 dedene

Longhorn uses -d /dev/stdin because -d - requires interactive input. If you run the following command directly for testing, it should fail due to the missing piped input into /dev/stdin:

This is not true. The proper way is to follow cryptsetup documentation:

/ # echo "passphrase" | nsenter -t 1 -m cryptsetup  -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/loop1  -d /dev/stdin
Failed to open key file.
/ # echo -n "passphrase" | nsenter -t 1 -m cryptsetup  -q luksFormat --type luks2 --cipher aes-xts-plain64 --hash sha256 --key-size 256 --pbkdf argon2i /dev/loop1  -d -
/ # (successfully encrypted)

The -d /dev/stdin simply doesn't work.

@smira just to better understand the issue: if nothing changed between Talos 1.8 and 1.9, any idea on why the encrypted volumes do work on 1.8 but not on 1.9?

dedene avatar Apr 23 '25 09:04 dedene

@smira just to better understand the issue: if nothing changed between Talos 1.8 and 1.9, any idea on why the encrypted volumes do work on 1.8 but not on 1.9?

I do not know, but at the same time Ceph with encryption works in our integration tests without issues 🤷

I guess this is a very long story, but it also stems from questionable CSI behavior: shelling out to the host for cryptsetup instead of using its own bundled version, which was tested to work properly.

smira avatar Apr 23 '25 10:04 smira

I tried creating the issue at Ceph-side; see https://github.com/ceph/ceph-csi/issues/5306

bartlaarhoven avatar May 06 '25 10:05 bartlaarhoven

PR was merged into ceph-csi devel branch and will be included in the next release (latest release is currently 3.14.0, so probably 3.14.1 or 3.15.0). 🥳

For other Rook users, like me, that means:

  1. Wait for the Ceph CSI release
  2. Wait until Rook includes the new Ceph CSI release (Rook 1.17.0/1.17.1 contain 3.14.0)
  3. Upgrade to that Rook release
  4. Finally, upgrade to a Talos release newer than 1.8.x

bartlaarhoven avatar May 10 '25 11:05 bartlaarhoven

Just in case it helps someone else, I put together a (rather ugly) workaround for this issue with Longhorn. I applied a patch to Longhorn's Helm release (which we manage via FluxCD) so that an initContainer patches the nsmounter script used by longhorn-csi-plugin to replace -d /dev/stdin with -d -. The patch contains a checksum guard to ensure it only patches the known version.

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: longhorn-release
[...]
spec:
  postRenderers:
    - kustomize:
        patches:
          - target:
              group: apps
              version: v1
              kind: DaemonSet
              name: longhorn-csi-plugin
            patch: |-
              - op: add
                path: /spec/template/spec/volumes/-
                value:
                  - name: patched-nsmounter
                    emptyDir: {}
              - op: add
                path: /spec/template/spec/containers/name=longhorn-csi-plugin/volumeMounts/-
                value:
                  - name: patched-nsmounter
                    mountPath: /usr/local/sbin/nsmounter
                    subPath: nsmounter
              - op: add
                path: /spec/template/spec/initContainers/-
                value:
                  - name: patch-nsmounter
                    image: "longhornio/longhorn-manager:v1.8.1"
                    command: ["/bin/sh","-c"]
                    volumeMounts:
                      - name: patched-nsmounter
                        mountPath: /mnt/nsmounter
                    args:
                      - |
                        # copy original script into our shared volume
                        cp /usr/local/sbin/nsmounter /mnt/nsmounter/nsmounter

                        # only patch if the checksum matches
                        if [ "$(sha256sum /usr/local/sbin/nsmounter | cut -d' ' -f1)" = "dc2cc05398fcbe34768f2753f446ab1a3480d001e2022c4581d3201515835ffb" ]; then
                          sed -i '$i\
                          rewrite_args=()\
                          while [[ $# -gt 0 ]]; do\
                          if [[ $1 == "-d" && $2 == "/dev/stdin" ]]; then\
                            rewrite_args+=("-d" "-")\
                            shift 2\
                          else\
                            rewrite_args+=("$1")\
                            shift\
                          fi\
                          done\
                          set -- "${rewrite_args[@]}"' /mnt/nsmounter/nsmounter
                        fi

                        chmod +x /mnt/nsmounter/nsmounter
  chart:
    spec:
      chart: longhorn
      reconcileStrategy: ChartVersion
      sourceRef:
        kind: HelmRepository
        name: longhorn-repo
      version: v1.8.1
[...]

ianatha avatar May 18 '25 21:05 ianatha
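
The argument rewrite that the sed snippet above injects can be tried outside the cluster. The following is a minimal POSIX sh sketch of the same idea, not the code that runs in the pod: rewrite_key_file_args is a hypothetical name, and it prints one argument per line instead of rebuilding "$@" with a bash array as the workaround does.

```shell
#!/bin/sh
# rewrite_key_file_args ARG...
# Print the arguments one per line, replacing the pair "-d /dev/stdin"
# with "-d -" and leaving everything else untouched.
rewrite_key_file_args() {
    while [ "$#" -gt 0 ]; do
        if [ "$1" = "-d" ] && [ "${2-}" = "/dev/stdin" ]; then
            printf '%s\n' "-d" "-"
            shift 2
        else
            printf '%s\n' "$1"
            shift
        fi
    done
}

rewrite_key_file_args cryptsetup -q luksFormat /dev/loop1 -d /dev/stdin
```

Only the exact pair is rewritten, so a -d followed by any other key file path passes through unchanged.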

1.9.0 of Longhorn fixes this issue, we can likely close it.

lededje avatar May 29 '25 12:05 lededje

1.9.0 of Longhorn fixes this issue, we can likely close it.

This issue is for Rook/Ceph, not for Longhorn. I'm glad that Longhorn users have a solution earlier 😆 but I don't quite agree that the issue has been completely resolved yet. And I still can't really understand why this changed between Talos 1.8 and 1.9. I can imagine, for future changes, it'd be useful to find out what caused this to break, even though the current behavior of both Ceph and Longhorn was not 100% according to specs.

bartlaarhoven avatar May 29 '25 12:05 bartlaarhoven

Apologies, I'm tracking a few issues around this and misremembered the focus of this one.

lededje avatar Jun 03 '25 10:06 lededje

Status update: Ceph has backported it into patch release Ceph CSI 3.14.2. Rook has backported it to 1.17, but no release has been made yet (will be 1.17.7).

After that, I can test if the new releases make it possible to upgrade to Talos 1.9 and 1.10 for us.

bartlaarhoven avatar Jul 22 '25 21:07 bartlaarhoven
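
For anyone scripting the upgrade order, a version gate along these lines may help. It is a minimal sketch that assumes Ceph CSI 3.14.2 (the patch release named in the status update above) is the first release carrying the fix. version_ge is a hypothetical helper that compares plain dotted numeric versions, so a leading "v" or any suffix from an image tag has to be stripped first.

```shell
#!/bin/sh
# version_ge A B
# Succeed when dotted numeric version A is greater than or equal to B.
# Missing fields count as 0, so "3.15" is treated like "3.15.0".
version_ge() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*}
        y=${b%%.*}
        # Drop the field just read; clear the variable when it was the last one.
        if [ "$a" = "$x" ]; then a=; else a=${a#*.}; fi
        if [ "$b" = "$y" ]; then b=; else b=${b#*.}; fi
        x=${x:-0}
        y=${y:-0}
        if [ "$x" -gt "$y" ]; then return 0; fi
        if [ "$x" -lt "$y" ]; then return 1; fi
    done
    return 0
}

# Assumed minimum, taken from the status update in this thread.
MIN_CEPH_CSI="3.14.2"

if version_ge "3.14.0" "$MIN_CEPH_CSI"; then
    echo "Ceph CSI is new enough"
else
    echo "Ceph CSI is too old, stay on Talos 1.8.x for now"
fi
```

With the 3.14.0 value used here, the script takes the second branch, matching the situation described earlier in the thread.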

I tested it by pointing directly to the latest CephCSI image and could upgrade our clusters to 1.9.5 and then to 1.10.5 without any issues. 🙌

dedene avatar Jul 23 '25 09:07 dedene