zfs trouble on ARM64: segmentation fault

Open samip5 opened this issue 11 months ago • 6 comments

This is not great. How would I even go about debugging this, given that Talos doesn't boot properly as a result?

Running on an Oracle Ampere instance.

user: warning: [2025-01-20T09:56:31.57410882Z]: [talos] [initramfs] enabling system extension zfs 2.2.7-v1.9.2
user: warning: [2025-01-20T09:56:32.18043182Z]: [talos] service[ext-zfs-service](Starting): Starting service
user: warning: [2025-01-20T09:56:32.18533482Z]: [talos] service[ext-zfs-service](Waiting): Waiting for service "containerd" to be "up", service "udevd" to be "up", service "cri" to be "up", file "/dev/zfs" to exist
kern: warning: [2025-01-20T09:56:32.64627282Z]: zfs: module license 'CDDL' taints kernel.
kern: warning: [2025-01-20T09:56:32.65022382Z]: zfs: module license taints kernel.
user: warning: [2025-01-20T09:56:33.19103082Z]: [talos] service[ext-zfs-service](Waiting): Waiting for service "containerd" to be "up", service "udevd" to be "up", service "cri" to be registered, file "/dev/zfs" to exist
kern:  notice: [2025-01-20T09:56:33.27346082Z]: ZFS: Loaded module v2.2.7-1, ZFS pool version 5000, ZFS filesystem version 5
user: warning: [2025-01-20T09:56:34.19160382Z]: [talos] service[ext-zfs-service](Waiting): Waiting for service "cri" to be registered
user: warning: [2025-01-20T09:56:34.96757382Z]: [talos] task startAllServices (1/1): service "apid" to be "up", service "auditd" to be "up", service "containerd" to be "up", service "cri" to be "up", service "etcd" to be "up", service "ext-iscsid" to be "up", service "ext-tgtd" to be "up", service "ext-zfs-service" to be "up", service "kubelet" to be "up", service "machined" to be "up", service "syslogd" to be "up", service "trustd" to be "up", service "udevd" to be "up"
user: warning: [2025-01-20T09:56:35.19158382Z]: [talos] service[ext-zfs-service](Waiting): Waiting for service "cri" to be "up"
user: warning: [2025-01-20T09:56:35.97000382Z]: [talos] service[ext-zfs-service](Preparing): Running pre state
user: warning: [2025-01-20T09:56:35.97765882Z]: [talos] service[ext-zfs-service](Preparing): Creating service runner
user: warning: [2025-01-20T09:56:36.06776182Z]: [talos] service[ext-zfs-service](Running): Started task ext-zfs-service (PID 5315) for container ext-zfs-service
user: warning: [2025-01-20T09:56:36.52519982Z]: [talos] service[ext-zfs-service](Waiting): Error running Containerd(ext-zfs-service), going to restart until it succeeds: task "ext-zfs-service" failed: exit code 1
user: warning: [2025-01-20T09:56:41.59867282Z]: [talos] service[ext-zfs-service](Running): Started task ext-zfs-service (PID 5621) for container ext-zfs-service

talosctl logs ext-zfs-service:

0 / 0 keys successfully loaded
2025/01/20 09:56:36 zfs-service: zpool import error: signal: segmentation fault
no pools available to import

samip5 avatar Jan 20 '25 10:01 samip5

This suggests the zpool program is crashing. You can spawn a privileged system pod and try to debug zpool, or try to install zpool in that system pod (using the distro’s package manager) and see if that also crashes. I’ve run zfs commands inside pods created by https://github.com/kvaps/kubectl-node-shell .
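As a minimal sketch of how one might tell which binary is actually crashing once a shell is available in such a privileged pod: the Python snippet below runs a command and reports whether it exited normally or was killed by a signal such as SIGSEGV. It assumes Python is available in the pod, which is not a given. The helper name `classify_exit` and the list of commands are illustrative only and are not part of Talos or the zfs extension.

```python
import signal
import subprocess


def classify_exit(argv):
    """Run argv and describe how the process ended (hypothetical helper)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except FileNotFoundError:
        return "not found"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = "signal %d" % -proc.returncode
        return "killed by " + name
    return "exit code %d" % proc.returncode


if __name__ == "__main__":
    # Assumed commands to compare; adjust to whatever is installed in the pod.
    for argv in (["zpool", "version"], ["zfs", "version"]):
        print(" ".join(argv), "->", classify_exit(argv))
```

A result of `killed by SIGSEGV` for one command and a plain exit code for the other would point at the specific binary rather than at the kernel module.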

jfroy avatar Jan 20 '25 15:01 jfroy

The weird thing is it did manage to mount my pool.

samip5 avatar Jan 20 '25 16:01 samip5

The zfs binary seems to be segfaulting, while the zpool binary is fine.

samip5 avatar Jan 22 '25 03:01 samip5

Did you end up finding a solution? I have three Dell R630s, and one of my three nodes is having this same issue when starting up a brand new cluster.

DavidIlie avatar Jan 31 '25 00:01 DavidIlie

> Did you end up finding a solution? I have three Dell R630s, and one of my three nodes is having this same issue when starting up a brand new cluster.

It managed to mount the pool, so I don't know what the problem was about. I'm able to schedule pods and so on.

samip5 avatar Jan 31 '25 00:01 samip5

Hello. I get the same error from "talosctl logs ext-zfs-service" on a Raspberry Pi 4. Both on Talos 1.9.1 and 1.9.2. Segfaults ain’t fun.

If I could get some pointers I’d love to help out, sharing some logs etc.

simlun avatar Feb 02 '25 18:02 simlun

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Aug 02 '25 02:08 github-actions[bot]

This issue was closed because it has been stalled for 7 days with no activity.

github-actions[bot] avatar Aug 07 '25 02:08 github-actions[bot]