iSCSI broken after upgrading from Talos 1.11.2 to 1.11.3
Bug Report
I have a Talos cluster configured with the QNAP Trident CSI to connect to our SAN for iSCSI LUNs. In Talos 1.11.2, this was working great. After upgrading to 1.11.3, iSCSI connectivity appeared to degrade over a matter of days, and after several additional days of troubleshooting I believe it to be related to the 1.11.3 upgrade.
Description
I used Image Factory to generate an installer image with my extensions, iscsi-tools and bnx2-bnx2x. My upgrade command from 1.11.2 was:
talosctl upgrade \
--nodes <node-IP> \
--image factory.talos.dev/installer/d3201e2a3a6d3b9d1a10385ad6e86aac7b96c048c02e7f2e2c282c2654a36187:v1.11.3 \
--preserve
Afterwards, the extensions show as installed on each node:
❯ talosctl get extensions -n 10.50.10.100
NODE           NAMESPACE   TYPE              ID   VERSION   NAME          VERSION
10.50.10.100   runtime     ExtensionStatus   0    1         iscsi-tools   v0.2.0
10.50.10.100   runtime     ExtensionStatus   1    1         bnx2-bnx2x    20250917
10.50.10.100   runtime     ExtensionStatus   2    1         schematic     d3201e2a3a6d3b9d1a10385ad6e86aac7b96c048c02e7f2e2c282c2654a36187
ext-iscsid claims to be running:
❯ talosctl -n 10.50.10.100 get services | grep iscsi
NODE           NAMESPACE   TYPE      ID           VERSION   RUNNING   HEALTHY   HEALTH UNKNOWN
10.50.10.100   runtime     Service   ext-iscsid   1         true      false     true
and dmesg shows output around 'ext-iscsid':
❯ talosctl -n 10.50.10.100 dmesg | grep -i iscsi
10.50.10.100: kern: info: [2025-11-02T21:03:10.349644671Z]: Loading iSCSI transport class v2.0-870.
10.50.10.100: kern: notice: [2025-11-02T21:03:10.349851671Z]: iscsi: registered transport (tcp)
10.50.10.100: user: warning: [2025-11-02T21:03:10.776783671Z]: [talos] [initramfs] enabling system extension iscsi-tools v0.2.0
10.50.10.100: user: warning: [2025-11-02T21:03:11.159932671Z]: [talos] service[ext-iscsid](Starting): Starting service
10.50.10.100: user: warning: [2025-11-02T21:03:11.160048671Z]: [talos] service[ext-iscsid](Waiting): Waiting for service "containerd" to be "up", service "cri" to be "up", network, file "/etc/iscsi/initiatorname.iscsi" to exist
10.50.10.100: user: warning: [2025-11-02T21:03:12.160566671Z]: [talos] service[ext-iscsid](Waiting): Waiting for service "containerd" to be "up", service "cri" to be registered, network
10.50.10.100: user: warning: [2025-11-02T21:03:13.160891671Z]: [talos] service[ext-iscsid](Waiting): Waiting for service "cri" to be registered, network
10.50.10.100: user: warning: [2025-11-02T21:03:14.146416671Z]: [talos] task startAllServices (1/1): service "apid" to be "up", service "auditd" to be "up", service "containerd" to be "up", service "cri" to be "up", service "dashboard" to be "up", service "ext-iscsid" to be "up", service "kubelet" to be "up", service "machined" to be "up", service "syslogd" to be "up", service "udevd" to be "up"
10.50.10.100: user: warning: [2025-11-02T21:03:15.160950671Z]: [talos] service[ext-iscsid](Waiting): Waiting for service "cri" to be "up", network
10.50.10.100: user: warning: [2025-11-02T21:03:20.162413671Z]: [talos] service[ext-iscsid](Waiting): Waiting for service "cri" to be "up"
10.50.10.100: user: warning: [2025-11-02T21:03:20.945728671Z]: [talos] service[ext-iscsid](Preparing): Running pre state
10.50.10.100: user: warning: [2025-11-02T21:03:20.946124671Z]: [talos] service[ext-iscsid](Preparing): Creating service runner
10.50.10.100: user: warning: [2025-11-02T21:03:21.093087671Z]: [talos] service[ext-iscsid](Running): Started task ext-iscsid (PID 2567) for container ext-iscsid
That said, no iSCSI modules appear to be loaded on my worker nodes:
❯ talosctl -n 10.50.10.100 read /proc/modules
iTCO_wdt 12288 0 - Live 0xffffffffc054f000
iTCO_vendor_support 12288 1 iTCO_wdt, Live 0xffffffffc04d4000
watchdog 40960 1 iTCO_wdt, Live 0xffffffffc0445000
igb 307200 0 - Live 0xffffffffc04e9000
i2c_i801 36864 0 - Live 0xffffffffc04d6000
i2c_algo_bit 16384 1 igb, Live 0xffffffffc03fd000
e1000e 315392 0 - Live 0xffffffffc046d000
ahci 45056 1 - Live 0xffffffffc0456000
i2c_smbus 12288 1 i2c_i801, Live 0xffffffffc044c000
lpc_ich 28672 0 - Live 0xffffffffc043b000
libahci 49152 1 ahci, Live 0xffffffffc042a000
mfd_core 12288 1 lpc_ich, Live 0xffffffffc03f0000
intel_pmc_core 114688 0 - Live 0xffffffffc040b000
intel_vsec 12288 1 intel_pmc_core, Live 0xffffffffc0400000
pmt_telemetry 12288 1 intel_pmc_core, Live 0xffffffffc03f5000
pmt_class 12288 1 pmt_telemetry, Live 0xffffffffc03ea000
❯ talosctl -n 10.50.10.100 read /proc/modules | grep -i iscsi
❯ talosctl -n 10.50.10.101 read /proc/modules | grep -i iscsi
❯ talosctl -n 10.50.10.102 read /proc/modules | grep -i iscsi
❯ talosctl -n 10.50.10.103 read /proc/modules | grep -i iscsi
I've deployed the 'netshoot' troubleshooting pod with hostNetwork: true and the iSCSI tools installed (see gist here), and noticed that an iscsiadm -m discovery -t st -p <iscsi-target-ip> hangs indefinitely. However, lower-level connectivity tests (ping to the iSCSI target IP, netcat to port 3260, etc.) all succeed.
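For reference, the checks I ran from inside the netshoot pod were along these lines (the target IP is a placeholder; I've added `timeout` so the hang shows up as a visible failure instead of blocking the shell — flags assume netshoot's nc):

```shell
# Inside the hostNetwork: true netshoot pod; substitute your iSCSI portal IP.
ping -c 3 <iscsi-target-ip>             # ICMP reachability: succeeds
nc -vz -w 5 <iscsi-target-ip> 3260      # TCP handshake to the portal: succeeds
# Discovery itself hangs indefinitely, so bound it and print the exit code:
timeout 30 iscsiadm -m discovery -t st -p <iscsi-target-ip>; echo "exit: $?"
```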
I put a test machine (generic Ubuntu 24.04 LTS) on the same network as the worker nodes, installed iscsi-tools, and that same command returns my LUNs in less than a second, which is why I believe the root cause is on the Talos worker nodes. This only became a problem after I upgraded from 1.11.2 to 1.11.3, so I've either upgraded improperly or done something else strange that is causing this behavior. Is there additional configuration required for the Talos iscsid service in 1.11.3?
Environment
- Talos version: 1.11.3
- Kubernetes version: 1.34.1
- Platform: bare-metal
iSCSI support is compiled into the Linux kernel, so it doesn't need a module. As for the failure, I guess you need to investigate more to find the root cause.
Thanks for the quick reply! I appreciate the clarification that iSCSI support is compiled into the kernel. To help narrow down the issue, I've already verified:
- Low-level network connectivity (ping, netcat to port 3260) works fine from the Talos nodes to the iSCSI target
- The same iscsiadm discovery command completes successfully in under a second from a separate Ubuntu 24.04 machine on the same network
- The ext-iscsid service reports as running on all nodes
- The behavior changed immediately after upgrading from 1.11.2 to 1.11.3 (no other configuration changes)
Since direct shell access isn't available on Talos nodes, I've been testing via a hostNetwork-enabled pod with iSCSI tools, which is where I'm observing the iscsiadm -m discovery command hanging indefinitely.
Given that this appears to have started after the .2 to .3 upgrade, would you happen to know if there were any networking, containerd, or iSCSI-related changes between these versions that might affect iSCSI discovery? Additionally, are there any Talos-specific logs or kernel parameters I should examine that might reveal why the discovery is timing out despite basic connectivity working?
I'm happy to provide any additional diagnostic information that would be helpful!
I don't think it's fair to test with iscsiadm inside the container. The proper way would be to nsenter the mount namespace of PID 1 and run iscsiadm from the host (the Talos version of it).
Well... progress. I had to get the debug container running with the following:
kubectl run debug-$(date +%s) -it --rm --restart=Never --image=alpine --overrides='
{
"spec": {
"hostPID": true,
"hostNetwork": true,
"containers": [
{
"name": "debug",
"image": "alpine",
"command": ["sh"],
"stdin": true,
"tty": true,
"securityContext": {
"privileged": true
}
}
],
"nodeName": "talos-3cz-gat"
}
}'
but using nsenter with PID-1 as you suggested allowed me to run iscsiadm with:
/ # nsenter --mount="/proc/1/ns/mnt" --net="/proc/1/ns/net" -- /usr/local/sbin/iscsiadm -m discovery -t st -p 10.50.10.5
10.50.10.5:3260,1 iqn.2004-04.com.qnap:ts-873aeu-rp:iscsi.iscsi-trident-pvc-6f75b505-5dbb-4d8f
10.50.10.5:3260,1 iqn.2004-04.com.qnap:ts-873aeu-rp:iscsi.iscsi-trident-pvc-a64b1f46-8ad8-43f8
<snip>
10.50.10.5:3260,1 iqn.2004-04.com.qnap:ts-873aeu-rp:iscsi.iscsi-trident-pvc-1fc3189e-eebf-4167-a7b5-6bbc3ebb8433.7b359e
Which is a lot further than I got before, thank you. What else could be stopping my CSI from getting these PVCs across to their respective pods?
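For anyone landing here later, the CSI-side things I went on to check were roughly the following (the trident namespace and the pod/PVC names are assumptions for a typical install; adjust to your environment):

```shell
# Are the Trident controller and node pods healthy? (namespace is an assumption)
kubectl get pods -n trident -o wide

# Events on a stuck PVC usually name the exact attach/login failure
kubectl describe pvc <pvc-name> -n <namespace>

# The node-side plugin logs show the iscsiadm invocations the CSI performs
kubectl logs -n trident <trident-node-pod> --tail=100
```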
Much easier way:
kubectl debug -it node/<node> --image alpine --profile=sysadmin -n kube-system
...
nsenter -t 1 -m ....
But if iscsiadm works, you need to look into your CSI.
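Spelled out, the whole flow condenses to something like this (assuming alpine's busybox provides nsenter, which it did in the transcript above; the iscsiadm path is the one the iscsi-tools extension ships):

```shell
# Open a privileged debug pod on the node (sysadmin profile grants hostPID etc.)
kubectl debug -it node/<node> --image=alpine --profile=sysadmin -n kube-system

# Inside the pod: enter PID 1's mount and network namespaces,
# then run the host's iscsiadm against your portal
nsenter -t 1 -m -n -- /usr/local/sbin/iscsiadm -m discovery -t st -p <iscsi-target-ip>
```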
Huzzah, lol. Don't suppose this CSI is anywhere in the Talos testing lineup, is it? Maybe on a list somewhere with a "ah yes, twiddle parameter X to get this to work!"?
We don't have it in the test suite, but we are open to contributions.
How might a curious mind go about contributing to such an endeavor? A link to a repo, maybe a 'contributor guideline FAQ', etc, would be most prized!
All integration tests are in this repo.
We have, e.g., a Longhorn/iSCSI test which works: https://github.com/siderolabs/talos/blob/4c095281be93cb11290eb43f60b4cc1a168bef17/internal/integration/k8s/longhorn.go#L53-L54