
UserVolumeConfig doesn't format device correctly

Open Syntax3rror404 opened this issue 6 months ago • 3 comments

Bug Report

Talos doesn't correctly format a disk that was wiped with the command talosctl wipe disk nvme0n1 --drop-partition -n 192.168.35.31.

---
apiVersion: v1alpha1
kind: UserVolumeConfig
name: longhorn
provisioning:
  diskSelector:
    match: disk.transport == "nvme" && disk.model == "WD_BLACK SN770 2TB" && !system_disk
  maxSize: 1800GB
filesystem:
  type: xfs

Expected a volume of 1800GB (~1.64 TiB), but got a 187 GiB volume instead (see logs).
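The size mismatch can be checked with some quick arithmetic. The sketch below assumes that maxSize: 1800GB means decimal gigabytes (10^9 bytes) and that the "187 GiB" in the log is binary (2^30 bytes); neither unit interpretation is confirmed in this issue. It also computes how much space would remain on the disk after an 1800GB partition, which may be worth comparing against the size of the partition that was actually created.

```python
# Unit-conversion sketch; the decimal/binary interpretation is an assumption.
GB = 10**9    # decimal gigabyte
GiB = 2**30   # binary gibibyte
TiB = 2**40   # binary tebibyte

requested_bytes = 1800 * GB     # maxSize from the UserVolumeConfig
disk_bytes = 2000398934016      # spec.size from "talosctl get disk"
created_bytes = 187 * GiB       # partition size reported in the log

requested_tib = requested_bytes / TiB
created_fraction = created_bytes / requested_bytes
# Space that would be left over if an 1800GB partition already occupied the disk.
leftover_gib = (disk_bytes - requested_bytes) / GiB

print(f"requested: {requested_tib:.2f} TiB")
print(f"request fits on disk: {requested_bytes <= disk_bytes}")
print(f"created partition is {created_fraction:.1%} of the request")
print(f"leftover after an 1800GB partition: {leftover_gib:.1f} GiB")
```

If the leftover figure comes out close to the created partition size, that would be consistent with the new volume landing in the free space behind a still-existing first partition. That reading is an inference from the numbers, not something the logs state directly.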

Description

Matched disk: disk.transport == "nvme" && disk.model == "WD_BLACK SN770 2TB" && !system_disk

talosctl get disk nvme0n1 -n 192.168.35.31 -o yaml
node: 192.168.35.31
metadata:
    namespace: runtime
    type: Disks.block.talos.dev
    id: nvme0n1
    version: 2
    owner: block.DisksController
    phase: running
    created: 2025-06-15T00:16:34Z
    updated: 2025-06-15T00:16:38Z
spec:
    dev_path: /dev/nvme0n1
    size: 2000398934016
    pretty_size: 2.0 TB
    io_size: 512
    sector_size: 512
    readonly: false
    cdrom: false
    model: WD_BLACK SN770 2TB
    serial: "<Redacted>"
    wwid: eui.e8238fa6bf530001001b448b47fe9711
    uuid: e8238fa6-bf53-0001-001b-448b47fe9711
    bus_path: /pci0000:00/0000:00:02.3/0000:03:00.0/nvme
    sub_system: /sys/class/block
    transport: nvme
    symlinks:
        - /dev/disk/by-diskseq/16
        - /dev/disk/by-id/nvme-WD_BLACK_SN770_2TB_<Redacted>
        - /dev/disk/by-id/nvme-WD_BLACK_SN770_2TB_<Redacted>
        - /dev/disk/by-id/nvme-eui.e8238fa6bf530001001b448b47fe9711
        - /dev/disk/by-path/pci-0000:03:00.0-nvme-1

Logs

 user: warning: [2025-06-15T13:52:38.632451599Z]: [talos] locking block device "nvme0n1"
 user: warning: [2025-06-15T13:52:38.632468599Z]: [talos] wiping block device "nvme0n1" with fast method
 user: warning: [2025-06-15T13:52:39.368622599Z]: [talos] block device "nvme0n1" wiped by ranges: 0-1024, 2000398933504-2000398934016
 user: warning: [2025-06-15T13:52:40.766573599Z]: [talos] locking block device "nvme0n1"
 user: warning: [2025-06-15T13:52:40.766599599Z]: [talos] wiping block device "nvme0n1" with fast method
 user: warning: [2025-06-15T13:52:40.769034599Z]: [talos] block device "nvme0n1" wiped with fast method
 user: warning: [2025-06-15T13:55:53.807744599Z]: [talos] locking block device "nvme0n1"
 user: warning: [2025-06-15T13:55:53.807762599Z]: [talos] wiping block device "nvme0n1" with fast method
 user: warning: [2025-06-15T13:55:54.293659599Z]: [talos] block device "nvme0n1" wiped with fast method
 user: warning: [2025-06-15T13:56:45.609145599Z]: [talos] apply config request: mode auto(no_reboot)
 kern:  notice: [2025-06-15T13:56:45.618423599Z]: XFS (nvme1n1p5): Mounting V5 Filesystem 43650210-278f-4f0a-91a5-cbd31fd69f3a
 kern:    info: [2025-06-15T13:56:45.624695599Z]: XFS (nvme1n1p5): Ending clean mount
 user: warning: [2025-06-15T13:56:45.625656599Z]: [talos] volume mount {"component": "controller-runtime", "controller": "block.MountController", "volume": "STATE", "source": "/dev/nvme1n1p5", "target": "/system/state", "filesystem": "xfs", "read_only": false}
 user: warning: [2025-06-15T13:56:45.626514599Z]: [talos] machine configuration persisted to STATE {"component": "controller-runtime", "controller": "config.PersistenceController"}
 kern:  notice: [2025-06-15T13:56:45.633014599Z]: XFS (nvme1n1p5): Unmounting Filesystem 43650210-278f-4f0a-91a5-cbd31fd69f3a
 user: warning: [2025-06-15T13:56:45.701463599Z]: [talos] volume unmount {"component": "controller-runtime", "controller": "block.MountController", "volume": "STATE", "source": "/dev/nvme1n1p5", "target": "/system/state", "filesystem": "xfs"}
 user: warning: [2025-06-15T13:56:45.713874599Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "u-longhorn", "phase": "waiting -> failed", "error": "error creating partition: error writing GPT: failed to delete partition 1: device or resource busy"}
 user: warning: [2025-06-15T13:56:45.859510599Z]: [talos] partition created {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "u-longhorn", "disk": "/dev/nvme0n1", "partition": 2, "label": "u-longhorn", "size": "187 GiB"}
 user: warning: [2025-06-15T13:56:45.962268599Z]: [talos] formatting filesystem {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "u-longhorn", "device": "/dev/nvme0n1p2", "filesystem": "xfs"}
 user: warning: [2025-06-15T13:56:46.085825599Z]: [talos] volume status {"component": "controller-runtime", "controller": "block.VolumeManagerController", "volume": "u-longhorn", "phase": "failed -> ready", "location": "/dev/nvme0n1p2", "parentLocation": "/dev/nvme0n1"}
 kern:  notice: [2025-06-15T13:56:46.087349599Z]: XFS (nvme0n1p2): Mounting V5 Filesystem 47b3900a-1794-44cf-a282-f7b14b14d820
 kern:    info: [2025-06-15T13:56:46.094907599Z]: XFS (nvme0n1p2): Ending clean mount
 user: warning: [2025-06-15T13:56:46.096794599Z]: [talos] volume mount {"component": "controller-runtime", "controller": "block.MountController", "volume": "u-longhorn", "source": "/dev/nvme0n1p2", "target": "/var/mnt/longhorn", "filesystem": "xfs", "read_only": false}

Environment

  • Talos version: v1.10.4
  • Kubernetes version: v1.32.5
  • Platform: Baremetal

Syntax3rror404 avatar Jun 15 '25 14:06 Syntax3rror404

After a reboot and wiping again it now works, so it looks like something with the wipe command doesn't work as expected.

Syntax3rror404 avatar Jun 15 '25 14:06 Syntax3rror404

Hey @Syntax3rror404 how did you apply the UserVolumeConfig? I can't find anything in the docs on how to apply these configs in Talos :/

ares-b avatar Jun 16 '25 16:06 ares-b

@ares-b it goes below the machine config, separated by the YAML --- separator. Then you can apply it with apply-config.

Edit: here you can find this in the docs https://www.talos.dev/v1.10/reference/configuration/
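As a rough illustration of the layout described above, the sketch below joins a machine config and a UserVolumeConfig into one multi-document YAML string. The helper name append_config_document and the placeholder machine config are made up for this example; only the UserVolumeConfig body comes from the issue.

```python
def append_config_document(machine_config: str, extra_document: str) -> str:
    """Return machine_config followed by extra_document as a separate YAML document."""
    return machine_config.rstrip("\n") + "\n---\n" + extra_document.strip("\n") + "\n"


# UserVolumeConfig as posted in the issue.
USER_VOLUME = """\
apiVersion: v1alpha1
kind: UserVolumeConfig
name: longhorn
provisioning:
  diskSelector:
    match: disk.transport == "nvme" && disk.model == "WD_BLACK SN770 2TB" && !system_disk
  maxSize: 1800GB
filesystem:
  type: xfs
"""

# Placeholder standing in for the real machine config (not a complete config).
machine_config = "version: v1alpha1\nmachine:\n  type: worker\n"

combined = append_config_document(machine_config, USER_VOLUME)
print(combined)
```

Saving the combined text to a file (for example a hypothetical worker.yaml) and running talosctl apply-config -n 192.168.35.31 -f worker.yaml would then apply both documents together.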

Syntax3rror404 avatar Jun 16 '25 16:06 Syntax3rror404

This might happen when partition management is forced to insert more partitions "before" existing partitions.

E.g. it was:

p1 | p2 

Then p1 is removed while p2 is still in use, and two new partitions have to be inserted before it:

p1 | <new-p2> | p2 <should be p3>

Linux can't handle this without a reboot.
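The scenario above can be illustrated with a toy model. It assumes that partition numbers simply follow the on-disk order, as the diagram suggests, and it is not Talos's actual partitioning code; the function name and labels are invented for this sketch.

```python
def renumbered_in_use(before, after, in_use):
    """Return {label: (old_number, new_number)} for in-use partitions whose number changes.

    before/after are lists of partition labels in on-disk order;
    a partition's number is its 1-based position in the list.
    """
    old = {label: i for i, label in enumerate(before, start=1)}
    new = {label: i for i, label in enumerate(after, start=1)}
    return {
        label: (old[label], new[label])
        for label in in_use
        if label in old and label in new and old[label] != new[label]
    }


# p1 | p2, with p2 mounted; p1 is dropped and two new partitions go in front.
conflicts = renumbered_in_use(
    before=["old-p1", "data"],
    after=["new-p1", "new-p2", "data"],
    in_use={"data"},
)
print(conflicts)

# Appending after the in-use partition leaves its number untouched.
no_conflicts = renumbered_in_use(
    before=["old-p1", "data"],
    after=["old-p1", "data", "new-p3"],
    in_use={"data"},
)
print(no_conflicts)
```

Any entry in the first result is a mounted partition that would have to change its number while in use, which is the case described as needing a reboot.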

We'll need some additional data (talosctl support) when this happens to investigate further.

smira avatar Jun 27 '25 10:06 smira

@smira I think that after [talos] wiping block device "nvme0n1" there shouldn't be any partition like p1 or p2 left on the device.

This issue is reproducible; I hit it on 6 out of 6 nodes.

  1. Create a UserVolumeConfig
  2. Use it with a StorageClass such as Longhorn or Rook
  3. Delete the UserVolumeConfig
  4. Apply the manifest with the UserVolumeConfig again
  5. Talos gets stuck
  6. Delete the UserVolumeConfig
  7. Reboot the node
  8. Apply the manifest with the UserVolumeConfig again
  9. Now the volume is present again

With machine.disks you can try the same and it will work without a reboot.

If the partition is left on the device, there could be a bug in the wiping/mounting mechanism.

Syntax3rror404 avatar Jun 27 '25 13:06 Syntax3rror404

It's not clear in your sequence - do you wipe a disk in between?

smira avatar Jun 27 '25 13:06 smira