talos MachineConfig.disks nofail and other options

Currently Talos reboots the system if a device listed under machine.disks is not found.

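It would be great to be able to skip that error: a node that keeps running makes it easier to fix disk problems or other issues.
The second disk can be an HDD, SSD, or NVMe drive, so it would also be good to be able to set mount options such as noatime and discard.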

machineconfig proposal

machine:
  disks:
    - device: /dev/sdb
      nofail: true # default: false
      partitions:
        - mountpoint: /var/mnt/extra
          options: # default: not set
            - noatime
            - discard
            - nofail
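
To make the proposed disk-level semantics concrete, here is a minimal sketch in Python. It is not Talos code: the `Disk` class, the `plan_mounts` function, and the `nofail` field are hypothetical names that only mirror the proposal above. A missing disk marked nofail is skipped, and any other missing disk is still an error.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    device: str
    nofail: bool = False  # proposed field, not part of the current schema


def plan_mounts(disks, present_devices):
    """Return (to_mount, skipped); raise if a required disk is missing."""
    to_mount, skipped = [], []
    for disk in disks:
        if disk.device in present_devices:
            to_mount.append(disk.device)
        elif disk.nofail:
            # Proposed behavior: tolerate the missing disk and keep booting.
            skipped.append(disk.device)
        else:
            # Stand-in for today's behavior, where a missing disk is fatal.
            raise FileNotFoundError(f"required disk {disk.device} not found")
    return to_mount, skipped
```

For example, `plan_mounts([Disk("/dev/sdb", nofail=True)], set())` returns an empty mount list and reports /dev/sdb as skipped instead of raising.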

Nov 24 '21 22:11 sergelogvinov

I wonder if failures like this one should actually pause the boot sequence (to avoid further damage, e.g. by writing to the mountpoint vs. the mounted disk), but keep apid running so that the issue can be fixed or mitigated?

Nov 25 '21 12:11 smira

The main goal is to keep the node running. Once the node is up and running, Prometheus and the hardware RAID exporter can gather more details and send them to the operator.

apid does not have any metrics/daemons to collect such information.

Nov 25 '21 17:11 sergelogvinov

This is a tough one. I don't have any ideas. What should we do if the disk isn't there?

Nov 29 '21 22:11 andrewrynhard

Notes from planning meeting

I think it makes sense to split the issue:

mount options which are passed down to the mount() syscall (like noatime, discard) are a great idea, and we should definitely implement that
options which change the semantics or behavior of Talos (like nofail) are tricky: if the mount operation fails and gets skipped, the mountpoint is left empty and refers to a different disk, which might cause other cascading failures (e.g. if /var/mnt/extra was storing a database data directory). Instead of adding nofail, we would rather look towards changing Talos behavior on failures to pause the boot process and leave apid running for the operator to make a decision: change the machine configuration to remove the failed mount, or perform other recovery procedures.
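
The split described in these notes can be sketched as follows. This is an illustration in Python, not Talos code: `split_options` and the two lookup tables are hypothetical names, and the flag table is deliberately incomplete. The numeric values are the Linux MS_* constants. Options such as noatime become bits in the flags argument of mount(), filesystem-specific options such as discard travel in the data string, and nofail is never seen by the syscall at all, which is why it changes behavior rather than mount parameters.

```python
# Flag values as defined in the Linux header <linux/mount.h>.
MS_RDONLY = 1
MS_NOSUID = 2
MS_NODEV = 4
MS_NOEXEC = 8
MS_NOATIME = 1024
MS_NODIRATIME = 2048

# Options that map to bits in the flags argument of mount(2).
FLAG_OPTIONS = {
    "ro": MS_RDONLY,
    "nosuid": MS_NOSUID,
    "nodev": MS_NODEV,
    "noexec": MS_NOEXEC,
    "noatime": MS_NOATIME,
    "nodiratime": MS_NODIRATIME,
}

# Options interpreted by userspace tooling, never by mount(2) itself.
BEHAVIOR_OPTIONS = {"nofail"}


def split_options(options):
    """Split option strings into (flags, data, behavior)."""
    flags = 0
    data = []
    behavior = []
    for opt in options:
        if opt in BEHAVIOR_OPTIONS:
            behavior.append(opt)
        elif opt in FLAG_OPTIONS:
            flags |= FLAG_OPTIONS[opt]
        else:
            # Anything else is left for the filesystem driver to parse.
            data.append(opt)
    return flags, ",".join(data), behavior
```

With the options from the proposal, `split_options(["noatime", "discard", "nofail"])` yields the MS_NOATIME bit as flags, "discard" as the data string, and nofail in the separate behavior list.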

Dec 08 '21 17:12 smira

See also #4669

Dec 08 '21 17:12 smira

Stopping the boot is not a good idea: it cannot fix the problem, and it requires a human to fix a non-critical issue.

When you set nofail, you usually know what you are doing. It is not a default mount flag, and many Linux users use it every day.

As another option, Talos could set a NoExecute taint on the node, which would allow only node-critical pods to run (kubelet flag --register-with-taints). This would let an automation system fix the problem, if that is possible of course.

Dec 09 '21 06:12 sergelogvinov

Booting up even to kubelet might lead to catastrophic failures, which we'd rather avoid. The operator can remove the mount if it's really not necessary, but the common case might be that this option opens the door to critical operational mistakes.

Dec 09 '21 20:12 smira

I found a more pressing issue for this feature: inode32. We are running dind for our build agents, and unfortunately we get EOVERFLOW errors in docker volumes when they are located on our 4 TB drive, while they work fine on the 400 GB drive. The solution appears to be adding the inode32 mount option, even though it comes with its own downsides. The issue is documented here: Red Hat article, in "Inode numbers".

Remounting the volume with mount -o remount,inode32 /host/var/mnt/extra temporarily solves this issue for us until the next reboot.
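
Since the remount does not survive a reboot, it can help to check whether the option is currently in effect. The following is a rough sketch in Python, not something Talos provides: `mount_options` and `has_inode32` are hypothetical helpers, and the sketch assumes the volume is XFS and that the kernel lists inode32 among the options in /proc/self/mounts.

```python
def mount_options(mounts_text, mountpoint):
    """Return the option list of the last entry for mountpoint, or None."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mountpoint, fstype, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mountpoint:
            # Later entries shadow earlier ones on the same mountpoint.
            found = fields[3].split(",")
    return found


def has_inode32(mounts_text, mountpoint):
    """True if the mountpoint is mounted and lists the inode32 option."""
    opts = mount_options(mounts_text, mountpoint)
    return opts is not None and "inode32" in opts
```

On a live system this would be called as `has_inode32(open("/proc/self/mounts").read(), "/host/var/mnt/extra")`. Mountpoints containing spaces appear escaped in that file and are not handled here.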

Sep 14 '23 11:09 KarstenB

@smira I am currently facing a real problem with the lack of other options. I really need inode32 on my mounts to run our build jobs, which use some 32-bit tools. Without it they fail in odd ways and can partially corrupt state. Is there a way to ensure that I always have inode32 other than running a script in a loop in a privileged container?

Apr 15 '24 13:04 KarstenB

We don't have a solution for this issue at the moment. #8367 is supposed to solve it, but it will come no earlier than Talos 1.8.

The issue is pretty clear, and Talos support is not great in this area at the moment.

Apr 15 '24 13:04 smira
