
MachineConfig.disks nofail and other options

Open sergelogvinov opened this issue 4 years ago • 10 comments

Currently Talos reboots the system if a disk listed under machine.disks.* is not found.

  1. It would be great to be able to skip that error; it helps when fixing disk problems or other issues.
  2. The second disk can be an HDD, SSD, or NVMe device, so it would be better to be able to set mount options such as noatime,discard.

Machine config proposal:

machine:
  disks:
    - device: /dev/sdb
      nofail: true # proposed field, default: false
      partitions:
        - mountpoint: /var/mnt/extra
          options: # proposed field, default: not set
            - noatime
            - discard
            - nofail

sergelogvinov avatar Nov 24 '21 22:11 sergelogvinov

I wonder if failures like this one should actually pause the boot sequence (to avoid further damage, e.g. by writing to the mountpoint vs. the mounted disk), but keep apid running so that the issue can be fixed or mitigated?

smira avatar Nov 25 '21 12:11 smira

The main goal is to keep the node running. Once it is up and running, Prometheus or a hardware RAID exporter can gather more details and send them to the operator.

apid does not have any metrics or daemons to collect such information.

sergelogvinov avatar Nov 25 '21 17:11 sergelogvinov

This is a tough one. I don't have any ideas. What should we do if the disk isn't there?

andrewrynhard avatar Nov 29 '21 22:11 andrewrynhard

Notes from planning meeting

I think it makes sense to split the issue:

  • mount options which are passed down to the mount() syscall (like noatime, discard) are a great idea, and we should definitely implement that
  • options which change the semantics or behavior of Talos (like nofail) are tricky: if the mount operation fails and gets skipped, the mountpoint is left empty and refers to a different disk, which might cause other cascading failures (e.g. if /var/mnt/extra was storing a database data directory). Instead of adding nofail, we would rather look towards changing Talos behavior on failures: pause the boot process and leave apid running so the operator can make a decision, either changing the machine configuration to remove the failed mount or performing other recovery procedures.
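
The split described above can be illustrated with a minimal sketch. It is not Talos code: the function name split_mount_options and the set of recognized options are hypothetical, and only the numeric flag values are taken from the Linux mount.h header. It assumes that generic options such as noatime map to mount() flag bits, that filesystem-specific options such as discard are passed through in the data string, and that behavior options such as nofail are kept separate because they never reach the syscall.

```python
# Hypothetical sketch, not Talos code. Flag values match linux/mount.h.
MS_RDONLY = 1
MS_NOSUID = 2
MS_NODEV = 4
MS_NOEXEC = 8
MS_NOATIME = 1024
MS_NODIRATIME = 2048

# Generic options that correspond to mount() flag bits.
KERNEL_FLAGS = {
    "ro": MS_RDONLY,
    "nosuid": MS_NOSUID,
    "nodev": MS_NODEV,
    "noexec": MS_NOEXEC,
    "noatime": MS_NOATIME,
    "nodiratime": MS_NODIRATIME,
}

# Options that change boot behavior instead of the mount itself.
BEHAVIOR_OPTIONS = {"nofail"}


def split_mount_options(options):
    """Return (flags, data, behavior) for a list of option strings."""
    flags = 0
    data = []      # filesystem-specific options, joined into the data string
    behavior = []  # options that are never passed to mount()
    for opt in options:
        if opt in BEHAVIOR_OPTIONS:
            behavior.append(opt)
        elif opt in KERNEL_FLAGS:
            flags |= KERNEL_FLAGS[opt]
        else:
            data.append(opt)
    return flags, ",".join(data), behavior


flags, data, behavior = split_mount_options(["noatime", "discard", "nofail"])
# flags == MS_NOATIME, data == "discard", behavior == ["nofail"]
```

With the options from the proposal, only noatime and discard would be handed to mount(); nofail is left over as the part that needs a separate design decision.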

smira avatar Dec 08 '21 17:12 smira

See also #4669

smira avatar Dec 08 '21 17:12 smira

Stopping the boot is not a good idea: it cannot fix the problem, and it requires a human to fix a non-critical issue.

When you set nofail, you usually know what you are doing. It is not a default mount flag, and many Linux users use it every day...

As another option, Talos could set a NoExecute taint on the node, which would allow only node-critical pods to run (kubelet flag --register-with-taints). This would help an automation system fix the problem (if that is possible, of course).

sergelogvinov avatar Dec 09 '21 06:12 sergelogvinov

Booting up even to kubelet might lead to catastrophic failures, which we'd rather avoid. The operator can remove the mount if it's really not necessary, but in the common case this option might open the door to critical operational mistakes.

smira avatar Dec 09 '21 20:12 smira

I found a more pressing issue for this feature: inode32. We are running dind for our build agents, and unfortunately we get EOVERFLOW errors in Docker volumes when they are located on our 4 TB drive, while they work fine on the 400 GB drive. The solution appears to be adding the inode32 mount option, even though it comes with its own downsides. The issue is documented in the "Inode numbers" section of a Red Hat article.

Remounting the volume with mount -o remount,inode32 /host/var/mnt/extra temporarily solves this issue for us until the next reboot.
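
As a rough sketch of that workaround (not a recommended or supported approach), the check could look like the following. The helper names are hypothetical, the sample line uses an assumed device and filesystem, and the text is expected to be in /proc/mounts format; only the remount command itself comes from the comment above.

```python
# Hypothetical sketch of the remount workaround; helper names are made up.

def has_mount_option(proc_mounts_text, mountpoint, option):
    """Check whether `option` is set for `mountpoint` in /proc/mounts-style text."""
    found = False
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mountpoint, fstype, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mountpoint:
            # Later entries shadow earlier ones, so the last match wins.
            found = option in fields[3].split(",")
    return found


def remount_command(mountpoint, option):
    """Build the argument list for the remount command used above."""
    return ["mount", "-o", f"remount,{option}", mountpoint]


# Sample line with an assumed device and filesystem, for illustration only.
sample = "/dev/sdb1 /var/mnt/extra xfs rw,relatime,attr2,inode64 0 0\n"

if not has_mount_option(sample, "/var/mnt/extra", "inode32"):
    cmd = remount_command("/host/var/mnt/extra", "inode32")
    # A real script would pass cmd to subprocess.run from a privileged container.
```

The option is lost on the next reboot, so something has to repeat this check after every boot.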

KarstenB avatar Sep 14 '23 11:09 KarstenB

@smira I am currently facing a real problem with the lack of other options. I really need inode32 for my mounts to run our build jobs, which use some 32-bit tools. Without it they fail in odd ways and can partially corrupt state. Is there a way to ensure that I always have inode32 other than running a script in a loop in a privileged container?

KarstenB avatar Apr 15 '24 13:04 KarstenB

We don't have a solution for this issue at the moment. #8367 is supposed to solve it, but it will come no earlier than Talos 1.8.

The issue is pretty clear, and Talos support is not great in this area at the moment.

smira avatar Apr 15 '24 13:04 smira