epic: system extensions

Open smira opened this issue 3 years ago • 2 comments

Sub-tasks

  • [x] #4812
  • [x] #4813
  • [x] #4814
  • [x] #4815
  • [x] #4816
  • [x] system extensions: building images with system extensions included
  • [x] system extensions: integration testing
  • [x] system extensions: build/tag with correct versions
  • [ ] system extensions: tool to validate the extension image

UX Flow

The user lists the system extensions they want enabled in the machine config:

machine:
  extensions:
    - image: ghcr.io/talos-systems/ntpd:v0.1.0
    - image: ghcr.io/talos-systems/nvidia-container-runtime:v0.2.0

When the installer runs, it pulls the requested additional images and puts them into the final initramfs.xz. After a reboot, the requested system extensions are enabled.

Technical Details

Extension Format

Each extension consists of a manifest and a set of squashfs filesystem images:

\
|- manifest.yaml
\- /rootfs
|  \- /usr
|    \- /etc
|      |- container.yaml
|      \- binary*
|  \- /lib
|    \- /firmware
|      \- nvidia-tegra

The manifest describes the contents of the system extension:

metadata:
  name: ntpd
  version: v0.1.0
  author: John Smith
  description: >
    This system extension runs `ntpd`, replacing the Talos built-in NTP time sync.
  compatibility:
    talos:
      version: ">= v0.15.0"
    linux:
      version: 5.15.6

Extensions should be subject to some restrictions, e.g. mount paths should be below /usr.
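
The path restriction could be checked along these lines. This is only a sketch in Python, not Talos code: `is_allowed_path` and `ALLOWED_ROOT` are hypothetical names, and the rule "strictly below /usr" is taken from the example restriction above rather than from a finished spec.

```python
from pathlib import PurePosixPath

# Assumption: the only allowed root is /usr, as in the example restriction.
ALLOWED_ROOT = PurePosixPath("/usr")


def is_allowed_path(path):
    """Return True if an absolute path lies strictly below ALLOWED_ROOT."""
    p = PurePosixPath(path)
    # Reject relative paths and any ".." component instead of resolving them.
    if not p.is_absolute() or ".." in p.parts:
        return False
    return ALLOWED_ROOT in p.parents
```

Rejecting ".." outright, rather than normalizing it, keeps the check simple and avoids surprises with paths such as /usr/../etc.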

When an extension is packaged as a container image, the whole directory structure goes into the image, which is then pushed to a container registry.

Installing Extensions

When Talos pulls the installer image, it runs the installer with the --list-extensions flag, passing in the machine configuration, and the installer outputs the list of images to be pulled:

ghcr.io/talos-systems/ntpd:v0.1.0
ghcr.io/talos-systems/nvidia-container-runtime:v0.2.0
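
The listing step amounts to walking the machine configuration and printing each image reference in order. A minimal Python illustration, assuming the configuration has already been parsed into nested dictionaries (`list_extension_images` is a hypothetical helper, not part of the installer):

```python
def list_extension_images(machine_config):
    """Return extension image references in the order they appear in the config."""
    # Assumption: the config is already parsed into nested dicts and lists.
    extensions = machine_config.get("machine", {}).get("extensions") or []
    return [entry["image"] for entry in extensions]


config = {
    "machine": {
        "extensions": [
            {"image": "ghcr.io/talos-systems/ntpd:v0.1.0"},
            {"image": "ghcr.io/talos-systems/nvidia-container-runtime:v0.2.0"},
        ]
    }
}

# Print one image reference per line, mirroring the output format shown above.
for image in list_extension_images(config):
    print(image)
```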

Talos pulls the listed images into the system containerd and runs the installer container so that the pulled images are unpacked and presented to the container under /opt/extensions. Each extension is validated: whether it is compatible with the version of Talos being installed, whether its contents are valid, and so on. If an extension is invalid, the installation is aborted.
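
The compatibility part of that validation might look like the sketch below. It is written in Python for illustration and rests on several assumptions: manifests are already parsed into dictionaries shaped like the manifest example, only the `>=` operator is supported, and versions are plain `vX.Y.Z` strings. All function names are hypothetical.

```python
import re


def parse_version(text):
    """Parse 'v0.15.0' or '0.15.0' into a tuple of ints such as (0, 15, 0)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unsupported version format: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_compatible(constraint, talos_version):
    """Check a '>= vX.Y.Z' constraint against the Talos version being installed."""
    # Assumption: only '>=' is handled, as in the manifest example.
    operator, _, minimum = constraint.strip().partition(" ")
    if operator != ">=":
        raise ValueError(f"unsupported constraint: {constraint!r}")
    return parse_version(talos_version) >= parse_version(minimum)


def validate_extensions(manifests, talos_version):
    """Raise on the first incompatible extension so the installation can abort."""
    for manifest in manifests:
        meta = manifest["metadata"]
        constraint = meta["compatibility"]["talos"]["version"]
        if not is_compatible(constraint, talos_version):
            raise RuntimeError(f"extension {meta['name']} requires Talos {constraint}")
```

Comparing integer tuples avoids the classic string-comparison mistake where "v0.9.0" would sort after "v0.15.0". Pre-release suffixes are deliberately rejected here rather than guessed at.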

The Talos installer decompresses the bundled initramfs.xz, builds the final extensions.yaml, and puts the extensions and related files into the initramfs.xz.

The Talos installer should validate that the /boot partition has enough space to hold the boot assets before writing them.
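
A free-space check of this kind can be quite small. The Python sketch below is only an illustration of the idea: `has_enough_space` and its `reserve_bytes` safety margin are hypothetical, and it ignores the space that would be freed by overwriting existing boot assets.

```python
import shutil


def has_enough_space(mount_point, asset_sizes, reserve_bytes=0):
    """Return True if the filesystem at mount_point can hold all assets."""
    # Total bytes needed: every asset plus an optional safety margin.
    required = sum(asset_sizes) + reserve_bytes
    free = shutil.disk_usage(mount_point).free
    return free >= required
```

The installer would call something like `has_enough_space("/boot", sizes)` with the sizes of the kernel and the rebuilt initramfs.xz, and abort before writing anything if the result is false.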

Format of extensions.yaml:

layers:
  - image: ghcr.io-talos-systems-ntpd-v0.1.0-etc.sqshfs
  - image: ghcr.io-talos-systems-ntpd-v0.1.0-lib.sqshfs
  - image: ghcr.io-talos-systems-nvidia-container-runtime-v0.2.0-usr.sqshfs

Layers are written in the order they are mentioned in the machine configuration.
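
Judging from the example, a layer file name appears to be the image reference with "/" and ":" replaced by "-", followed by the top-level directory and the .sqshfs suffix. The Python sketch below encodes that inferred rule; the rule itself and the names `layer_filename` and `build_layers` are assumptions, not a confirmed naming scheme.

```python
import re


def layer_filename(image_ref, subtree):
    """Derive a layer file name from an image reference and a top-level directory."""
    # Inferred from the example: '/' and ':' in the reference become '-'.
    flattened = re.sub(r"[/:]", "-", image_ref)
    return f"{flattened}-{subtree}.sqshfs"


def build_layers(extensions):
    """Build the 'layers' document, preserving machine configuration order."""
    layers = []
    for image_ref, subtrees in extensions:
        for subtree in subtrees:
            layers.append({"image": layer_filename(image_ref, subtree)})
    return {"layers": layers}
```

Passing the extensions as an ordered list of (image, directories) pairs keeps the ordering guarantee explicit instead of relying on dictionary iteration order.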

Mounting Extensions on Boot

When the first init process (before machined) starts, it inspects the contents of the initramfs: if extensions.yaml is present, then after mounting the initial rootfs.sqshfs to /, init mounts all layers as instructed using read-only overlayfs mounts.
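
One possible way to express such a read-only mount is a single overlayfs with several lower directories and no upper directory. The Python sketch below only builds the mount options string; whether init uses one combined overlay or one mount per path is not specified here, and the mount point names are hypothetical.

```python
def overlay_options(lower_dirs_bottom_to_top):
    """Return the options string for a read-only overlayfs mount.

    overlayfs stacks 'lowerdir' entries from right to left, so the leftmost
    entry is the top layer. Omitting 'upperdir' and 'workdir' makes the
    mount read-only.
    """
    if not lower_dirs_bottom_to_top:
        raise ValueError("at least one lower directory is required")
    return "lowerdir=" + ":".join(reversed(lower_dirs_bottom_to_top))


# Hypothetical mount points: the base rootfs at the bottom, extension layers above it.
print(overlay_options(["/rootfs", "/layers/ntpd-etc", "/layers/ntpd-lib"]))
```

The resulting string would be passed as the data argument of a mount call with filesystem type "overlay".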

Use Cases

Custom Container Runtime

Examples: NVIDIA container runtime, gVisor, Kata containers, etc.

Requires: a new runc binary and additional configuration for containerd. Adding extra files should be enough. The actual container runtime can be configured in the Kubernetes pod spec.

Custom Service

Examples: additional API layer, HTTP server which runs early on boot.

Requires: a container rootfs and a container spec. Adding extra files is enough: Talos will pick up the container from the spec and run it as a service.

Additional files can be written via .files in the machine configuration; the service can wait for them to exist before starting.

Custom Kernel Module

Example: NVIDIA kernel module.

The only requirement is to put the modules in /lib/modules/<kernel-release>; each module should be built for the exact version of Talos.

The module is loaded via the machine configuration.

COSI Plugin

Example: custom ntpd.

Works the same as a custom service, but requires a connection to the COSI runtime (API) with a restricted level of access. The plugin might publish its own resources, consume Talos resources, and inject Talos resources (e.g. to configure additional addresses or links).

This requires protobufs for resources and some API/CLI to work with resources, replacing parts of Talos.

Tasks

  • implement mounting extensions as overlayfs in app/init
  • installing extensions (installer): verifying, packing, writing to initramfs
  • example extension: gVisor container runtime (or any other container runtime)
  • documentation for the extension spec
  • bldr support for producing extensions more easily

smira, Jan 17 '22 15:01

Can this also be used to load kernel modules (let's say, zfs support)?

The paths from https://github.com/talos-systems/extensions seem to suggest it's not…

flokli, Jan 26 '22 08:01

It might need more work, as we need to handle module metadata files in some reasonable way (if it's more than one module). But it's one of the goals.

smira, Jan 26 '22 12:01