[Feature Request] DGX Spark Support

Open 0x77dev opened this issue 1 month ago • 16 comments

Feature Request

Talos Linux support for NVIDIA DGX Spark.

Summary

~~When building Talos Linux ARM64 images with bootloader: grub, the imager places a PE32+ EFI executable (UKI) at /boot/vmlinuz instead of a raw ARM64 kernel. This causes GRUB's linux command to fail, as it cannot execute EFI applications.~~

~~Request: When bootloader: grub is specified, the imager should extract and place the raw kernel from the UKI so GRUB can boot it.~~

Update: Confirmed UKI works on v1.11.5, but still experiencing two issues, with USB and onboard Ethernet.

Hardware Platform

NVIDIA DGX Spark

The DGX Spark is NVIDIA's desktop AI workstation based on the Grace Blackwell architecture:

System Architecture:

  • SoC: NVIDIA GB10 Grace Blackwell (integrated CPU + GPU)
  • CPU: 20-core ARM64 (4× Cortex-X925 performance cores + 16× Cortex-A725 efficiency cores)
  • Memory: 128GB unified LPDDR5X (shared between CPU and GPU)
  • GPU: Blackwell architecture with 5th gen Tensor cores
  • Storage: Samsung NVMe PCIe 4.0
  • Network: 4× Mellanox ConnectX-7 (MT2910 Family)
    • 2× 100GbE QSFP ports (RoCE capable)
    • 1× 10GbE RJ45 (Realtek r8152)
    • 1× Wi-Fi 7 (MediaTek MT7925E)

Boot Environment:

  • Firmware: AMI UEFI (ARM64)
  • Current OS: DGX OS (Ubuntu 24.04-based) boots via: shimaa64.efi → grubaa64.efi → kernel
  • Secure Boot: Disabled for testing

Console Access:

  • Serial: UART at MMIO address 0x16A00000, 921600 baud
  • Early console: earlycon=uart,mmio32,0x16A00000
  • Runtime: console=ttyS0,921600

Problem Description

Outdated: Boot Issues

Attempt 1: Standard Talos v1.11.3 ARM64 ISO

Downloaded metal-arm64.iso from Image Factory (default schematic).

Boot structure:

/EFI/BOOT/BOOTAA64.EFI
/EFI/Linux/Talos-v1.11.3.efi

Result: DGX Spark UEFI firmware did not recognize or attempt to boot from the USB device.

Attempt 2: Custom v1.11.3 ISO with GRUB

Built custom ISO with explicit GRUB bootloader:

Imager profile:

arch: arm64
platform: metal
version: v1.11.3

output:
  kind: iso
  imageOptions:
    bootloader: grub

customization:
  extraKernelArgs:
    - earlycon=uart,mmio32,0x16A00000
    - console=ttyS0,921600
    - console=tty0
    - init_on_alloc=0

Build command:

cat profile.yaml | docker run --rm -i \
  -v $PWD:/out \
  -v /dev:/dev --privileged \
  ghcr.io/siderolabs/imager:v1.11.3 \
  -

Result:

  • ISO built (267MB)
  • GRUB menu appeared on DGX Spark (first progress!)
  • Selecting "Talos ISO" failed with:
    ../src/boot/boot.c:2560@ image_start: Error loading \EFI\Linux\Talos-v1.11.3.efi: Not found
    

The error path src/boot/boot.c is from systemd-boot source, suggesting the PE32+ executable contains systemd-boot stub code.

Attempt 3: Test v1.9.3 (when GRUB was default)

Tested v1.9.3 with the same GRUB profile to see if behavior changed between versions.

Result: Same issue - /boot/vmlinuz is still PE32+ executable.

Attempt 4: Disk Image Build

Also tried kind: image with bootloader: grub:

Result:

  • Created partitions (EFI, BIOS, BOOT, META)
  • Installed GRUB with grub-install --removable --target=arm64-efi
  • Same PE32+ kernel at /A/vmlinuz
  • Build error: unsupported disk format: unknown

Build Process Observation

Even with bootloader: grub specified, the imager builds a UKI:

building UKI...
    copying /usr/install/arm64/systemd-boot.efi to /tmp/imager.../systemd-boot.efi
building UKI...
    assembling UKI
UKI ready
building ISO...
    copying /usr/install/arm64/vmlinuz to /tmp/imager.../iso/boot/vmlinuz

The file /usr/install/arm64/vmlinuz in the installer-base container appears to be the UKI, and the imager copies it directly to the boot path where GRUB expects a raw kernel.

Attempted Workaround

Extracted raw kernel from UKI using dd:

dd if=boot/vmlinuz of=vmlinuz-raw bs=4096 skip=1 count=4768
file vmlinuz-raw  # Shows as "data", 19MB

Not yet tested on hardware.
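
The fixed dd offsets above only hold for one particular build. A UKI is a PE/COFF image that carries the kernel in a section named .linux, so the section table can locate it by name instead. Here is a minimal Python sketch under that assumption. The function names pe_sections and extract_linux are hypothetical, and whether GRUB can boot the extracted kernel on DGX Spark is untested.

```python
import struct


def pe_sections(data: bytes) -> dict:
    """Return {section_name: (file_offset, size)} for a PE/COFF image."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE image: missing MZ magic")
    # Offset of the PE signature is stored at 0x3C in the DOS header.
    (pe_off,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("not a PE image: missing PE signature")
    # COFF file header (20 bytes) follows the 4-byte signature.
    _machine, nsections, _ts, _symptr, _nsyms, opt_size, _chars = struct.unpack_from(
        "<HHIIIHH", data, pe_off + 4
    )
    # Section table starts right after the optional header.
    table = pe_off + 4 + 20 + opt_size
    sections = {}
    for i in range(nsections):
        name, vsize, _vaddr, raw_size, raw_off = struct.unpack_from(
            "<8sIIII", data, table + 40 * i
        )
        # Raw data is padded to the file alignment; prefer the virtual size.
        size = min(vsize, raw_size) if vsize else raw_size
        sections[name.rstrip(b"\0").decode("ascii", "replace")] = (raw_off, size)
    return sections


def extract_linux(data: bytes) -> bytes:
    """Return the contents of the .linux section (the embedded kernel)."""
    off, size = pe_sections(data)[".linux"]
    return data[off:off + size]
```

For example, reading boot/vmlinuz as bytes, passing it to extract_linux, and writing the result to vmlinuz-raw would replace the dd step. A KeyError for ".linux" means the file is not a UKI.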

Feature Request

When output.imageOptions.bootloader is set to grub for ARM64, the imager could:

  1. Extract raw kernel from the UKI's .linux section
  2. Place raw kernel at /boot/vmlinuz (ISO) or /A/vmlinuz (disk image)
  3. Configure GRUB to load this raw kernel

This would enable GRUB-based boot on ARM64 UEFI systems that don't properly support systemd-boot/UKI.

Testing Offer

I have access to 3x DGX Spark and can test any fixes:

  • Serial port access
  • Remote access via PiKVM
  • Multiple boot methods (USB, virtual media, NVMe installation)
  • Hardware compatibility (ConnectX-7, NVMe, etc.)

Happy to iterate on testing and provide detailed feedback.

Let me know if you need any additional context/info or logs!

0x77dev avatar Nov 08 '25 04:11 0x77dev

For Reference: NVIDIA Recovery Media

NVIDIA provides recovery media (dgx-spark-recovery-image-1.91.51-1.tar.gz) containing a minimal boot environment ("FastOS") that successfully boots on DGX Spark via UEFI and flashes the OS to internal storage.

Structure observed:

/efi/boot/
  ├── bootaa64.efi
  ├── grubaa64.efi
  └── mmaa64.efi
/boot/
/vmlinuz         # Presumably raw kernel
/initrd
/fw/             # Firmware blobs
fastos.partab    # Partition table template

NVIDIA's recovery media boots successfully on the same UEFI firmware using GRUB. Examining how their kernel is packaged (likely raw, not UKI) could inform the proper approach for Talos GRUB images on ARM64.

0x77dev avatar Nov 08 '25 04:11 0x77dev

FYI, @rothgar was able to boot normal UKI images.

frezbo avatar Nov 08 '25 05:11 frezbo

FYI, @rothgar successfully booted normal UKI images.

I just tried the 1.11.5 arm64 ISO from image factory (with secure boot), and it worked perfectly.

I apologize for wasting your time and attention. Thank you!

0x77dev avatar Nov 08 '25 05:11 0x77dev

Reopening the issue. I'm currently seeing two problems on v1.11.5:

  1. The onboard 10GbE RJ45 port (based on Realtek r8152) is not picked up by Talos. I suspect the kernel does not yet include the updated driver: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=f24f7b2f3af9e008ded20f804d7829ee2efd43f2
  2. USB ports do not work after boot. A mouse or keyboard does not work via PiKVM, when connected directly, or through a USB hub.

Side notes for those trying to get Talos on Spark:

  • When using PiKVM, do not enable Flash mode for Mass Storage Device emulation; leave it as CD/DVD. In Flash mode the .iso won't boot and displays the error I initially reported when opening this issue.
  • PXE via booter works without any issues when:
    1. Secure Boot is disabled in the BIOS
    2. PXE and Network Stack are enabled
    3. Static configuration or DHCP is set for PXE

Screenshot of v1.11.5 booting, confirming that UKI works:

[Image]

0x77dev avatar Nov 09 '25 04:11 0x77dev

Got 10GbE ethernet working!

  • https://github.com/siderolabs/pkgs/pull/1367
  • https://github.com/siderolabs/extensions/pull/877

Can build an .iso like this:

docker run --rm -t \
  -v $PWD/_out:/out \
  ghcr.io/0x77dev/talos-imager-dgx-spark:v1.11.5 iso \
  --arch arm64 \
  --system-extension-image ghcr.io/0x77dev/realtek-r8127:11.015.00-v1.11.0-29-gaee690b-dirty@sha256:5936f78831198dab253cebe1e72ac9f5ed66cd51786644ca270106919208d080

And a config patch (necessary if you plan to install):

machine:
  install:
    # Use custom installer with RTL8127 and NVIDIA extensions baked in
    image: ghcr.io/0x77dev/installer-dgx-spark:v1.11.5
    
    # DGX Spark console configuration
    extraKernelArgs:
      - earlycon=uart,mmio32,0x16A00000
      - console=ttyS0,921600
      - console=tty0

Image

0x77dev avatar Nov 10 '25 05:11 0x77dev

Thanks for the realtek packages. That was probably the last missing piece so network doesn't cut out when the machine reboots.

rothgar avatar Nov 10 '25 14:11 rothgar

I don't think so, this is targeting 1.11.

1.12 alpha 2 already uses a Linux kernel where this driver comes as standard (https://www.phoronix.com/news/Linux-6.16-Realtek-RTL8127A), so neither of these is needed.

Image

danacr avatar Nov 10 '25 15:11 danacr

@danacr – https://github.com/siderolabs/pkgs/pull/1367#issuecomment-3512937496

0x77dev avatar Nov 10 '25 17:11 0x77dev

Hi @0x77dev! Have you tried our 1.12.0-beta.0 build? The new version contains the Ethernet driver and might also fix the USB issues; I suspect those are caused by a missing driver as well.

shanduur avatar Nov 24 '25 10:11 shanduur

The machine boots with 1.12.0-beta.0, but the nvidia-persistenced service fails to start. The kernel modules load, but I'm not sure whether AI workloads will function as expected.

rothgar avatar Nov 24 '25 18:11 rothgar

Hey @rothgar @shanduur, will try and let you know how it goes! Just noticed the release

0x77dev avatar Nov 24 '25 18:11 0x77dev

Mine is currently functional on 1.12.0-beta.0 with the following additions, after building with this profile:

profile.yaml:

# Talos Imager Profile for dgx-spark node with NVIDIA GPU support
# This profile builds a custom installer image with all required extensions
# for Talos v1.12.0-beta.0 until factory images become available

arch: arm64
platform: metal
secureboot: false
version: v1.12.0-beta.0

input:
  kernel:
    path: /usr/install/arm64/vmlinuz
  initramfs:
    path: /usr/install/arm64/initramfs.xz
  baseInstaller:
    imageRef: ghcr.io/siderolabs/installer:v1.12.0-beta.0

  # System extensions to bundle into the installer
  systemExtensions:
    - imageRef: ghcr.io/siderolabs/iscsi-tools:v0.2.0@sha256:885ff85993e01853e47b1045a8a939ec8510bf7166b8a9da5fd2b8dd94721314
    - imageRef: ghcr.io/siderolabs/util-linux-tools:2.41.2@sha256:c16811b18a32582fcacb08c32db9265c4ba0d3898e19f367799695890539f816
    - imageRef: ghcr.io/siderolabs/binfmt-misc:v1.12.0-beta.0@sha256:093447027eac366ac9475819b3c914254ee1dfc9d80dd2d2550c92c1bcf7d3ca
    - imageRef: ghcr.io/siderolabs/nonfree-kmod-nvidia-lts:580.95.05-v1.12.0-beta.0@sha256:0931da72620cbc3003b59e0e15ca7cc3f5c6fd994edacb6ee9baf77be337bfe0
    - imageRef: ghcr.io/siderolabs/nvidia-container-toolkit-lts:580.95.05-v1.18.0@sha256:f003332c379c5c544c5bb55feb9a05f08a19eec8266b887d9e1e6d2b1a6dcde4

output:
  kind: installer
  outFormat: raw
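
The extension references in this profile are pinned by sha256 digest, while the installer image in the earlier config patch uses a tag only. A small sketch for checking which references are pinned, assuming the usual name:tag@digest layout; the helpers split_image_ref and unpinned are hypothetical names, not part of any Talos tooling.

```python
def split_image_ref(ref: str) -> tuple[str, str, str]:
    """Split an image reference into (repository, tag, digest).

    Missing parts are returned as empty strings.
    """
    name, _, digest = ref.partition("@")
    repo, sep, tag = name.rpartition(":")
    # A ':' followed by a '/' belongs to a registry port, not a tag.
    if not sep or "/" in tag:
        repo, tag = name, ""
    return repo, tag, digest


def unpinned(refs: list[str]) -> list[str]:
    """Return the references that carry no digest."""
    return [r for r in refs if not split_image_ref(r)[2]]
```

Running unpinned over the imageRef values of a profile lists the entries that could change underneath a rebuild.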

Node patch additions:

machine:
  kernel:
      # Kernel modules to load.
      modules:
        - name: nvidia
        - name: nvidia_uvm
        - name: nvidia_drm
        - name: nvidia_modeset

egallis31 avatar Nov 24 '25 19:11 egallis31

@egallis31 what is the output of talosctl services on the node with this build?

rothgar avatar Nov 24 '25 19:11 rothgar

@rothgar here is the output of talosctl services.

I've had a known error with nvidia-persistenced since installation, but it has not affected GPU usage or stability so far. Ideally this would be resolved as well.

NODE        SERVICE                   STATE     HEALTH   LAST CHANGE      LAST EVENT
10.2.2.96   apid                      Running   OK       133h27m38s ago   Health check successful
10.2.2.96   auditd                    Running   OK       133h27m50s ago   Health check successful
10.2.2.96   containerd                Running   OK       133h27m50s ago   Health check successful
10.2.2.96   cri                       Running   OK       133h27m38s ago   Health check successful
10.2.2.96   dashboard                 Running   ?        133h27m47s ago   Process Process(["/sbin/dashboard"]) started with PID 6032
10.2.2.96   ext-iscsid                Running   ?        133h27m38s ago   Started task ext-iscsid (PID 6341) for container ext-iscsid
10.2.2.96   ext-nvidia-persistenced   Waiting   ?        4s ago           Error running Containerd(ext-nvidia-persistenced), going to restart forever: task "ext-nvidia-persistenced" failed: exit code 1 (last log "2025/11/24 19:55:06 nvidia-persistenced-wrapper: error starting nvidia-persistenced: fork/exec /usr/local/bin/nvidia-persistenced: no such file or directory")
10.2.2.96   kubelet                   Running   OK       133h27m36s ago   Health check successful
10.2.2.96   machined                  Running   OK       133h27m50s ago   Health check successful
10.2.2.96   syslogd                   Running   OK       133h27m49s ago   Health check successful
10.2.2.96   udevd                     Running   OK       133h27m48s ago   Health check successful
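
To spot services like this across several nodes, the table can be filtered for anything not in the Running state. The sketch below assumes the column layout shown above, with columns separated by two or more spaces. The function name not_running is hypothetical, and the text format of talosctl services may differ between releases.

```python
import re


def not_running(services_output: str) -> list[tuple[str, str]]:
    """Return (service, state) pairs for rows whose STATE is not Running."""
    rows = []
    # Skip the header line, then split each row on runs of 2+ spaces.
    for line in services_output.strip().splitlines()[1:]:
        cols = re.split(r"\s{2,}", line.strip(), maxsplit=5)
        if len(cols) >= 3 and cols[2] != "Running":
            rows.append((cols[1], cols[2]))
    return rows
```

Applied to the output above, only the ext-nvidia-persistenced row is in a state other than Running.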

egallis31 avatar Nov 24 '25 19:11 egallis31

Yep, that error is exactly what I've seen. The GPU still loads and is available, but I have seen Talos reboot and roll back after an upgrade because the service never starts. I think this blocks Talos from becoming healthy/ready, so it tries to recover to the old partition.

I'm not certain that's what is happening, but that's my assumption.

rothgar avatar Nov 24 '25 21:11 rothgar

I also tried bundling the glibc extension to resolve the issue, but nothing changed.

egallis31 avatar Nov 24 '25 21:11 egallis31

Just checking: will the 1.12 release fix the problems mentioned in this issue out of the box, and is that planned? We're currently blocked on clustering these devices with 1.11. Thanks :)

mchaker avatar Dec 17 '25 22:12 mchaker

Yes, the Spark works out of the box with Talos 1.12. You need to add the system extensions and patch the machine config to load the kernel modules, as you would for any NVIDIA hardware.

rothgar avatar Dec 17 '25 23:12 rothgar