[Feature Request] DGX Spark Support
Feature Request
Talos Linux support for NVIDIA DGX Spark.
Summary
~~When building Talos Linux ARM64 images with bootloader: grub, the imager places a PE32+ EFI executable (UKI) at /boot/vmlinuz instead of a raw ARM64 kernel. This causes GRUB's linux command to fail, as it cannot execute EFI applications.~~
~~Request: When bootloader: grub is specified, the imager should extract and place the raw kernel from the UKI so GRUB can boot it.~~
Update: Confirmed UKI works on v1.11.5, but still experiencing two issues, with USB and onboard Ethernet.
Hardware Platform
NVIDIA DGX Spark
The DGX Spark is NVIDIA's desktop AI workstation based on the Grace Blackwell architecture:
System Architecture:
- SoC: NVIDIA GB10 Grace Blackwell (integrated CPU + GPU)
- CPU: 20-core ARM64 (10× Cortex-X925 performance cores + 10× Cortex-A725 efficiency cores)
- Memory: 128GB unified LPDDR5X (shared between CPU and GPU)
- GPU: Blackwell architecture with 5th gen Tensor cores
- Storage: Samsung NVMe PCIe 4.0
- Network: 4× Mellanox ConnectX-7 (MT2910 Family)
- 2× 100GbE QSFP ports (RoCE capable)
- 1× 10GbE RJ45 (Realtek r8152)
- 1× Wi-Fi 7 (MediaTek MT7925E)
Boot Environment:
- Firmware: AMI UEFI (ARM64)
- Current OS: DGX OS (Ubuntu 24.04-based) boots via: shimaa64.efi → grubaa64.efi → kernel
- Secure Boot: Disabled for testing
Console Access:
- Serial: UART at MMIO address 0x16A00000, 921600 baud
- Early console: earlycon=uart,mmio32,0x16A00000
- Runtime: console=ttyS0,921600
Problem Description
Outdated: Boot Issues
Attempt 1: Standard Talos v1.11.3 ARM64 ISO
Downloaded metal-arm64.iso from Image Factory (default schematic).
Boot structure:
/EFI/BOOT/BOOTAA64.EFI
/EFI/Linux/Talos-v1.11.3.efi
Result: DGX Spark UEFI firmware did not recognize or attempt to boot from the USB device.
Attempt 2: Custom v1.11.3 ISO with GRUB
Built custom ISO with explicit GRUB bootloader:
Imager profile:
arch: arm64
platform: metal
version: v1.11.3
output:
  kind: iso
  imageOptions:
    bootloader: grub
customization:
  extraKernelArgs:
    - earlycon=uart,mmio32,0x16A00000
    - console=ttyS0,921600
    - console=tty0
    - init_on_alloc=0
Build command:
cat profile.yaml | docker run --rm -i \
-v $PWD:/out \
-v /dev:/dev --privileged \
ghcr.io/siderolabs/imager:v1.11.3 \
-
Result:
- ISO built (267MB)
- GRUB menu appeared on DGX Spark (first progress!)
- Selecting "Talos ISO" failed with:
../src/boot/boot.c:2560@ image_start: Error loading \EFI\Linux\Talos-v1.11.3.efi: Not found
The error path src/boot/boot.c is from systemd-boot source, suggesting the PE32+ executable contains systemd-boot stub code.
Attempt 3: Test v1.9.3 (when GRUB was default)
Tested v1.9.3 with the same GRUB profile to see if behavior changed between versions.
Result: Same issue - /boot/vmlinuz is still PE32+ executable.
Attempt 4: Disk Image Build
Also tried kind: image with bootloader: grub:
Result:
- Created partitions (EFI, BIOS, BOOT, META)
- Installed GRUB with grub-install --removable --target=arm64-efi
- Same PE32+ kernel at /A/vmlinuz
- Build error: unsupported disk format: unknown
Build Process Observation
Even with bootloader: grub specified, the imager builds a UKI:
building UKI...
copying /usr/install/arm64/systemd-boot.efi to /tmp/imager.../systemd-boot.efi
building UKI...
assembling UKI
UKI ready
building ISO...
copying /usr/install/arm64/vmlinuz to /tmp/imager.../iso/boot/vmlinuz
The file /usr/install/arm64/vmlinuz in the installer-base container appears to be the UKI, and the imager copies it directly to the boot path where GRUB expects a raw kernel.
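A header check can help tell a UKI apart from a raw arm64 kernel. `file` alone may not be conclusive here, because an arm64 kernel built with the EFI stub also begins with the `MZ` signature. The sketch below looks for the arm64 Image magic `ARM\x64` at byte offset 56, as described in the kernel's arm64 booting documentation. The helper name `kernel_kind` is hypothetical, and the check assumes an uncompressed Image; anything else that starts with `MZ` is only reported as `pe-other`.

```shell
# Sketch: classify a kernel file by its header bytes.
# Assumption: an uncompressed arm64 Image carries "ARM\x64" at byte offset 56.
kernel_kind() {
    file="$1"
    # First two bytes: "MZ" (4d 5a) marks a PE/COFF image, including EFI-stub kernels.
    mz=$(dd if="$file" bs=1 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    # Four bytes at offset 56: 41 52 4d 64 spells "ARM\x64".
    magic=$(dd if="$file" bs=1 skip=56 count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "41524d64" ]; then
        echo "arm64-image"
    elif [ "$mz" = "4d5a" ]; then
        echo "pe-other"
    else
        echo "unknown"
    fi
}
```

Usage: `kernel_kind boot/vmlinuz` on the file copied out of the ISO.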
Attempted Workaround
Extracted raw kernel from UKI using dd:
dd if=boot/vmlinuz of=vmlinuz-raw bs=4096 skip=1 count=4768
file vmlinuz-raw # Shows as "data", 19MB
Not yet tested on hardware.
Feature Request
Maybe when output.imageOptions.bootloader is set to grub for ARM64:
- Extract raw kernel from the UKI's .linux section
- Place raw kernel at /boot/vmlinuz (ISO) or /A/vmlinuz (disk image)
- Configure GRUB to load this raw kernel
This would enable GRUB-based boot on ARM64 UEFI systems that don't properly support systemd-boot/UKI.
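For illustration, the extraction step could be done by section name rather than with fixed dd offsets. This is a minimal sketch, not how the imager works: it assumes an objcopy that understands AArch64 PE files (a recent GNU binutils or llvm-objcopy), and the function name `extract_uki_kernel` and the `OBJCOPY` override variable are hypothetical.

```shell
# Sketch: copy the .linux section out of a UKI into a standalone file.
# Assumption: the installed objcopy can read AArch64 PE/COFF input.
extract_uki_kernel() {
    uki="$1"
    out="$2"
    if [ ! -r "$uki" ]; then
        echo "cannot read UKI: $uki" >&2
        return 1
    fi
    # -O binary writes the raw section contents without any container format.
    "${OBJCOPY:-objcopy}" -O binary --only-section=.linux "$uki" "$out"
}
```

Usage: `extract_uki_kernel boot/vmlinuz vmlinuz-raw`. Unlike the dd approach, this does not depend on where the section happens to sit in a particular build.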
Testing Offer
I have access to 3x DGX Spark and can test any fixes:
- Serial port access
- Remote access via PiKVM
- Multiple boot methods (USB, virtual media, NVMe installation)
- Hardware compatibility (ConnectX-7, NVMe, etc.)
Happy to iterate on testing and provide detailed feedback.
Let me know if you need any additional context/info or logs!
For Reference: NVIDIA Recovery Media
NVIDIA provides recovery media (dgx-spark-recovery-image-1.91.51-1.tar.gz) containing a minimal boot environment ("FastOS") that successfully boots on DGX Spark via UEFI and flashes the OS to internal storage.
Structure observed:
/efi/boot/
├── bootaa64.efi
├── grubaa64.efi
└── mmaa64.efi
/boot/
/vmlinuz # Presumably raw kernel
/initrd
/fw/ # Firmware blobs
fastos.partab # Partition table template
NVIDIA's recovery media boots successfully on the same UEFI firmware using GRUB. Examining how their kernel is packaged (likely raw, not UKI) could inform the proper approach for Talos GRUB images on ARM64.
FYI, @rothgar successfully booted normal UKI images.
I just tried the 1.11.5 arm64 ISO from image factory (with secure boot), and it worked perfectly.
I apologize for wasting your time and attention. Thank you!
Reopening the issue: I'm experiencing two problems at the moment (running v1.11.5):
- The onboard 10GbE RJ45 (based on Realtek r8152) is not picked up by Talos. I suspect the kernel does not yet include the updated driver: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=f24f7b2f3af9e008ded20f804d7829ee2efd43f2
- USB ports do not work post-boot. I can't use a mouse or keyboard via PiKVM, directly, or through a USB hub.
Side notes for those trying to get Talos on Spark:
- When using PiKVM, do not enable Flash mode for Mass Storage Device emulation, leave it as CD/DVD. The .iso won't boot and will display the error I initially reported when opening this issue.
- PXE via booter works without any issues when:
  - Secure Boot is disabled in the BIOS
  - PXE and Network Stack are enabled
  - Static configuration or DHCP is set for PXE
Screenshot of v1.11.5 booting, confirming UKI works.
Got 10GbE ethernet working!
- https://github.com/siderolabs/pkgs/pull/1367
- https://github.com/siderolabs/extensions/pull/877
Can build an .iso like this:
docker run --rm -t \
-v $PWD/_out:/out \
ghcr.io/0x77dev/talos-imager-dgx-spark:v1.11.5 iso \
--arch arm64 \
--system-extension-image ghcr.io/0x77dev/realtek-r8127:11.015.00-v1.11.0-29-gaee690b-dirty@sha256:5936f78831198dab253cebe1e72ac9f5ed66cd51786644ca270106919208d080
And a config patch (necessary if you plan to install):
machine:
  install:
    # Use custom installer with RTL8127 and NVIDIA extensions baked in
    image: ghcr.io/0x77dev/installer-dgx-spark:v1.11.5
    # DGX Spark console configuration
    extraKernelArgs:
      - earlycon=uart,mmio32,0x16A00000
      - console=ttyS0,921600
      - console=tty0
Thanks for the realtek packages. That was probably the last missing piece so network doesn't cut out when the machine reboots.
I don't think so, this is targeting 1.11.
1.12 alpha 2 already uses a Linux kernel where this driver comes as standard (https://www.phoronix.com/news/Linux-6.16-Realtek-RTL8127A), so neither of these is needed.
@danacr – https://github.com/siderolabs/pkgs/pull/1367#issuecomment-3512937496
Hi @0x77dev! Have you tried our 1.12.0-beta.0 build? The new version contains the Ethernet driver and might also fix the USB issues. I suspect those are caused by a missing driver as well.
The machine boots with 1.12.0-beta.0, but the nvidia persistenced service fails to start. The kernel modules load but I'm not sure if AI workloads will function as expected.
Hey @rothgar @shanduur, will try and let you know how it goes! Just noticed the release
Mine is currently functional on 1.12.0-beta.0 with the following additions, after building with this profile:
profile.yaml:
# Talos Imager Profile for dgx-spark node with NVIDIA GPU support
# This profile builds a custom installer image with all required extensions
# for Talos v1.12.0-beta.0 until factory images become available
arch: arm64
platform: metal
secureboot: false
version: v1.12.0-beta.0
input:
  kernel:
    path: /usr/install/arm64/vmlinuz
  initramfs:
    path: /usr/install/arm64/initramfs.xz
  baseInstaller:
    imageRef: ghcr.io/siderolabs/installer:v1.12.0-beta.0
  # System extensions to bundle into the installer
  systemExtensions:
    - imageRef: ghcr.io/siderolabs/iscsi-tools:v0.2.0@sha256:885ff85993e01853e47b1045a8a939ec8510bf7166b8a9da5fd2b8dd94721314
    - imageRef: ghcr.io/siderolabs/util-linux-tools:2.41.2@sha256:c16811b18a32582fcacb08c32db9265c4ba0d3898e19f367799695890539f816
    - imageRef: ghcr.io/siderolabs/binfmt-misc:v1.12.0-beta.0@sha256:093447027eac366ac9475819b3c914254ee1dfc9d80dd2d2550c92c1bcf7d3ca
    - imageRef: ghcr.io/siderolabs/nonfree-kmod-nvidia-lts:580.95.05-v1.12.0-beta.0@sha256:0931da72620cbc3003b59e0e15ca7cc3f5c6fd994edacb6ee9baf77be337bfe0
    - imageRef: ghcr.io/siderolabs/nvidia-container-toolkit-lts:580.95.05-v1.18.0@sha256:f003332c379c5c544c5bb55feb9a05f08a19eec8266b887d9e1e6d2b1a6dcde4
output:
  kind: installer
  outFormat: raw
Node patch additions:
machine:
  kernel:
    # Kernel modules to load.
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
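After applying the patch, it can be useful to confirm that all four modules actually loaded. The helper below is a rough sketch (the name `missing_nvidia_modules` is hypothetical): it reads /proc/modules-style text on stdin and prints any of the expected modules that are absent.

```shell
# Sketch: report which of the expected NVIDIA modules are not loaded.
# Input: /proc/modules-style text on stdin (module name is the first field).
missing_nvidia_modules() {
    loaded=$(awk '{print $1}')
    for mod in nvidia nvidia_uvm nvidia_drm nvidia_modeset; do
        # -x matches whole lines, so "nvidia" does not match "nvidia_uvm".
        if ! printf '%s\n' "$loaded" | grep -qx "$mod"; then
            echo "$mod"
        fi
    done
}
```

Usage, assuming the node address from this thread: `talosctl -n 10.2.2.96 read /proc/modules | missing_nvidia_modules`. Empty output means all four modules are present.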
@egallis31 what is the output of talosctl services on the node with this build?
@rothgar output for talosctl services
I've had a known error with nvidia-persistenced since installation, but it has not affected GPU usage or stability yet. Ideally this would be resolved as well.
NODE        SERVICE                   STATE     HEALTH   LAST CHANGE      LAST EVENT
10.2.2.96   apid                      Running   OK       133h27m38s ago   Health check successful
10.2.2.96   auditd                    Running   OK       133h27m50s ago   Health check successful
10.2.2.96   containerd                Running   OK       133h27m50s ago   Health check successful
10.2.2.96   cri                       Running   OK       133h27m38s ago   Health check successful
10.2.2.96   dashboard                 Running   ?        133h27m47s ago   Process Process(["/sbin/dashboard"]) started with PID 6032
10.2.2.96   ext-iscsid                Running   ?        133h27m38s ago   Started task ext-iscsid (PID 6341) for container ext-iscsid
10.2.2.96   ext-nvidia-persistenced   Waiting   ?        4s ago           Error running Containerd(ext-nvidia-persistenced), going to restart forever: task "ext-nvidia-persistenced" failed: exit code 1 (last log "2025/11/24 19:55:06 nvidia-persistenced-wrapper: error starting nvidia-persistenced: fork/exec /usr/local/bin/nvidia-persistenced: no such file or directory")
10.2.2.96   kubelet                   Running   OK       133h27m36s ago   Health check successful
10.2.2.96   machined                  Running   OK       133h27m50s ago   Health check successful
10.2.2.96   syslogd                   Running   OK       133h27m49s ago   Health check successful
10.2.2.96   udevd                     Running   OK       133h27m48s ago   Health check successful
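To spot a stuck service quickly in output like this, the listing can be filtered on the STATE column. This is a small sketch that assumes the column order shown above (NODE, SERVICE, STATE, ...); the function name `unhealthy_services` is hypothetical.

```shell
# Sketch: list services whose STATE column is not "Running".
# Assumes whitespace-separated columns: NODE SERVICE STATE HEALTH ...
unhealthy_services() {
    # NR > 1 skips the header row; $2 is the service name, $3 its state.
    awk 'NR > 1 && $3 != "Running" { print $2 " " $3 }'
}
```

Usage: `talosctl -n 10.2.2.96 services | unhealthy_services`.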
Yep, that error is exactly what I've seen. The GPU still loads and is available, but I have seen Talos reboot and roll back after an upgrade because the service never starts. I think this blocks Talos from becoming healthy/ready, so it tries to recover to the old partition.
I'm not exactly sure that's what's happening, but that's what I assume to be the case.
I also tried bundling the glibc extension to resolve the issue, but it made no difference.
Just checking: will the 1.12 release work out of the box for the issues mentioned in this issue, and is that planned? We're currently blocked on clustering these devices with 1.11. Thanks :)
Yes, the Spark works out of the box with Talos 1.12. You need to add the system extensions and patch the machine config to load the kernel modules, as you would for any NVIDIA hardware.