[WIP] feat: add tenstorrent package

Open rothgar opened this issue 8 months ago • 19 comments

Add kernel module for tenstorrent hardware.

I'm a bit stuck on this because the make build isn't working. I'm still figuring out what I'm missing, so feedback is welcome.

rothgar avatar Apr 11 '25 20:04 rothgar

I think I have something wrong with my build cache. I tried adding TARGET_ARGS='--no-cache' but I still get an error about missing /tmp/build directory.

 => ERROR tenstorrent:build-0
------
 > tenstorrent:build-0:
0.063 make -C /lib/modules/6.13.6-200.fc41.x86_64/build M=/tmp/build modules
0.063 make[1]: Entering directory '/tmp/build'
0.063 make[1]: Leaving directory '/tmp/build'
0.063 make[1]: *** /lib/modules/6.13.6-200.fc41.x86_64/build: No such file or directory.  Stop.
0.063 make: *** [Makefile:15: modules] Error 2

rothgar avatar Apr 11 '25 21:04 rothgar

I think this fc41 is hardcoded somewhere it shouldn't be. Why is it Fedora Core?

smira avatar Apr 14 '25 08:04 smira

I'm going to go out on a limb here: you are attempting to build on an F41 machine without having the corresponding kernel-devel package installed?

if you are on F41: rpm -qa | grep "^kernel-devel" and you'll see what you've got.
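
The make error above comes from `make -C /lib/modules/<release>/build` pointing at a directory that does not exist. A minimal sketch for checking that directory directly follows; `check_kdir` is a hypothetical helper name, not part of tt-kmd or the Talos tooling.

```shell
# Hypothetical helper (not part of the tt-kmd or Talos tooling): report
# whether the kernel build tree that the module Makefile points at exists.
check_kdir() {
  kdir="/lib/modules/$1/build"
  if [ -d "$kdir" ]; then
    echo "found: $kdir"
  else
    echo "missing: $kdir"
  fi
}

# Check the build tree for the kernel this shell is running on.
check_kdir "$(uname -r)"
```

If this reports the tree as missing inside the build container, the Makefile is most likely picking up the host's kernel release rather than the one being packaged.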

warthog9 avatar Apr 15 '25 16:04 warthog9

probably need to set KDIR https://github.com/tenstorrent/tt-kmd/blob/main/Makefile#L7

frezbo avatar Apr 15 '25 18:04 frezbo

probably need to set KDIR tenstorrent/tt-kmd@main/Makefile#L7

Is there a common KDIR we use for packages? I don't see it in other pkg configs

rothgar avatar Apr 15 '25 22:04 rothgar

probably need to set KDIR tenstorrent/tt-kmd@main/Makefile#L7

Is there a common KDIR we use for packages? I don't see it in other pkg configs

It should be something along the lines of /rootfs/usr/lib/modules/$(cat /src/include/config/kernel.release)
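
A rough sketch of deriving that path in a build step, assuming the layout from the comment above. `kdir_for`, `SRC_DIR`, and `ROOTFS` are hypothetical names introduced here so the snippet can be tried outside the build container; whether a trailing `/build` is needed is not stated in the thread.

```shell
# Assumed defaults taken from the comment above; override them to test locally.
SRC_DIR="${SRC_DIR:-/src}"
ROOTFS="${ROOTFS:-/rootfs}"

# Print the modules directory for the kernel release recorded in the source tree.
# $1 = kernel source dir, $2 = rootfs prefix.
kdir_for() {
  release_file="$1/include/config/kernel.release"
  [ -r "$release_file" ] || { echo "cannot read $release_file" >&2; return 1; }
  echo "$2/usr/lib/modules/$(cat "$release_file")"
}

# Possible usage in the build step (not executed here):
#   make KDIR="$(kdir_for "$SRC_DIR" "$ROOTFS")" modules
```

Reading the release from `kernel.release` keeps the package in step with whatever kernel it is built against, instead of hardcoding a version string.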

frezbo avatar Apr 16 '25 09:04 frezbo

The package builds properly now (I think) and it's pushed into my local registry. How do I build an ISO with the extension for testing on a physical machine?

From what I can tell in the docs I should be able to do something like

make kernel initramfs PKG_TENSTORRENT=127.0.0.1:5005/jgarr/tenstorrent:v1.7.0-alpha.0-243-g
make imager PUSH=true IMAGE_REGISTRY=127.0.0.1:5005 USERNAME=jgarr INSTALLER_ARCH=amd64 PLATFORM=linux/amd64
make installer PUSH=true IMAGE_REGISTRY=127.0.0.1:5005 USERNAME=jgarr
make iso IMAGE_REGISTRY=127.0.0.1:5005 USERNAME=jgarr

This fails on building the installer with the following error

make installer PUSH=true IMAGE_REGISTRY=127.0.0.1:5005 USERNAME=jgarr
make[1]: Entering directory '/var/home/jgarr/src/siderolabs/talos'
v1.10.0-alpha.3-47-g8cd3c8dc7: Pulling from jgarr/imager
30fed1bc580a: Pull complete
Digest: sha256:8c757d0dc1575931f7cb3e96a63a68739c52a4e07ea31169323231c7d8c282f8
Status: Downloaded newer image for 127.0.0.1:5005/jgarr/imager:v1.10.0-alpha.3-47-g8cd3c8dc7
127.0.0.1:5005/jgarr/imager:v1.10.0-alpha.3-47-g8cd3c8dc7
skipped pulling overlay (no overlay)
profile ready:
arch: amd64
platform: metal
secureboot: false
version: v1.10.0-alpha.3-47-g8cd3c8dc7
input:
  kernel:
    path: /usr/install/amd64/vmlinuz
  initramfs:
    path: /usr/install/amd64/initramfs.xz
  sdStub:
    path: /usr/install/amd64/systemd-stub.efi
  sdBoot:
    path: /usr/install/amd64/systemd-boot.efi
  baseInstaller:
    imageRef: 127.0.0.1:5005/jgarr/installer-base:v1.10.0-alpha.3-47-g8cd3c8dc7
output:
  kind: installer
  outFormat: raw
skipped initramfs rebuild (no system extensions)
kernel command line: talos.platform=metal console=tty0 init_on_alloc=1 slab_nomerge pti=on consoleblank=0 nvme_core.io_timeout=4294967295 printk.devkmsg=on ima_template=ima-ng ima_appraise=fix ima_hash=sha512 selinux=1
UKI ready
◲ error pulling image 127.0.0.1:5005/jgarr/installer-base:v1.10.0-alpha.3-47-g8cd3c8dc7: GET http://127.0.0.1:5005/v2/jgarr/installer-base/manifests/v1.10.0-alpha.3-47-g8cd3c8dc7: MANIFEST_UNKNOWN: manifest unknown; map[Tag:v1.10.0-alpha.3-47-g8cd3c8dc7]
Error: error pulling image 127.0.0.1:5005/jgarr/installer-base:v1.10.0-alpha.3-47-g8cd3c8dc7: GET http://127.0.0.1:5005/v2/jgarr/installer-base/manifests/v1.10.0-alpha.3-47-g8cd3c8dc7: MANIFEST_UNKNOWN: manifest unknown; map[Tag:v1.10.0-alpha.3-47-g8cd3c8dc7]
make[1]: *** [Makefile:454: image-installer] Error 1
make[1]: Leaving directory '/var/home/jgarr/src/siderolabs/talos'
make: *** [Makefile:475: installer] Error 2

rothgar avatar Apr 16 '25 21:04 rothgar

Actually, looking at the content of the container image that I built it doesn't look like the kernel modules were added to the container so I'm definitely missing something in the build process.

rothgar avatar Apr 16 '25 23:04 rothgar

The package builds properly now (I think) and it's pushed into my local registry. How do I build an ISO with the extension for testing on a physical machine?

From what I can tell in the docs I should be able to do something like

make kernel initramfs PKG_TENSTORRENT=127.0.0.1:5005/jgarr/tenstorrent:v1.7.0-alpha.0-243-g
make imager PUSH=true IMAGE_REGISTRY=127.0.0.1:5005 USERNAME=jgarr INSTALLER_ARCH=amd64 PLATFORM=linux/amd64
make installer PUSH=true IMAGE_REGISTRY=127.0.0.1:5005 USERNAME=jgarr
make iso IMAGE_REGISTRY=127.0.0.1:5005 USERNAME=jgarr

These seem to be the 1.9 docs. For 1.10, you need to run make installer-base before make installer.

Not sure what PKG_TENSTORRENT is supposed to mean in this context? (Do you have some changes you haven't shown us?)

How would a new module get into this build? It should be either packaged as a system extension, or Talos source should be modified to unconditionally include it.

Either way, a PKG_KERNEL should be in the mix to make Talos use your base Linux kernel/modules, which you want to mix with your custom extension.

smira avatar Apr 17 '25 09:04 smira

Does anyone know what modules.* files we actually need to include in the build? I'm not familiar with any of these files so I'm not sure which need to be included.

_out
├── etc
│   └── udev
│       └── rules.d
│           └── 50-tenstorrent.rules
└── usr
    └── lib
        └── modules
            └── 6.12.23-talos
                ├── extras
                │   └── tenstorrent.ko
                ├── modules.alias
                ├── modules.alias.bin
                ├── modules.builtin.alias.bin
                ├── modules.builtin.bin
                ├── modules.dep
                ├── modules.dep.bin
                ├── modules.devname
                ├── modules.softdep
                ├── modules.symbols
                ├── modules.symbols.bin
                └── modules.weakdep

All the other examples I found only included modules.order, modules.builtin, and modules.builtin.modinfo

rothgar avatar Apr 18 '25 22:04 rothgar

All the other examples I found only included modules.order, modules.builtin, and modules.builtin.modinfo

It doesn't really matter, as the modules database will be rebuilt when the extension is included into the final image.
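
For context, files such as modules.dep, modules.alias, and modules.symbols (and their .bin variants) are indexes that depmod regenerates, so the .ko file is the part of the tree that has to be right. A small sketch for separating the two in an output tree; `list_module_payload` and `list_module_indexes` are hypothetical helper names, and `_out` is the directory from the listing above.

```shell
# List the kernel module binaries under a tree (the actual payload).
list_module_payload() {
  find "$1" -type f -name '*.ko' | sort
}

# List the modules.* index/metadata files under a tree.
list_module_indexes() {
  find "$1" -type f -name 'modules.*' | sort
}

# Only inspect _out when it exists, so the sketch is safe to run anywhere.
if [ -d _out ]; then
  echo "payload:"; list_module_payload _out
  echo "indexes:"; list_module_indexes _out
fi
```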

smira avatar Apr 22 '25 08:04 smira

I'm going to document my steps here before I forget them.

I have this pkg working and the kernel module loads, but the tenstorrent card is not detected (or at least it doesn't show up in /dev/tenstorrent/*). I'm not sure exactly why.

I used this branch and built my kernel and tenstorrent pkg

make kernel tenstorrent REGISTRY=127.0.0.1:5005 PUSH=true PLATFORM=linux/amd64

I wrote down the 2 images that were pushed to my local registry

Then I went to the extensions repo with https://github.com/siderolabs/extensions/pull/670 and built the extension with

make tenstorrent REGISTRY=127.0.0.1:5005 PUSH=true PLATFORM=linux/amd64 \
   PKG_KERNEL=127.0.0.1:5005/jgarr/kernel:v1.7.0-alpha.0-250-g28491b7-dirty@sha256:039f24ff363517f0d49adef68b749ff2ccc43c19f587d881a7c7e65c9cfc9fb8

(the tenstorrent pkg image was added to pkg.yaml for the build)

Then I built an imager image

make imager PLATFORM=linux/amd64 INSTALLER_ARCH=amd64 PUSH=true REGISTRY=127.0.0.1:5005 \
   PKG_KERNEL=127.0.0.1:5005/jgarr/kernel:v1.7.0-alpha.0-250-g28491b7-dirty@sha256:039f24ff363517f0d49adef68b749ff2ccc43c19f587d881a7c7e65c9cfc9fb8

Then I created a profile image for imager to build an installer

# profile.yaml
arch: amd64
platform: metal
secureboot: false
version: v1.10.0
input:
  kernel:
    path: /usr/install/amd64/vmlinuz
  initramfs:
    path: /usr/install/amd64/initramfs.xz
  baseInstaller:
    imageRef: ghcr.io/siderolabs/installer:v1.10.0
  systemExtensions:
    - tarballPath: /tenstorrent.tar
output:
  kind: installer
  outFormat: raw

And I built it with

cat profile.yaml | docker run --rm -i \
  -v $PWD/_out:/out -v $PWD/tenstorrent.tar:/tenstorrent.tar \
  127.0.0.1:5005/jgarr/imager:v1.10.0-alpha.3-99-gb3b20eff3@sha256:36c005ce37908245238eb6a604a6dc05a504336d88dcf83dd3bf934847572e4c -

This spit out a _out/installer-amd64.tar file which I then imported into docker and pushed to a registry

docker load -i ./_out/installer-amd64.tar
docker tag ghcr.io/siderolabs/installer:v1.10.0 rothgar/tt-installer:v1.10.0
docker push rothgar/tt-installer:v1.10.0

Then I booted talos from a generic ISO and generated the config with

talosctl gen config --install-disk /dev/nvme0n1 \
  --install-image rothgar/tt-installer:v1.10.0 \
  mini https://192.168.7.40:6443

And created a patch with

machine:
  kernel:
    modules:
      - name: tenstorrent

Then I applied the install

talosctl apply -f controlplane.yaml -i -p '@tenstorrent.yaml' -n 192.168.7.40
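
The patch file passed as '@tenstorrent.yaml' can be written from the shell with a heredoc. This is only a convenience sketch; `PATCH_FILE` is a variable introduced here, and the YAML is the patch shown above.

```shell
# Write the kernel-module patch to a file; defaults to the name used in the
# talosctl apply command above. PATCH_FILE is an illustrative variable.
PATCH_FILE="${PATCH_FILE:-tenstorrent.yaml}"

cat > "$PATCH_FILE" <<'EOF'
machine:
  kernel:
    modules:
      - name: tenstorrent
EOF
```

The quoted 'EOF' delimiter keeps the shell from expanding anything inside the YAML.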

And I was able to see the kernel module is loaded

192.168.7.40: kern: warning: [2025-05-02T21:36:23.560178617Z]: tenstorrent: loading out-of-tree module taints kernel.
192.168.7.40: kern:    info: [2025-05-02T21:36:23.566441617Z]: Loading Tenstorrent AI driver module v1.33.
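
A small sketch for checking for that log line in a script rather than by eye. `driver_loaded` is a hypothetical helper that reads kernel log text on stdin; the match string is copied from the log line above and may differ in other driver versions.

```shell
# Succeeds when the driver's load message appears in the log text on stdin.
driver_loaded() {
  grep -q 'Loading Tenstorrent AI driver module'
}

# Demonstration with a sample line shaped like the output above.
printf '%s\n' 'kern: info: Loading Tenstorrent AI driver module v1.33.' \
  | driver_loaded && echo "driver loaded"

# Possible usage against a node (not executed here):
#   talosctl -n 192.168.7.40 dmesg | driver_loaded
```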

This is my first time trying to build a package and extension. Please let me know if I did any of these steps wrong or there is a way to do this with fewer steps.

rothgar avatar May 02 '25 22:05 rothgar

Then I created a profile image for imager to build an installer

From here, you just need to run make installer with PKG_KERNEL set. Well, build installer-base and imager first.
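
The order described here can be sketched as a dry run that only prints the commands. The PKG_KERNEL default below is a placeholder, not a real image reference, and passing PKG_KERNEL to every target is an assumption rather than something the thread confirms.

```shell
# Placeholder value; substitute the kernel image reference pushed earlier.
PKG_KERNEL="${PKG_KERNEL:-registry.example/kernel:tag}"

# Print the make invocations in the order suggested above, without running them.
print_build_order() {
  for target in installer-base imager installer; do
    echo "make $target PKG_KERNEL=$PKG_KERNEL"
  done
}

print_build_order
```

Other variables used earlier in the thread, such as REGISTRY, PUSH, and PLATFORM, would be appended to each command as needed.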

I have this pkg working and the kernel module loads, but the tenstorrent card is not detected (or at least it doesn't show up in /dev/tenstorrent/*).

Probably needs some udev rules

frezbo avatar May 03 '25 03:05 frezbo

Probably needs some udev rules

I'm including the udev rules from their repo. https://github.com/tenstorrent/tt-kmd/blob/main/udev-50-tenstorrent.rules Have suggestions on other rules I should look at to include?

rothgar avatar May 04 '25 01:05 rothgar

Probably needs some udev rules

I'm including the udev rules from their repo. https://github.com/tenstorrent/tt-kmd/blob/main/udev-50-tenstorrent.rules Have suggestions on other rules I should look at to include?

That should be the right one, where did you put that in the extension?

frezbo avatar May 04 '25 01:05 frezbo

In the extension it’s in /rootfs/etc/udev/rules.d/

rothgar avatar May 04 '25 01:05 rothgar

In the extension it’s in /rootfs/etc/udev/rules.d/

that seems correct

frezbo avatar May 04 '25 02:05 frezbo

Looks like my problem was power related. Bought a more powerful power supply and the device shows up now and looks to be working as intended.

talosctl get extensions
NODE           NAMESPACE   TYPE              ID            VERSION   NAME          VERSION
192.168.4.20   runtime     ExtensionStatus   0             1         tenstorrent   1.33
192.168.4.20   runtime     ExtensionStatus   modules.dep   1         modules.dep   6.12.25-talos
talosctl list /dev/tenstorrent
NODE           NAME
192.168.4.20   .
192.168.4.20   0
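
A sketch for turning that listing into a device count, for example in a smoke test. `count_devices` is a hypothetical helper; it assumes the two-column NODE/NAME output shown above, skips the header row, and ignores the "." entry for the directory itself.

```shell
# Count device entries in `talosctl list /dev/tenstorrent` output read from stdin.
count_devices() {
  awk 'NR > 1 && $2 != "." { n++ } END { print n + 0 }'
}

# Demonstration using the listing from above.
count_devices <<'EOF'
NODE           NAME
192.168.4.20   .
192.168.4.20   0
EOF

# Possible usage against a node (not executed here):
#   talosctl -n 192.168.4.20 list /dev/tenstorrent | count_devices
```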

They had a newer release so I'm going to bump the version in the package and test with the latest version of Talos. Is there anything else I should update before we merge it?

rothgar avatar May 19 '25 23:05 rothgar

Cool, it mostly seems good. The rest can be fixed up once it's out of draft.

frezbo avatar May 20 '25 03:05 frezbo

/m

frezbo avatar May 21 '25 04:05 frezbo