Test x4 RTX5000 Ada 32GB on RPI5 (using CDI)
1 GPU? ... I could never. How about 4 GPUs? Is that good enough for you Jeff? Oh, and RTX 5000 ADA you say? 😜
Jokes aside... I'm fortunate to have a few things going my way that have allowed me to push the RPI5 to an extreme that I think at the moment is unmatched.
- Mostly untethered access to a collection of high end prosumer and datacenter NVIDIA GPUs (RTX5000 ADA, A100, etc.)
- Access to a unique CDI (Composable Disaggregated Infrastructure) solution -- in this case allowing multiple GPUs on an external PCIe switch to be composed to a single PCIe device attached to the RPI5. Think of it as similar to how a retimer/HBA card would work, but instead of extending the PCIe domain from the external PCIe switch all the way to the host, the PCIe domain is capped at both ends and devices are controlled through a software-based PCIe switch. This allows for an unparalleled amount of control and agility, and lets me get around hurdles that I think are currently impassable for anyone trying to use retimer cards + external PCIe switches with a Raspberry Pi.
Setup:
- Raspberry Pi 5 16GB
- Trixie, NVIDIA 580.95.05 + CUDA Toolkit 13.0.2
- m.2 to OCULINK running to Minisforum DEG1
- x4 NVIDIA RTX 5000 ADA (in graphics mode, otherwise BAR sizes are 64GB, which is too large for the MMIO of the Pi)
- CDI hardware + external PCIe Gen5x16 switch
Tweaks required:
- Minisforum DEG1 TGX to OFF https://www.jeffgeerling.com/blog/2025/not-all-oculink-egpu-docks-are-created-equal
- Force PCIe Gen3 x1
- Force MSI-X in device tree https://www.jeffgeerling.com/blog/2023/how-customize-dtb-device-tree-binary-on-raspberry-pi
- Mario's NVIDIA patch as outlined in https://www.jeffgeerling.com/blog/2025/nvidia-graphics-cards-work-on-pi-5-and-rockchip
- After bootup: Remove the 0001:00:00.0 Broadcom BCM2712 bridge and do a PCIe rescan. The Pi boots too fast for everything to enumerate properly on bootup, so not enough prefetchable memory gets allocated behind the BCM2712 bridge.
- Disable ACS via setpci, setting the ACS Control Register (offset 06 in the ACS capability) to 0000. This allows device-to-device EastWest traffic between GPUs to flow through the external PCIe switch instead of getting caught by ACS and routed NorthSouth through the Raspberry Pi (see the sketch below).
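A minimal sketch of those two fixups (the ACS BDF below is just a placeholder; the actual commands and reasoning are covered further down this thread):
echo 1 | sudo tee /sys/bus/pci/devices/0001:00:00.0/remove   # remove the bridge
echo 1 | sudo tee /sys/bus/pci/rescan                        # rescan so prefetchable memory gets re-allocated
sudo setpci -s 0001:02:00.0 ECAP_ACS+6.W=0000                # zero the ACS Control Register on each switch port that exposes one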
Data Dump
user@pcipi:~ $ cat /sys/firmware/devicetree/base/model
Raspberry Pi 5 Model B Rev 1.1
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX 5000 Ada Gene... Off | 00000001:03:00.0 Off | Off |
| 30% 37C P2 83W / 250W | 11815MiB / 32760MiB | 22% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX 5000 Ada Gene... Off | 00000001:04:00.0 Off | Off |
| 30% 40C P2 90W / 250W | 10987MiB / 32760MiB | 22% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA RTX 5000 Ada Gene... Off | 00000001:05:00.0 Off | Off |
| 30% 42C P2 84W / 250W | 10987MiB / 32760MiB | 23% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA RTX 5000 Ada Gene... Off | 00000001:06:00.0 Off | Off |
| 30% 44C P2 85W / 250W | 11685MiB / 32760MiB | 25% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
0001:06:00.0 3D controller: NVIDIA Corporation AD102GL [RTX 5000 Ada Generation] (rev a1)
Subsystem: NVIDIA Corporation Device 17fa
Flags: bus master, fast devsel, latency 0, IRQ 177
Memory at 1b83000000 (32-bit, non-prefetchable) [size=16M]
Memory at 1850000000 (64-bit, prefetchable) [size=256M]
Memory at 1848000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, IntMsgNum 0
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] Secondary PCI Express
Capabilities: [c1c] Physical Layer 16.0 GT/s <?>
Capabilities: [d00] Lane Margining at the Receiver
Capabilities: [e00] Data Link Feature <?>
Kernel driver in use: nvidia
Kernel modules: nvidia_drm, nvidia
llama.cpp + CUDA + Optimizations:
Decided to try with Llama-3.3-70B-Instruct-Q4_K_M.gguf as a proving point that multi-GPU is in fact working. The main requirement here was the --no-mmap flag, to avoid trying to map a 43GB model into the limited memory of the Pi. The model takes ~10 minutes to load, I believe due at least in part to the microSD storage bottleneck.
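As a hypothetical example of what a plain generation run looks like with mmap disabled (not the exact command I used; the flags are standard llama.cpp options):
# --no-mmap reads the 43GB of weights instead of memory-mapping the file into the Pi's limited RAM; -ngl 999 offloads all layers to the GPUs
./build/bin/llama-cli -m models/Llama-3.3-70B-Instruct-Q4_K_M.gguf --no-mmap -ngl 999 -p "Hello from a Pi with 4 GPUs"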
For comparison, I also have an identical setup where the x4 RTX 5000 ADA are physically inside an Intel-based server running Ubuntu 22.04. The model there loads in 30 seconds.
Using:
./build/bin/llama-bench -p 512,1024,4096,8192 -n 512,1024,4096,8192 -m models/Llama-3.3-70B-Instruct-Q4_K_M.gguf --mmap 0 -ngl 999
Raspberry Pi:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
Device 1: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
Device 2: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
Device 3: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | pp512 | 675.23 ± 1.18 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | pp1024 | 969.67 ± 1.39 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | pp4096 | 1013.32 ± 1.08 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | pp8192 | 863.25 ± 0.53 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | tg512 | 11.82 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | tg1024 | 11.83 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | tg4096 | 11.66 ± 0.00 |
Intel Server (Control):
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
Device 1: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
Device 2: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
Device 3: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | pp512 | 838.20 ± 1.86 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | pp1024 | 1216.27 ± 1.66 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | pp4096 | 1255.30 ± 1.04 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | pp8192 | 1032.52 ± 1.35 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | tg512 | 12.00 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | tg1024 | 12.02 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 999 | 0 | tg4096 | 11.84 ± 0.00 |
(ran out of time to wait for tg8192 results to finish)
Conclusion
I'm honestly speechless at these results. Shocked I was able to get within a few percent of the same GPUs in a full-fledged high-end server. This is all possible because of the special CDI hardware that I have at my disposal. With the 4 GPUs sitting on an external PCIe Gen5 switch and able to send EastWest traffic between themselves, the Pi is essentially doing nothing.
What should I do with this setup next? Any suggestions? Benchmarks?
Forgot to attach some photos
Raspberry Pi + CDI card
External PCIe switch in a chassis + CDI card + 4 RTX 5000 Ada GPUs
The card on the Minisforum dock and the card on the PCIe switch in the chassis are connected optically through the green MTP cables.
Hahaha this is absolutely my kind of mad.
I'm actually working on a similar setup (but much less powerful, just aiming for two GPUs), using a Dolphin PCIe switch externally... haven't had time to get it going yet, but with all the other successes, I was hoping it'd be this simple. And it looks like it is, if your results are anything to go by.
One thing to test: Do you have a USB 3.0 SSD or a fast USB3 SSD-based thumb drive like the SanDisk Extreme PRO that I use? If so, it's like 20x faster than microSD. Extremely handy for compilation, copying models, etc. And should give you 250-300 MB/sec throughput versus 50-100 MB/sec at best on any premium microSD card.
Are you okay with me mentioning your setup in a video that I'm going to be posting about the multi-GPU setup? And if you'd like and are willing, you could send over a couple video clips of it running and I could put that in as well.
Nice, and here's testing an A4000 + A400, inspired by your setup...
jgeerling@cm5:~ $ nvidia-smi
Sat Dec 6 13:13:39 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A400 Off | 00000001:03:00.0 Off | N/A |
| 30% 27C P8 N/A / 50W | 1MiB / 4094MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX A4000 Off | 00000001:04:00.0 Off | Off |
| 41% 26C P8 4W / 140W | 1MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I'm also able to get a 4070 Ti + RTX A4000, but for some reason I can't pair up the RTX 3060 with any of these cards, it doesn't show when I have it in either slot on the Dolphin board.
Haven't run anything on these setups yet but hopefully that's not too difficult.
@mpsparrow - Ah...
After bootup:
- Remove the 0001:00:00.0 Broadcom BCM2712 bridge and do a PCIe rescan. The Pi boots too fast for everything to enumerate properly, so not enough prefetchable memory gets allocated behind the BCM2712 bridge.
- Disable ACS via setpci, setting the ACS Control Register (offset 06 in the ACS capability) to 0000. This allows device-to-device EastWest traffic between GPUs to flow through the external PCIe switch instead of getting caught by ACS and routed NorthSouth through the Raspberry Pi.
Maybe that's the issue on the RTX 3060. I may test that later.
For disabling ACS, do you have the commands you used? Would like to document that, I haven't ever tried getting the devices going through the switch itself, but that was also something the rep from Dolphin told me to try.
One thing to test: Do you have a USB 3.0 SSD or a fast USB3 SSD-based thumb drive like the SanDisk Extreme PRO that I use? If so, it's like 20x faster than microSD. Extremely handy for compilation, copying models, etc. And should give you 250-300 MB/sec throughput versus 50-100 MB/sec at best on any premium microSD card.
Thanks for the suggestion. I did briefly try a SATA SSD using a SATA to USB adapter, but was getting fdisk formatting errors. You've reminded me that I have a SABRENT m.2 to USB enclosure and some WD SN850X laying around. Will try this next time I'm in the office.
Are you okay with me mentioning your setup in a video that I'm going to be posting about the multi-GPU setup? And if you'd like and are willing, you could send over a couple video clips of it running and I could put that in as well.
You have my full permission to do so. I'll capture some additional pictures/videos along with more details about the setup and Cerio CDI hardware and send it your way. I'm out of office this weekend, but expect this sometime Monday/Tuesday if that works for you.
Nice, and here's testing an A4000 + A400, inspired by your setup...
Lovely! Welcome to the multi-gpu Pi club 😃
I'm actually working on a similar setup (but much less powerful, just aiming for two GPUs), using a Dolphin PCIe switch externally...
Very curious what type of Dolphin hardware you are running with -- I'm not familiar with their product line. Not sure if you are able to share those details or if I have to wait for the next video 😉.
For disabling ACS, do you have the commands you used? Would like to document that, I haven't ever tried getting the devices going through the switch itself, but that was also something the rep from Dolphin told me to try.
To give a bit of an explanation: ACS (Access Control Services) is a security and isolation feature that will block or redirect TLPs depending on the existence and values of the ACS Control Register and ACS Capability Register for each device. This can cause peer-to-peer traffic (GPU to GPU in our case) to get routed upstream through the root complex instead of staying in the PCIe switch. It doesn't appear that the Raspberry Pi has ACS capabilities on any of its devices, so the only devices that should need special treatment are any Dolphin/Cerio CDI/external switch hardware we have attached to the Pi.
In simple terms, the peer to peer traffic will look like:
ACS=enabled: GPU <-> PCIe switch <-> root complex <-> PCIe switch <-> GPU
ACS=disabled: GPU <-> PCIe switch <-> GPU
Getting routed to the root complex means pushing all traffic NorthSouth through the PCIe Gen3 x1 Pi link, which is obviously bad for performance and adds a large latency component. For this reason, ACS=disabled is the only viable way to get truly good performance out of such a multi-gpu setup.
For checking the ACS values -- in a sudo lspci -vv, each device supporting ACS should have ACSCap (ACS Capability Register) and ACSCtl (ACS Control Register). These are the devices we want to disable ACS on. Setting the ACSCtl to 0000 is enough to force ACS off for the given device.
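One quick way to list just those registers for every device (the same filter used later in this thread):
sudo lspci -vv | grep -E 'ACSCap|ACSCtl'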
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Referencing the PCIe spec for the register and offset:
Read the current ACS Control Register value for a given BDF:
sudo setpci -s XXXX:XX:XX.X ECAP_ACS+6.W
Set the ACS Control Register value to 0000 for a given BDF (ACS=disabled):
sudo setpci -s XXXX:XX:XX.X ECAP_ACS+6.W=0000
These values will reset on a system reboot.
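If you want to script it, here's a rough sketch (not something from my actual setup scripts) that zeroes the ACS Control Register on every device that exposes one:
#!/bin/bash
# Loop over every PCI device (with domain), check for an ACS Control Register,
# and force ACS off so GPU peer-to-peer traffic stays on the external switch.
# Needs to be re-run after every reboot since the registers reset.
for dev in $(lspci -D | awk '{print $1}'); do
  if sudo lspci -s "$dev" -vv 2>/dev/null | grep -q ACSCtl; then
    echo "Disabling ACS on $dev"
    sudo setpci -s "$dev" ECAP_ACS+6.W=0000
  fi
done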
It is useful to benchmark the device-to-device performance after ACS=disabled to confirm the speeds are in-line with the speed of the PCIe switch connecting the GPUs. nvbandwidth is probably best for this using a test like device_to_device_memcpy_write_ce (conveniently outlined in the usage section of the README).
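For reference, the test invocation (once nvbandwidth is built) is simply:
./nvbandwidth -t device_to_device_memcpy_write_ce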
As an example with my setup:
ACS=enabled: ~700MB/s device to device (PCIe Gen3 x1 speeds from the Pi)
ACS=disabled: ~26500MB/s device to device (PCIe Gen4 x16 speeds, which is the max of the PCIe Gen4 GPUs)
I'm also able to get a 4070 Ti + RTX A4000, but for some reason I can't pair up the RTX 3060 with any of these cards, it doesn't show when I have it in either slot on the Dolphin board. ... Maybe that's the issue on the RTX 3060. I may test that later.
Interesting problem. Maybe related to my issues -- on first bootup I don't get enough Prefetchable memory behind bridge allocated on 0001:01:00.0, meaning the GPUs downstream don't get memory assigned, fail to function properly, and do NOT get attached to the NVIDIA driver.
My basic fix is that after bootup I run:
echo 1 | sudo tee /sys/bus/pci/devices/0001:01:00.0/remove
echo 1 | sudo tee /sys/bus/pci/rescan
This results in memory behind bridge getting correctly allocated:
0001:01:00.0 PCI bridge: Cerio Emulated PCIe Switch (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 37
Bus: primary=01, secondary=02, subordinate=06, sec-latency=0
Memory behind bridge: 80000000-83ffffff [size=64M] [32-bit]
Prefetchable memory behind bridge: 1800000000-185fffffff [size=1536M] [32-bit]
And downstream GPUs getting memory assigned:
0001:03:00.0 3D controller: NVIDIA Corporation AD102GL [RTX 5000 Ada Generation] (rev a1)
Subsystem: NVIDIA Corporation Device 17fa
Flags: bus master, fast devsel, latency 0, IRQ 174
Memory at 1b80000000 (32-bit, non-prefetchable) [size=16M]
Memory at 1800000000 (64-bit, prefetchable) [size=256M]
Memory at 1810000000 (64-bit, prefetchable) [size=32M]
You can tell a lot from dmesg and lspci in these situations. You'll see memory assignment errors and general NVIDIA barf for this type of issue.
I have an RTX 3070, but haven't had an opportunity to give it a spin yet. If this becomes a recurring problem I can try it on my setup to see if the issue can be replicated.
Figured out my external drive issues -- was using the 15W Raspberry Pi 4 power supply by accident and it was failing to supply adequate power to the USB ports. Surprised I didn't run into other issues because of this.
With that said, using an external drive for the model isn't noticeably helping the initial model loading times. Thinking this must be a CPU buffer speed or PCIe bottleneck I am hitting.
Okay, working through the following...
Check for ACS capabilities
$ sudo lspci -vv
# Output pasted here: https://pastebin.com/1vm9geVR
# It looks like the switches list ACSCap but not the Nvidia devices...
It seems like the cards don't have anything listed. Three quick follow-up questions:
- Was that how it was for the RTX A5000s in your system?
- Are you using the same patched open kernel module install?
- Do you get any display output with any of the GPUs?
Install CUDA (dependency for nvbandwidth):
wget https://developer.download.nvidia.com/compute/cuda/13.1.0/local_installers/cuda_13.1.0_590.44.01_linux_sbsa.run
sudo sh cuda_13.1.0_590.44.01_linux_sbsa.run
However... I get a failure:
[INFO]: Driver not installed.
[INFO]: Checking compiler version...
[INFO]: gcc location: /usr/bin/gcc
[INFO]: gcc version: gcc version 14.2.0 (Debian 14.2.0-19)
[INFO]: Initializing menu
[INFO]: nvidia-fs.setKOVersion(2.27.3)
[INFO]: Setup complete
[INFO]: Installing: Driver
[INFO]: Installing: 590.44.01
[INFO]: Executing NVIDIA-Linux-aarch64-590.44.01.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed. Consult the driver log at /var/log/nvidia-installer.log for more details.
[ERROR]: Install of 590.44.01 failed, quitting
Going to try installing again, but without the 'Driver' checked. Maybe it can use my pre-installed driver?
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-13.1/
Please make sure that
- PATH includes /usr/local/cuda-13.1/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-13.1/lib64, or, add /usr/local/cuda-13.1/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-13.1/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 590.00 is required for CUDA 13.1 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
(Add nvcc and tooling to path with export PATH=$PATH:/usr/local/cuda-VERSION_HERE/bin.)
Looks like I need an older version. Off to the CUDA toolkit archive we go...
sudo /usr/local/cuda-13.1/bin/cuda-uninstaller
wget https://developer.download.nvidia.com/compute/cuda/12.9.1/local_installers/cuda_12.9.1_575.57.08_linux_sbsa.run
sudo sh cuda_12.9.1_575.57.08_linux_sbsa.run --tmpdir=/home/jgeerling/Downloads
Nope, didn't work:
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 575.00 is required for CUDA 12.9 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
Trying one more time, now that I've found that the driver version in use corresponds to a particular CUDA release...
wget https://developer.download.nvidia.com/compute/cuda/13.0.2/local_installers/cuda_13.0.2_580.95.05_linux_sbsa.run
sudo sh cuda_13.0.2_580.95.05_linux_sbsa.run
I still get:
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 580.00 is required for CUDA 13.0 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Confirm CUDA is installed correctly:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
And then using deviceQuery to get all information:
$ ./build/Samples/1_Utilities/deviceQuery/deviceQuery
./build/Samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 4070 Ti"
CUDA Driver Version / Runtime Version 13.0 / 13.0
CUDA Capability Major/Minor version number: 8.9
Total amount of global memory: 11874 MBytes (12450922496 bytes)
(060) Multiprocessors, (128) CUDA Cores/MP: 7680 CUDA Cores
GPU Max Clock rate: 2610 MHz (2.61 GHz)
Memory Clock rate: 10501 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 50331648 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 1 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "NVIDIA RTX A4000"
CUDA Driver Version / Runtime Version 13.0 / 13.0
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 15974 MBytes (16750149632 bytes)
(048) Multiprocessors, (128) CUDA Cores/MP: 6144 CUDA Cores
GPU Max Clock rate: 1560 MHz (1.56 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 1 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA GeForce RTX 4070 Ti (GPU0) -> NVIDIA RTX A4000 (GPU1) : No
> Peer access from NVIDIA RTX A4000 (GPU1) -> NVIDIA GeForce RTX 4070 Ti (GPU0) : No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 13.0, CUDA Runtime Version = 13.0, NumDevs = 2
Result = PASS
Set up nvbandwidth
# Install nvbandwidth
sudo apt install libboost-program-options-dev
cd Downloads
git clone https://github.com/NVIDIA/nvbandwidth.git
cd nvbandwidth
cmake .
make
Running this as the README states, I get:
$ cmake .
CMake Error at /usr/share/cmake-3.31/Modules/Internal/CMakeCUDAArchitecturesValidate.cmake:7 (message):
CMAKE_CUDA_ARCHITECTURES must be non-empty if set.
Call Stack (most recent call first):
/usr/share/cmake-3.31/Modules/CMakeDetermineCUDACompiler.cmake:112 (cmake_cuda_architectures_validate)
CMakeLists.txt:3 (project)
CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!
But it looks like something is cached specifying a newer version of CUDA, so I had to manually specify:
cmake . -DCMAKE_CUDA_ARCHITECTURES=native -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc
make
Test bandwidth with nvbandwidth:
./nvbandwidth -t device_to_device_memcpy_write_ce
Recorded a few raw videos of running through the setup and configuration.
Video 1: Physical Hardware
Unfortunately I wasn't able to disassemble the chassis containing the PCIe switch and move it outside the lab for better pictures.
There are two Cerio fabric nodes: one connected to the Minisforum dock, and a second in the upstream slot of the PCIe switch (which is inside the chassis). These two cards are linked together with 200Gbps QSFPs running over fiber optic cable. The cards also have an Ethernet jack used for management via a Fabric Controller utility.
https://github.com/user-attachments/assets/e51be58b-d5ec-4a09-9c71-4f2a9177aa03
https://github.com/user-attachments/assets/edf0b944-5d0d-4d79-b51a-f5e7d25805e7
Video 2: Composing GPUs
Raspberry Pi, Minisforum dock, and external switch + GPUs are already powered and online.
Showing the Cerio Fabric Controller side first. This is a CLI utility running on a separate machine that is used to manage and control the Cerio Fabric Nodes being used.
- List the Fabric Nodes in our network. Two nodes are listed: `rpi5`, the node on the Minisforum dock, and `target-node`, the node in the upstream slot of the external PCIe switch.
- Run a device list for all devices seen by our `target-node` (so all devices on the external PCIe switch). This shows the 4 RTX 5000 ADA GPUs.
- Run a compose command, specifying the `rpi5` node and the 4 GPUs that are downstream from `target-node`. This is a live operation that doesn't require rebooting any hardware.
Back to Raspberry Pi:
- lspci shows no external PCIe devices. 2 PCIe rescans are required in this case to get all the Cerio bridge devices along with our newly composed RTX5000 ADA GPUs to show up.
- lspci of the GPUs shows that they have no memory regions, and therefore the kernel driver hasn't bound. This is because the `0001:00:00.0` bridge doesn't currently have enough `Prefetchable memory behind bridge` to allocate to these GPUs.
- PCIe remove on `0001:00:00.0` followed by a PCIe rescan fixes this memory issue. `Prefetchable memory behind bridge` is now the correct size, the GPUs have correct memory ranges, and the kernel driver is bound.
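In command form, the sequence in the video roughly corresponds to (bridge BDF from my setup; adjust for yours):
echo 1 | sudo tee /sys/bus/pci/rescan                        # first rescan: Cerio bridge devices show up
echo 1 | sudo tee /sys/bus/pci/rescan                        # second rescan: composed GPUs show up
echo 1 | sudo tee /sys/bus/pci/devices/0001:00:00.0/remove   # remove the bridge that got too little prefetchable memory
echo 1 | sudo tee /sys/bus/pci/rescan                        # re-enumerate; BARs get assigned and the nvidia driver binds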
https://github.com/user-attachments/assets/a5c7a855-b8ae-44cd-9a3b-431e87c4af5a
Video 3: Configuring ACS
- Run a benchmark that shows the device to device performance between 2 of the GPUs. This peaks at 768MB/s, which is around the speed of PCIe Gen3 x1 that the RPI5 provides.
- List PCIe devices with ACS control/capability registers. We see 4 devices, which are external switch devices, that have ACSCtl values set to ON (+). Using setpci I set these to OFF (-).
- Rerunning the benchmark now shows 26396MB/s for device to device, which is the maximum speed the GPUs are able to push through the external PCIe switch. In this case the RTX 5000 ADA GPUs are the bottleneck as the PCIe switch I'm using is PCIe Gen5 x16 but the GPUs are only PCIe Gen4 x16.
Reference my earlier comment for more specific details on ACS and why it needs to be disabled https://github.com/geerlingguy/raspberry-pi-pcie-devices/issues/791#issuecomment-3621702101
https://github.com/user-attachments/assets/d09c3dc7-6176-4f3a-9d69-2a5872204306
Amazing detail, couldn't ask for more haha! Thanks for posting this, I guess I can just disable the ACS on the switch then, and it should work. I'm working on getting CUDA going, but maybe I'll give up on that and just run my AI llama.cpp benchmarks to see if the bandwidth is substantially different. Might also try vLLM
I seem to have corrupted my CUDA installation this morning while working on getting Docker + NVIDIA Container Toolkit functional (this does work btw), so I need to reinstall stuff. Believe I did nothing special the first time around, but I'll confirm. I specifically am using the DEB package cuda-repo-ubuntu2404-13-0-local_13.0.2-580.95.05-1_arm64.deb.
- Was that how it was for the RTX A5000s in your system?
Yes. The cards themselves don't have the ACS registers, but the switches underneath them have the registers. Hopefully my above comment + video helps clear that up.
- Are you using the same patched open kernel module install?
Yes. I'm using mariobalanica's patch, following the install instructions you've made: https://www.jeffgeerling.com/blog/2025/nvidia-graphics-cards-work-on-pi-5-and-rockchip
- Do you get any display output with any of the GPUs?
I do not get any display out from the RTX 5000 ADA unfortunately. Same situation when I tried my RTX 3070TI the other day on the same setup. Have yet to see display out working from this.
@mpsparrow Thank you so much for the help and patience, and detailed explanations :)
I didn't realize the 13.0.x series had a release matching the driver in use, so I'll try that next.
No issue on my end with the reinstall of CUDA.
- Follow your guide for installing drivers and patching. I'm specifically using the run file install of `NVIDIA-Linux-aarch64-580.95.05.run` https://www.jeffgeerling.com/blog/2025/nvidia-graphics-cards-work-on-pi-5-and-rockchip
- Use the deb local install for CUDA 13.0.2 (using exact commands from the "CUDA Toolkit Installer" section). https://developer.nvidia.com/cuda-13-0-2-download-archive?target_os=Linux&target_arch=arm64-sbsa&Compilation=Native&Distribution=Ubuntu&target_version=24.04&target_type=deb_local
- Set the path variables in `.bashrc`:
export PATH=${PATH}:/usr/local/cuda-13.0/bin
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64
- `nvcc --version` works:
user@pcipi:~ $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
@mpsparrow Thanks! I have it working (been updating the comment above, also added instructions in my blog post for quicker reference).
However, when running nvbandwidth, I am not getting a result...
./nvbandwidth -t device_to_device_memcpy_write_ce
nvbandwidth Version: v0.8
Built from Git version: v0.8
CUDA Runtime Version: 13000
CUDA Driver Version: 13000
Driver Version: 580.95.05
cm5
Device 0: NVIDIA GeForce RTX 4070 Ti (00000001:03:00)
Device 1: NVIDIA RTX A4000 (00000001:04:00)
Waived:
NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.
Is this possibly because my 4070 Ti isn't "pro" enough for this, or CUDA expects all of one type of GPU?
Ah... the 4070 Ti might just not be 'Pro' enough:
jgeerling@cm5:~/Downloads/nvbandwidth $ nvidia-smi topo -p2p r
GPU0 GPU1
GPU0 X NS
GPU1 NS X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
Found via https://github.com/NVIDIA/nvbandwidth/issues/45
I might try with an A400. Or I wonder if a Quadro card would work...
Ah... the 4070 Ti might just not be 'Pro' enough:
I don't actually know if that is the case. I believe it expects the same GPU type when doing any device to device level test, or at least similar hardware with the same architecture. If you try something like a device_to_host_memcpy_ce nvbandwidth test, I would expect that to work.
@mpsparrow - That works:
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0 1
0 0.82 0.82
SUM device_to_host_memcpy_ce 1.64
Is there any other way I could test device to device, easily? It would be nice to get a good clean number to verify they're talking directly, but if not, no biggie. I have your test data which is already adequate.
Getting another RTX A4000 is out of scope for me right now :D
Over in https://github.com/geerlingguy/ai-benchmarks/issues/44 I found that it looks like the switch as configured by Dolphin already had ACS disabled...
jgeerling@cm5:~ $ sudo lspci -vv | grep -E 'ACSCap|ACSCtl'
ACSCap: SrcValid- TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans+
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCap: SrcValid- TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans+
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
jgeerling@cm5:~ $ sudo lspci -vt
-[0001:00]---00.0-[01-04]--+-00.0-[02-04]--+-00.0-[03]--+-00.0 NVIDIA Corporation AD104 [GeForce RTX 4070 Ti]
| | \-00.1 NVIDIA Corporation AD104 High Definition Audio Controller
| \-01.0-[04]--+-00.0 NVIDIA Corporation GA104GL [RTX A4000]
| \-00.1 NVIDIA Corporation GA104 High Definition Audio Controller
\-00.1 Microchip Technology PM40036 Switchtec PFX 36xG4 Fanout PCIe Switch
-[0002:00]---00.0-[01]----00.0 Raspberry Pi Ltd RP1 PCIe 2.0 South Bridge
jgeerling@cm5:~ $ sudo setpci -s 0001:02:00.0 ECAP_ACS+6.W=0000
jgeerling@cm5:~ $ sudo setpci -s 0001:02:01.0 ECAP_ACS+6.W=0000
jgeerling@cm5:~ $ sudo lspci -vv | grep -E 'ACSCap|ACSCtl'
ACSCap: SrcValid- TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans+
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCap: SrcValid- TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans+
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
And after running the setpci commands (which seem to change nothing, since ACS is already disabled), llama.cpp results are identical. But I'm not sure llama.cpp is passing memory between GPUs anyway, given the way it splits up the workload.
@mpsparrow - Regarding boot enumeration, I sometimes have one device (when I'm running multiple) just not show up at all (with no obvious errors in dmesg), and I'm wondering if it's the same issue you see with your setup requiring the manual scan... is that something we could raise up to Raspberry Pi in their Linux repo that they could take a look at?
@geerlingguy I tested device-to-device with different GPU types using nvbandwidth and can confirm it does not work. A100 to A100 works, but L40S to A100 does not (this test was NOT done on a Raspberry Pi). So ultimately I think matching GPUs are needed. I also assume you may run into AI performance issues in mixed-GPU scenarios, since the cards can't do device-to-device transfers efficiently.
./nvbandwidth -t device_to_device_memcpy_write_ce
nvbandwidth Version: v0.8
Built from Git version: v0.8
CUDA Runtime Version: 13000
CUDA Driver Version: 13000
Driver Version: 580.65.06
Device 0: NVIDIA L40S (00000000:29:00)
Device 1: NVIDIA A100 80GB PCIe (00000000:27:00)
Device 2: NVIDIA A100 80GB PCIe (00000000:28:00)
Running device_to_device_memcpy_write_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2
0 N/A N/A N/A
1 N/A N/A 15.51
2 N/A 15.51 N/A
SUM device_to_device_memcpy_write_ce 31.02
@mpsparrow - Regarding boot enumeration, I sometimes have one device (when I'm running multiple) just not show up at all (with no obvious errors in dmesg), and I'm wondering if it's the same issue you see with your setup requiring the manual scan... is that something we could raise up to Raspberry Pi in their Linux repo that they could take a look at?
I'm happy to help push Raspberry Pi. The main question is whether this is an issue with the Pi/Linux itself or with the external PCIe hardware we are trying to use, i.e. whether it can be fixed by a kernel patch of some kind. I think some of our issues may boil down to "the Raspberry Pi boots up too dang fast", to where the external hardware hasn't fully initialized in time for the Pi's enumeration stage.
There used to be a boot_delay parameter but that appears to have been deprecated, and my attempts at using it yielded no extra delay. If I had a way to add delay before kernel start, I think I could narrow this down much more easily.
Only visible on archive.org as the current page no longer lists it. https://web.archive.org/web/20230303033257/https://www.raspberrypi.com/documentation/computers/config_txt.html#boot_delay
@mpsparrow - I've opened up https://github.com/raspberrypi/linux/issues/7172 — we'll see if there's any other way of debugging further.
@mpsparrow - https://github.com/raspberrypi/linux/issues/7172#issuecomment-3642841178
@P33M suggested testing out dtparam=pcie_tperst_clk_ms=250 in /boot/firmware/config.txt (if that's helpful for the switch chip...)
$ dtparam -h pcie_tperst_clk_ms
pcie_tperst_clk_ms Add N milliseconds between PCIe reference clock
activation and PERST# deassertion
(CM4 and 2712, default "0")
@geerlingguy
I've set the value to 10000ms.
dtparam=pcie_tperst_clk_ms=10000
This parameter appears to solve all my initial enumeration issues. I'm seeing devices show up on initial bootup whereas before I needed to do the PCIe rescan dance to get everything showing.
I've run out of time to poke at this further today, so I was unable to narrow down a more optimal parameter value -- it probably only needs a few thousand ms instead of 10000ms.
@mpsparrow Oh awesome! I will have to put this in my toolbelt too, when I'm having enumeration issues. I might throw the 3060 back on the switch and see if that was the problem there, too.
@mpsparrow - Hmm, now I can't replicate my enumeration failure anymore. Not sure why:
jgeerling@cm5:~ $ nvidia-smi
Thu Dec 11 16:59:21 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000001:03:00.0 Off | N/A |
| 30% 30C P0 31W / 170W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX A4000 Off | 00000001:04:00.0 Off | Off |
| 39% 38C P0 33W / 140W | 1MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I didn't make any other changes, just re-plugged everything in. Maybe there was something loose lol, I have no explanation. I tried two cold boots and two reboots, and every time, both cards show up now.