Test Nvidia RTX 4070 Ti

Open geerlingguy opened this issue 1 week ago • 5 comments

I have an ASUS ProArt 4070 Ti 12GB that I would like to test on the Raspberry Pi CM5 16GB.

Image

It should work like other Nvidia cards using mariobalanica's patch to the open GPU kernel module, though without display output for now.

geerlingguy, Dec 11 '25 17:12

lspci output:

0001:01:00.0 VGA compatible controller: NVIDIA Corporation AD104 [GeForce RTX 4070 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: ASUSTeK Computer Inc. Device 8902
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 189
	Region 0: Memory at 1b80000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at 1800000000 (64-bit, prefetchable) [size=256M]
	Region 3: Memory at 1810000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at 1b81000000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 000000fffffff000  Data: 0008
	Capabilities: [78] Express (v2) Legacy Endpoint, IntMsgNum 0
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ TEE-IO-
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x1 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
			 AtomicOpsCtl: ReqEn-
			 IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
			 10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [258 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
			ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
			ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
			ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [bb0 v1] Physical Resizable BAR
		BAR 0: current size: 16MB, supported: 16MB
		BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
		BAR 3: current size: 32MB, supported: 32MB
	Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00 v1] Lane Margining at the Receiver
		PortCap: Uses Driver+
		PortSta: MargReady- MargSoftReady-
	Capabilities: [e00 v1] Data Link Feature <?>
	Kernel driver in use: nvidia
	Kernel modules: nvidia_drm, nvidia

0001:01:00.1 Audio device: NVIDIA Corporation AD104 High Definition Audio Controller (rev a1)
	Subsystem: ASUSTeK Computer Inc. Device 8902
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin B routed to IRQ 0
	Region 0: Memory at 1b81080000 (32-bit, non-prefetchable) [disabled] [size=16K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, IntMsgNum 0
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W TEE-IO-
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x1 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
			 AtomicOpsCtl: ReqEn-
			 IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
			 10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
			ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
			ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
			ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
			PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [160 v1] Data Link Feature <?>
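
The lines that matter most in a dump like this are LnkCap (what the card supports) and LnkSta (what was actually negotiated). Here is a minimal sketch for pulling just those two out of `lspci -vv` output. The function name `pcie_link_report` is made up for illustration, and the parsing assumes the line layout shown above.

```shell
# Hypothetical helper: read `lspci -vv` text on stdin and print, per device,
# the maximum link (LnkCap) next to the currently negotiated link (LnkSta).
pcie_link_report() {
  awk '
    # Device header lines start at column 0 with the PCI address.
    /^[0-9a-f:.]+ / { dev = $1 }
    # "LnkCap:" does not match "LnkCap2:", so only the primary line is used.
    /LnkCap:/ { cap = $0; sub(/.*Speed /, "", cap); sub(/, ASPM.*/, "", cap) }
    /LnkSta:/ { sta = $0; sub(/.*Speed /, "", sta)
                printf "%s  max: %s  now: %s\n", dev, cap, sta }
  '
}

# Usage (needs root to see the capability blocks):
#   sudo lspci -vv -s 0001:01:00.0 | pcie_link_report
```

An idle GPU may report a lower link speed than it uses under load, so it is probably worth running this again while a benchmark is going.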

And after installing the driver, here's nvidia-smi:

jgeerling@cm5:~ $ nvidia-smi
Thu Dec 11 11:16:18 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     Off |   00000001:01:00.0 Off |                  N/A |
|  0%   35C    P8              6W /  285W |       1MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
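
For logging alongside benchmarks, the same information is available in a machine-readable form through nvidia-smi's `--query-gpu` option. A rough sketch follows; the helper name `gpu_summary` and the choice of fields are my own, and the sample row is just the values from the table above.

```shell
# Hypothetical helper: turn one CSV row from
#   nvidia-smi --query-gpu=name,driver_version,memory.total,power.draw \
#              --format=csv,noheader,nounits
# into a single summary line. Fields are separated by ", ".
gpu_summary() {
  awk -F', ' '{ printf "%s | driver %s | %d MiB | %.0f W\n", $1, $2, $3, $4 }'
}

# Sample row standing in for live nvidia-smi output:
echo 'NVIDIA GeForce RTX 4070 Ti, 580.95.05, 12282, 6.09' | gpu_summary
```

On a machine with the driver loaded, piping the real nvidia-smi query into `gpu_summary` should give the same kind of line.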

geerlingguy, Dec 11 '25 17:12

I ran some AI benchmarks first, since those at least work nicely without a display: https://github.com/geerlingguy/ai-benchmarks/issues/45

geerlingguy, Dec 11 '25 17:12

I'd like to test video transcoding capabilities, with both ffmpeg directly, and maybe Jellyfin...

For some repeatable benchmarks, I'm looking at encoder-benchmark. Here's how I'm trying to set it up:

# Install rust/cargo, then load cargo into the current shell
curl https://sh.rustup.rs -sSf | sh
. "$HOME/.cargo/env"

# Build the benchmark
cd ~/Downloads
git clone https://github.com/Proryanator/encoder-benchmark.git
cd encoder-benchmark
cargo build --release

# Manually download source files (I chose `4k-60.y4m`, `1080-60.y4m`, and `720-60.y4m`) from:
# https://utsacloud-my.sharepoint.com/personal/hlz000_utsa_edu/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fhlz000%5Futsa%5Fedu%2FDocuments%2FEncoding%20Files&ga=1

# Run the permutor-cli
./target/release/permutor-cli

# Or run the benchmark
./target/release/benchmark

However, I opened an issue in that repo (https://github.com/Proryanator/encoder-benchmark/issues/80) because I can't get it to use video files in a separate directory.

geerlingguy, Dec 11 '25 17:12

In lieu of that, I did one manual run with:

cd /home/jgeerling/Videos && \
time ffmpeg -i 720-60.y4m -c:v h264_nvenc -pix_fmt yuv420p -movflags +faststart 720-60.mp4 && \
time ffmpeg -i 1080-60.y4m -c:v h264_nvenc -pix_fmt yuv420p -movflags +faststart 1080-60.mp4 && \
time ffmpeg -i 4k-60.y4m -c:v h264_nvenc -pix_fmt yuv420p -movflags +faststart 4k-60.mp4

However, that only shows the last FPS value, not an average. So I figured out how to get encoder-benchmark going (just put all the video files directly in that project's folder, heh), and now:

CM5 16GB with 4070 Ti

Video file     File size   Time (sec)   Average fps
720-60.y4m     2.4G        4            438
1080-60.y4m    5.3G        15           122
4k-60.y4m      11G         62.206       30

Full test data:

[Resolution:	1280x720]
[Encoder:	h264_nvenc]
[FPS:		60]
[Bitrate:	10Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
  [00:00:04] [####################################################################################] 1800/1800 frames (00:00:00)
  Average FPS:	437
  1%'ile:	236
  90%'ile:	470

Benchmark runtime: 4s

[Resolution:	1920x1080]
[Encoder:	h264_nvenc]
[FPS:		60]
[Bitrate:	20Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
  [00:00:14] [####################################################################################] 1800/1800 frames (00:00:00)
  Average FPS:	122
  1%'ile:	80
  90%'ile:	130

Benchmark runtime: 15s

[Resolution:	3840x2160]
[Encoder:	h264_nvenc]
[FPS:		60]
[Bitrate:	55Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
  [00:00:58] [####################################################################################] 1800/1800 frames (00:00:00)
  Average FPS:	30
  1%'ile:	22
  90%'ile:	32

Benchmark runtime: 1m5s

It looks like the single PCIe Gen 3 lane was saturated (around 850 MB/sec) during much of the 4k-60 run:

Image
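
For context on that 850 MB/sec figure, here is a quick back-of-the-envelope calculation of what a single lane can carry in theory. PCIe Gen 1 and 2 use 8b/10b line encoding and Gen 3 and later use 128b/130b, so the usable rate is a little below the raw GT/s number. The function name `pcie_bandwidth_mbps` is just for illustration, and this ignores packet (TLP) overhead, so real throughput will be lower still.

```shell
# Theoretical one-direction PCIe bandwidth in MB/s.
#   $1 = transfer rate in GT/s (2.5, 5, 8, 16, ...)
#   $2 = number of lanes
pcie_bandwidth_mbps() {
  awk -v gts="$1" -v lanes="$2" 'BEGIN {
    # 8b/10b below 8 GT/s, 128b/130b from 8 GT/s upward
    eff = (gts >= 8) ? 128 / 130 : 8 / 10
    printf "%.0f\n", gts * 1000 * eff / 8 * lanes
  }'
}

max=$(pcie_bandwidth_mbps 8 1)   # Gen 3 x1 -> 985
observed=850                     # MB/sec, the figure quoted above
awk -v o="$observed" -v m="$max" \
  'BEGIN { printf "%d MB/s is %.0f%% of the %d MB/s theoretical maximum\n", o, o / m * 100, m }'
```

That works out to roughly 86% of the Gen 3 x1 ceiling, which is about as full as a link tends to get once protocol overhead is counted.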

Intel Core Ultra 265K PC with 4070 Ti

Video file     File size   Time (sec)   Average fps
720-60.y4m     2.4G        1            1206
1080-60.y4m    5.3G        2            630
4k-60.y4m      11G         11           169

Full test data:

[Resolution:	1280x720]
[Encoder:	h264_nvenc]
[FPS:		60]
[Bitrate:	10Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
  [00:00:01] [#############################################################################] 1800/1800 frames (00:00:00)
  Average FPS:	1206
  1%'ile:	986
  90%'ile:	1506

Benchmark runtime: 1s

[Resolution:	1920x1080]
[Encoder:	h264_nvenc]
[FPS:		60]
[Bitrate:	20Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
  [00:00:02] [#############################################################################] 1800/1800 frames (00:00:00)
  Average FPS:	630
  1%'ile:	464
  90%'ile:	684

Benchmark runtime: 2s

[Resolution:	3840x2160]
[Encoder:	h264_nvenc]
[FPS:		60]
[Bitrate:	55Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
  [00:00:10] [#############################################################################] 1800/1800 frames (00:00:00)
  Average FPS:	169
  1%'ile:	140
  90%'ile:	172

Benchmark runtime: 11s
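
To put the two sets of results side by side, here is a small sketch that divides the PC's average fps by the CM5's for each file. The helper name `speedup` is made up, and the inputs are the averages from the two summary tables above.

```shell
# Ratio of two average-fps figures, to one decimal place.
#   $1 = faster system's fps, $2 = slower system's fps
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

speedup 1206 438   # 720-60.y4m  -> 2.8
speedup 630 122    # 1080-60.y4m -> 5.2
speedup 169 30     # 4k-60.y4m   -> 5.6
```

The gap grows with resolution, which would fit with the PCIe link on the CM5 being the limiting factor for the larger frames, though that is only an inference from these three data points.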

geerlingguy, Dec 11 '25 18:12

I installed Jellyfin with its official installer, set it to use hardware encoding (Nvidia NVENC), and copied a few movies over. It could play at least two at a time (one 4K and one 1080p high-bitrate source file), transcoding through the GPU, and probably more depending on the movie files:

Image

Image

Power consumption with this setup hovered around 29 W at idle, peaking around 100 W during dual-stream transcoding:

Image

geerlingguy, Dec 11 '25 19:12