Test Nvidia RTX 4070 Ti
I have an ASUS ProArt 4070 Ti 12GB that I would like to test on the Raspberry Pi CM5 16GB.
It should work like other Nvidia cards, using mariobalanica's patch to the open GPU kernel module, though without display output for now.
lspci output:
```
0001:01:00.0 VGA compatible controller: NVIDIA Corporation AD104 [GeForce RTX 4070 Ti] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: ASUSTeK Computer Inc. Device 8902
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 189
    Region 0: Memory at 1b80000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at 1800000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at 1810000000 (64-bit, prefetchable) [size=32M]
    Expansion ROM at 1b81000000 [virtual] [disabled] [size=512K]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 000000fffffff000 Data: 0008
    Capabilities: [78] Express (v2) Legacy Endpoint, IntMsgNum 0
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ TEE-IO-
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
            ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
            10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
            AtomicOpsCtl: ReqEn-
            IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
            10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
        LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
            EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [b4] Vendor Specific Information: Len=14 <?>
    Capabilities: [100 v1] Virtual Channel
        Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb: Fixed- WRR32- WRR64- WRR128-
        Ctrl: ArbSelect=Fixed
        Status: InProgress-
        VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status: NegoPending- InProgress-
    Capabilities: [258 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
            PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
            T_CommonMode=0us
        L1SubCtl2: T_PwrOn=10us
    Capabilities: [128 v1] Power Budgeting <?>
    Capabilities: [420 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
            ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
            PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
            ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
            PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
            ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
            PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [bb0 v1] Physical Resizable BAR
        BAR 0: current size: 16MB, supported: 16MB
        BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
        BAR 3: current size: 32MB, supported: 32MB
    Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00 v1] Lane Margining at the Receiver
        PortCap: Uses Driver+
        PortSta: MargReady- MargSoftReady-
    Capabilities: [e00 v1] Data Link Feature <?>
    Kernel driver in use: nvidia
    Kernel modules: nvidia_drm, nvidia

0001:01:00.1 Audio device: NVIDIA Corporation AD104 High Definition Audio Controller (rev a1)
    Subsystem: ASUSTeK Computer Inc. Device 8902
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin B routed to IRQ 0
    Region 0: Memory at 1b81080000 (32-bit, non-prefetchable) [disabled] [size=16K]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [78] Express (v2) Endpoint, IntMsgNum 0
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W TEE-IO-
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <4us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
            ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
            10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS- TPHComp- ExtTPHComp-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
            AtomicOpsCtl: ReqEn-
            IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
            10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
            EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [100 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
            ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
            PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
            ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
            PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
            ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
            PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [160 v1] Data Link Feature <?>
```
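One detail worth flagging in the dump above: `LnkSta` reports `Speed 2.5GT/s (downgraded), Width x1`, which is likely just the link training down at idle with ASPM enabled, so it's worth re-checking under load. A quick filter for the negotiated speed/width (the `echo` here stands in for live output; on the Pi, pipe `sudo lspci -s 0001:01:00.0 -vv` through the same grep):

```shell
# Extract the negotiated link speed/width from an LnkSta line.
# The echo stands in for live lspci output for illustration.
echo "LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)" \
  | grep -o 'Speed [^,]*, Width x[0-9]*'
```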
And after installing the driver, here's nvidia-smi:
```
jgeerling@cm5:~ $ nvidia-smi
Thu Dec 11 11:16:18 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti Off | 00000001:01:00.0 Off | N/A |
| 0% 35C P8 6W / 285W | 1MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```
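The card idles at 6 W in P8 even behind the Pi's single lane. For quick scripted checks during a run, the stats row can be parsed positionally with awk; this is just a throwaway hack against the captured line above, not an official interface, and it breaks if the table layout changes:

```shell
# Pull power draw / cap out of the nvidia-smi stats row by field position.
# Fragile positional parsing -- fine for a one-off sanity check only.
line='| 0% 35C P8 6W / 285W | 1MiB / 12282MiB | 0% Default |'
echo "$line" | awk '{print "Power:", $5, "/", $7}'
```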
First, some AI benchmarks, since those at least work nicely without a display: https://github.com/geerlingguy/ai-benchmarks/issues/45
I'd like to test video transcoding, both with ffmpeg directly and maybe through Jellyfin. For repeatable benchmarks, I'm looking at encoder-benchmark. Here's how I'm setting it up:
```shell
# Install rust/cargo
curl https://sh.rustup.rs -sSf | sh

# Build the benchmark
cd Downloads
git clone https://github.com/Proryanator/encoder-benchmark.git
cd encoder-benchmark
cargo build --release

# Manually download source files (I chose `4k-60.y4m`, `1080-60.y4m`, and `720-60.y4m`):
# https://utsacloud-my.sharepoint.com/personal/hlz000_utsa_edu/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fhlz000%5Futsa%5Fedu%2FDocuments%2FEncoding%20Files&ga=1

# Run the permutor-cli
./target/release/permutor-cli

# Or run the benchmark
./target/release/benchmark
```
However, I opened an issue in that repo (https://github.com/Proryanator/encoder-benchmark/issues/80), because I can't get it to use video files in a separate directory. In lieu of that, I did one manual run with:
```shell
cd /home/jgeerling/Videos && \
time ffmpeg -i 720-60.y4m -c:v h264_nvenc -pix_fmt yuv420p -movflags +faststart 720-60.mp4 && \
time ffmpeg -i 1080-60.y4m -c:v h264_nvenc -pix_fmt yuv420p -movflags +faststart 1080-60.mp4 && \
time ffmpeg -i 4k-60.y4m -c:v h264_nvenc -pix_fmt yuv420p -movflags +faststart 4k-60.mp4
```
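Since ffmpeg's progress line only reports the most recent fps figure, an average can be back-computed from the total frame count (which ffmpeg prints as `frame=`) and the wall-clock time reported by `time`. A minimal sketch, using assumed values of 1800 frames in 15 seconds:

```shell
# Back-compute average FPS from total frames and elapsed wall time.
# Values here are assumptions for illustration (1800 frames, 15 s).
frames=1800
elapsed=15
echo "Average FPS: $(( frames / elapsed ))"
```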
However, that only shows the final FPS value, not an average. So I figured out how to get encoder-benchmark going (just put all the video files directly in that project's folder, heh), and now:
CM5 16GB with 4070 Ti
| Video file | File size | Time (sec) | Average fps |
|---|---|---|---|
| 720-60.y4m | 2.4G | 4 | 438 |
| 1080-60.y4m | 5.3G | 15 | 122 |
| 4k-60.y4m | 11G | 62.206 | 30 |
Full test data:
```
[Resolution: 1280x720]
[Encoder: h264_nvenc]
[FPS: 60]
[Bitrate: 10Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
[00:00:04] [####################################################################################] 1800/1800 frames (00:00:00)
Average FPS: 437
1%'ile: 236
90%'ile: 470
Benchmark runtime: 4s

[Resolution: 1920x1080]
[Encoder: h264_nvenc]
[FPS: 60]
[Bitrate: 20Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
[00:00:14] [####################################################################################] 1800/1800 frames (00:00:00)
Average FPS: 122
1%'ile: 80
90%'ile: 130
Benchmark runtime: 15s

[Resolution: 3840x2160]
[Encoder: h264_nvenc]
[FPS: 60]
[Bitrate: 55Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
[00:00:58] [####################################################################################] 1800/1800 frames (00:00:00)
Average FPS: 30
1%'ile: 22
90%'ile: 32
Benchmark runtime: 1m5s
```
It looks like the single PCIe Gen 3 lane was maxed out in bandwidth (around 850 MB/sec) during much of the 4k-60 run.
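That bandwidth figure is plausible from first principles: assuming the raw YUV 4:2:0 frames cross the link uncompressed, frame uploads alone account for a few hundred MB/s at the observed ~30 fps average encode rate, with readback of the encoded stream and driver traffic on top. A back-of-envelope check:

```shell
# Raw YUV 4:2:0 frame size is width * height * 1.5 bytes; multiply by the
# observed average encode rate to estimate upload bandwidth. Assumption:
# frames are transferred uncompressed over PCIe.
width=3840; height=2160; fps=30
bytes_per_frame=$(( width * height * 3 / 2 ))
echo "Frame: ${bytes_per_frame} bytes; ~$(( bytes_per_frame * fps / 1000000 )) MB/s at ${fps} fps"
```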
Intel Core Ultra 265K PC with 4070 Ti
| Video file | File size | Time (sec) | Average fps |
|---|---|---|---|
| 720-60.y4m | 2.4G | 1 | 1206 |
| 1080-60.y4m | 5.3G | 2 | 630 |
| 4k-60.y4m | 11G | 11 | 169 |
Full test data:
```
[Resolution: 1280x720]
[Encoder: h264_nvenc]
[FPS: 60]
[Bitrate: 10Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
[00:00:01] [#############################################################################] 1800/1800 frames (00:00:00)
Average FPS: 1206
1%'ile: 986
90%'ile: 1506
Benchmark runtime: 1s

[Resolution: 1920x1080]
[Encoder: h264_nvenc]
[FPS: 60]
[Bitrate: 20Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
[00:00:02] [#############################################################################] 1800/1800 frames (00:00:00)
Average FPS: 630
1%'ile: 464
90%'ile: 684
Benchmark runtime: 2s

[Resolution: 3840x2160]
[Encoder: h264_nvenc]
[FPS: 60]
[Bitrate: 55Mb/s]
[-preset p1 -tune ll -profile:v high -rc cbr -cbr true -gpu 0]
[00:00:10] [#############################################################################] 1800/1800 frames (00:00:00)
Average FPS: 169
1%'ile: 140
90%'ile: 172
Benchmark runtime: 11s
```
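Comparing the two result tables, the desktop's speedup grows with resolution, which is consistent with the CM5's single PCIe lane becoming more of a bottleneck as frames get larger. The ratios, computed from the measured average FPS:

```shell
# Desktop vs. CM5 speedup per resolution, from the measured average FPS
# values in the two result tables.
awk 'BEGIN {
  printf "720p:  %.1fx\n", 1206 / 438
  printf "1080p: %.1fx\n", 630 / 122
  printf "4K:    %.1fx\n", 169 / 30
}'
```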
Running Jellyfin (installed with its official installer), I set it to use hardware encoding (Nvidia NVENC), copied a few movies over, and could play at least two streams at a time (a 4K and a high-bitrate 1080p source file), transcoding through the GPU. Probably more, depending on the movie files.
Power consumption with this setup hovered around 29W idle, peaking around 100W during dual-stream transcoding.
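For a rough upper bound on energy per encode, assume the ~100 W peak held for the entire 62-second 4k-60 run (actual draw fluctuates, so the real figure is lower):

```shell
# Upper-bound energy for the 4k-60 encode: power (W) x time (s) = energy (J),
# converted to watt-hours. Treating peak power as constant is an overestimate.
awk 'BEGIN {
  watts = 100; seconds = 62
  joules = watts * seconds
  printf "%d J (%.1f Wh)\n", joules, joules / 3600
}'
```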