Mesa-3D
Low frame rates on Starcraft 2 while CPU and GPU utilization is low
Long story short, I observed a scenario where I get low frame rates even though the CPU and GPU utilization was low. This is a continuation of this issue where we started discussing a performance problem with Starcraft 2 when running on Gallium Nine standalone.
Hardware details: the system is a Dell XPS 13 9370 with a 4-core / 8-thread Intel 8550U, a Sapphire AMD Radeon RX 570 ITX card is connected through Thunderbolt 3 (which is roughly equivalent to a x4 PCI-E connection), and a 4K Dell U2718Q screen is connected to the RX 570's DisplayPort output.
I didn't measure the frame rates in a scientific way, but the difference is very noticeable:
- On the same setup I easily get 60-80 fps on all high settings on Windows 10
- Running Gallium Nine standalone on Fedora 29's Wine 3.21 on medium settings results in about 30-40 fps ― the culprit seems to be the shaders; if I turn it down to low shaders then it gets above 60 fps.
- If I run Wine 3.13 with the PBA patch from the espionage724 repo, then I get about 50-60 fps (again on medium settings)
Some observations from the other thread that are worth mentioning here:
- The frame rate doesn't seem to be related to resolution. Setting the system to use 1080p (using Gnome settings) results in similar numbers when using Nine.
- According to GALLIUM_HUD, CPU utilization is around 30% and GPU utilization is around 50%.
- According to perf top, the majority of time is spent in functions like si_set_constant_buffer, amdgpu_mm_rreg and NineDevice9_SetIndices, from which I got the impression that some buffers are copied to/from the GPU and that may be a problem.
@axeldavy
On the first campaign, after the videos, I get about 140 fps on full hd, everything maxed out on my radeon rx 480. The GPU load is about 90%. My 4 cpu threads are all around 60-70%. I used GALLIUM_HUD=GPU-load,fps,cpu0+cpu1+cpu2+cpu3,shader-clock+memory-clock+temperature tearfree_discard=true WINEDEBUG=-all
That does sound awesome, and makes me think that the problem may be with my setup, or that maybe my setup triggers some sort of corner case within Nine. After thinking about it more, I've got three ideas:
Idea 1. Maybe there's something wrong with my kernel command line? Basically, I disable all the spectre/meltdown mitigations, enable the PowerPlay features in amdgpu, and blacklist the i915 module.
resume=UUID=0dc28d3d-cf9a-4a1d-b980-e7f78ad7aaee rd.luks.uuid=luks-98f2c2d3-77e1-444a-b3f2-d3396b53e16e rhgb quiet mem_sleep_default=deep pti=off spectre_v2=off l1tf=off nospec_store_bypass_disable no_stf_barrier amdgpu.ppfeaturemask=0xffffffff i915.enable_guc=0 module_blacklist=i915 3
Idea 2. Maybe somehow Nine performs more IO through PCI-E than wined3d and the Thunderbolt 3 port is the bottleneck. Is it a possibility that there are some operations within Nine that are not noticeable when using a PCI-E x16 connection but become problematic when running over PCI-E x4? I don't know how to verify this theory, but maybe there is a way to check the PCI-E port usage.
Idea 3. It occurs to me that I installed DirectX 9 from winetricks in this Wine prefix before I started using Nine. Is it possible that this interferes with Nine somehow?
@iiv3
do you compile your own kernel? If so, you might have some additional tracing enabled.
No, I've got 4.19 from Fedora: 4.19.13-300.fc29.x86_64
If not... Could you try to disable "iommu=off", in case you are hitting some "soft" emulation that involves page faults.
Good idea, I will try that.
If this doesn't help either, try entering the "amdgpu_mm_rreg" in perf top and see which instructions are getting most heat.
Sure, I'll take a look, though I'm afraid I'm not an expert in perf.
Sorry for the long post!
Actually most motherboards have very few PCIE x16 slots, usually one or two. While there might be more full-sized PCIE slots, they actually work as x4 ones. So maybe @axeldavy could try and test his video card on an x4 PCIE slot.
@Venemo,
You could try the radeontop program, which shows utilization of specific GPU subsystems. It doesn't sound like the bottleneck is there, but it might give some useful hint.
You could also take a look at what options are available for you in R600_DEBUG; despite its name, it works on the latest radeons too. Do an R600_DEBUG=help glxgears and you will get all available options (some drivers have more). Try everything that is not about dumping stuff to the console. (Try it one by one.)
Thunderbolt 3 definitely complicates things.
I would add testing all working iommu types, in case you do need hardware support for it.
I had a similar slowdown caused by the fact that my distribution had an i586-optimized glibc, which used a memcpy() with a hack for the original Pentium. It would read from the destination memory before overwriting it with real data. When used over my PCIE x16 it would get something like 25% CPU in perf top and 50 fps instead of 70 fps (benchmark avg).
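To make the failure mode concrete, here is a toy sketch (not the actual i586 glibc routine) of a copy that reads the destination before writing it. Over normal cached RAM the extra read is nearly free, but when the destination is a write-combined PCIe mapping each read becomes an uncached bus round trip, which is how a memcpy() can end up dominating perf top:
#include <stddef.h>
#include <stdint.h>
/* Toy illustration only: a memcpy variant that touches the destination first,
 * mimicking the old i586 glibc behaviour described above. */
static void memcpy_read_dst_first(void *dst, const void *src, size_t n)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;
    for (size_t i = 0; i < n; i++) {
        (void)d[i];   /* the problematic read: cheap in RAM, very slow over PCIe */
        d[i] = s[i];  /* the actual copy */
    }
}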
This reminds me...
You had two separate perf top logs that showed si_set_constant_buffer and amdgpu_mm_rreg as top usage functions... but I could not quite understand what you have done differently to get one or the other result. It's not compilation or inlining, as one is from mesa3d and the other is from the kernel.
If you are compiling your own mesa3d, try to use -O1 or -Og, it might give some hints. Actually, compiling with debug support would be a very good idea.
When you run perf top you can select a function with the arrow keys. Press enter once, then press it a second time to enter Annotate mode. In that mode it will show you how hot each instruction of that function is. If you have debug support it might even show you the source.
Also, I've tried to make a line that would show all graphs on the screen, try it in case it provides some hint.
export GALLIUM_HUD=\
".dfps+cpu+GPU-load+temperature,.dGTT-usage+VRAM-usage,num-compilations+num-shaders-created;"\
"primitives-generated+draw-calls,samples-passed+ps-invocations+vs-invocations,buffer-wait-time;"\
"CS-thread-busy+gallium-thread-busy,dma-calls+cp-dma-calls,num-bytes-moved;"\
"num-vs-flushes+num-ps-flushes+num-cs-flushes+num-CB-cache-flushes+num-DB-cache-flushes"
You can check if you have extra graphs with GALLIUM_HUD=help glxgears and add them too.
@iiv3 Thanks for the suggestions! I'll try them.
I had a similar slowdown caused by the fact that my distribution had an i586-optimized glibc, which used a memcpy() with a hack for the original Pentium. It would read from the destination memory before overwriting it with real data.
What was the solution to this one? How do I check if this is the case?
You had two separate perf top logs that showed si_set_constant_buffer and amdgpu_mm_rreg as top usage functions... but I could not quite understand what you have done differently to get one or the other result.
I honestly don't know, it is possible that I have messed up something in that WINEPREFIX. I will re-install the game in a clean prefix and use that for further testing.
Every frame, vertices must be sent to the graphics card for the draw calls. There are also the draw commands sent by the driver. Good games don't upload anything else.
As mentioned already, the vertices for SC2 are stored in a permanently mapped buffer. The allocation is done in GTT, i.e. graphics memory in CPU RAM. At draw time, the graphics card fetches the data from there. The thing is, if that were the limiting factor, the GPU load would be higher: the GPU is running shaders when fetching this vertex data, and it should count as GPU running. Still, you can try using VRAM for these allocations (the size of the mappable VRAM is restricted to 64MB, but it should be enough there). Replace all PIPE_USAGE_STREAM in buffer9.c and nine_buffer_upload.c with PIPE_USAGE_DEFAULT. Possibly in your case it would be faster, as the data would make it to the GPU sooner.
As for the draw commands, there's not much to do.
Maybe, as iive suggested, you have wrong flags somewhere, which leads to suboptimal functions being used.
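To make the PIPE_USAGE suggestion above concrete, here is a minimal sketch (a hypothetical helper, not the actual buffer9.c code) of what that usage-hint change amounts to when the buffer resource is created: PIPE_USAGE_STREAM hints at a CPU-visible GTT placement, PIPE_USAGE_DEFAULT at VRAM.
#include <string.h>
#include "pipe/p_defines.h"
#include "pipe/p_format.h"
#include "pipe/p_screen.h"
#include "pipe/p_state.h"
/* Hypothetical helper for illustration: create a vertex buffer with an explicit
 * usage hint. Swapping PIPE_USAGE_STREAM for PIPE_USAGE_DEFAULT in buffer9.c and
 * nine_buffer_upload.c is essentially this change applied to Nine's resource templates. */
static struct pipe_resource *
create_vertex_buffer(struct pipe_screen *screen, unsigned size,
                     enum pipe_resource_usage usage)
{
    struct pipe_resource templ;
    memset(&templ, 0, sizeof(templ));
    templ.target = PIPE_BUFFER;
    templ.format = PIPE_FORMAT_R8_UNORM;
    templ.bind = PIPE_BIND_VERTEX_BUFFER;
    templ.usage = usage;      /* STREAM -> GTT (CPU RAM), DEFAULT -> VRAM */
    templ.width0 = size;
    templ.height0 = 1;
    templ.depth0 = 1;
    templ.array_size = 1;
    return screen->resource_create(screen, &templ);
}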
Okay, so I upgraded to Fedora 29, kernel 5.0-rc1 and wine staging 4.0-rc1. Then I deleted the wine prefix and reinstalled the game in a clean prefix.
If this doesn't help either, try entering the "amdgpu_mm_rreg" in perf top and see which instructions are getting most heat.
There is a mov instruction which takes ~96% of the time spent in that function.
You could try radeon top program, that shows utilization of specific GPU subsystems. It doesn't sound like the bottleneck is there, but it might give some useful hint.
Here is the output. I couldn't copy-paste the bars from the terminal, but the percentages are clearly visible.
Graphics pipe                39,17%
Event Engine                  0,00%
Vertex Grouper + Tesselator  10,00%
Texture Addresser            24,17%
Shader Export                31,67%
Sequencer Instruction Cache   8,33%
Shader Interpolator          33,33%
Scan Converter               34,17%
Primitive Assembly           10,83%
Depth Block                  30,83%
Color Block                  30,00%
2815M / 3995M VRAM           70,44%
73M / 4093M GTT               1,79%
1,75G / 1,75G Memory Clock  100,00%
1,24G / 1,24G Shader Clock  100,00%
You could also take a look at what options are available for you in R600_DEBUG
I took a look, and tried a few options, but some of them caused a crash, while others caused a noticeable performance decrease, so I didn't get too far with it. Is there a specific option you want me to try?
I would add testing all working iommu types, in case you do need hardware support for it.
With iommu=off my wifi card is not recognized, which is a problem. With iommu=on intel_iommu=on I get roughly +5 fps, and the "gallium thread busy" line from GALLIUM_HUD goes to zero. So I guess that is a good thing at least.
If you are compiling your own mesa3d try to use -O1 or -Og, it might give some hints. Actually compiling with debug support would be a very good idea.
I just use the mesa that comes with Fedora. I'll also try the version from the "che" repo, which is supposed to be a newer version with more optimizations enabled; I will report back if it helps.
Also, I've tried to make a line that would show all graphs on the screen, try it in case it provides some hint.
Here is a screenshot of the graphs with intel_iommu=on:
Oops, I accidentally hit "close and comment", sorry. Reopening it now.
@axeldavy Thanks! How come the samples passed and the ps/vs invocations are so low on your machine? Also, the primitives generated seem to be much fewer than mine.
It depends on the scene and resolution.
Okay, I tried the patch suggested by @axeldavy and replaced PIPE_USAGE_STREAM with PIPE_USAGE_DEFAULT in the specified files. It definitely helps, but does not solve the problem.
Here are the graphs from the patched nine:
Hotspots from perf top:
13,66% d3dadapter9.so.1.0.0 [.] si_set_constant_buffer
4,88% anonmap.WTz65d (deleted) [.] 0x000000000023f415
2,69% d3dadapter9.so.1.0.0 [.] NineDevice9_SetIndices
2,57% d3dadapter9.so.1.0.0 [.] amdgpu_do_add_real_buffer
2,19% d3dadapter9.so.1.0.0 [.] amdgpu_cs_add_buffer
1,30% dsound.dll.so [.] 0x000000000002997a
1,07% anonmap.WTz65d (deleted) [.] 0x000000000023f403
0,63% libc-2.28.so [.] __memmove_avx_unaligned_erms
0,59% anonmap.WTz65d (deleted) [.] 0x0000000001272ed6
0,58% dsound.dll.so [.] 0x00000000000298e1
Here is the hot spot from the annotated si_set_constant_buffer:
│ si_descriptors.c:1212 ▒
0,01 │ mov (%rsi),%rax ▒
0,00 │ lea (%rax,%r14,8),%rdx ▒
│ ../../../../src/gallium/auxiliary/util/u_inlines.h:139 ▒
0,07 │ mov (%rdx),%rsi ▒
│ ../../../../src/gallium/auxiliary/util/u_inlines.h:79 ▒
0,01 │ cmp %rsi,%r8 ▒
│ ↓ je 8c ▒
│ ../../../../src/gallium/auxiliary/util/u_inlines.h:87 ▒
0,00 │ test %rsi,%rsi ▒
│ ↓ je 8c ▒
│ ../../../../src/gallium/auxiliary/util/u_inlines.h:89 ▒
98,08 │ lock subl $0x1,(%rsi) ▒
1,16 │ ↓ jne 8c ▒
│ ../../../../src/gallium/auxiliary/util/u_inlines.h:146 ▒
0,00 │ 4d: mov 0x20(%rsi),%rax ▒
│ ../../../../src/gallium/auxiliary/util/u_inlines.h:148 ▒
│ mov 0x28(%rsi),%rcx ▒
│ mov %rdx,0x8(%rsp)
I also examined the same graph while running wined3d with the PBA patch:
Hotspots according to perf top:
16,67% radeonsi_dri.so [.] u_upload_alloc ◆
3,12% radeonsi_dri.so [.] cso_set_vertex_buffers ▒
2,27% radeonsi_dri.so [.] tc_call_draw_vbo ▒
1,84% radeonsi_dri.so [.] amdgpu_cs_add_buffer ▒
1,68% libc-2.28.so [.] __memmove_avx_unaligned_erms ▒
1,53% radeonsi_dri.so [.] amdgpu_do_add_real_buffer ▒
1,25% wined3d.dll.so [.] 0x000000000005a390 ▒
1,20% wined3d.dll.so [.] 0x000000000005b5e4 ▒
1,17% wined3d.dll.so [.] 0x000000000005b870 ▒
0,90% libpthread-2.28.so [.] __pthread_mutex_lock ▒
0,79% ntdll.dll.so [.] RtlEnterCriticalSection ▒
0,73% wined3d.dll.so [.] 0x000000000005a3af ▒
0,66% wined3d.dll.so [.] 0x000000000005a39a ▒
0,64% wined3d.dll.so [.] 0x000000000005b52d ▒
0,63% wined3d.dll.so [.] 0x000000000005b5d0 ▒
0,63% wined3d.dll.so [.] 0x000000000005b670 ▒
0,61% wined3d.dll.so [.] 0x000000000005b4a6 ▒
0,61% ntdll.dll.so [.] RtlLeaveCriticalSection ▒
0,59% [amdgpu] [k] amdgpu_mm_rreg ▒
0,57% wined3d.dll.so [.] 0x000000000005aad0 ▒
0,57% libpthread-2.28.so [.] __pthread_mutex_unlock_usercnt ▒
0,53% wined3d.dll.so [.] 0x000000000005b5ed ▒
0,51% radeonsi_dri.so [.] _mesa_unmarshal_dispatch_cmd ▒
0,50% wined3d.dll.so [.] 0x000000000005aabd
Which means that wined3d+PBA still beats nine by a dozen or so FPS. What stands out to me is:
- wined3d-pba uses 10 times as much GTT, but only half the VRAM
- nine uses significantly less CPU, but wined3d-pba manages to load the GPU about 10% more than nine
- the ps/vs invocations are in the same ballpark, though wined3d-pba has somewhat fewer samples passed
Further notes:
- Both screenshots were taken from approx. the same point in the same replay on the same graphical settings.
- For the sake of this experiment I downgraded to Wine 3.13, because that is the version I could easily get for Fedora with the PBA patch included. I recompiled Nine standalone for this Wine version.
- mesa 18.2.8 was patched, the patch was added to mesa.spec, and the whole x86_64 mesa package was rebuilt, but I only installed the mesa-d3d packages from the build output. (I assume the change doesn't affect the rest of the packages.)
Where do we go from here?
Here is what lspci -vvv has to say about the GPU:
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef) (prog-if 00 [VGA controller])
Subsystem: Sapphire Technology Limited Device e343
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 128 bytes
Interrupt: pin A routed to IRQ 156
Region 0: Memory at 60000000 (64-bit, prefetchable) [size=256M]
Region 2: Memory at 70000000 (64-bit, prefetchable) [size=2M]
Region 4: I/O ports at 2000 [size=256]
Region 5: Memory at ac000000 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at ac040000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x4 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee006b8 Data: 0000
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [200 v1] Resizable BAR <?>
Capabilities: [270 v1] Secondary PCI Express <?>
Capabilities: [2b0 v1] Address Translation Service (ATS)
ATSCap: Invalidate Queue Depth: 00
ATSCtl: Enable-, Smallest Translation Unit: 00
Capabilities: [2c0 v1] Page Request Interface (PRI)
PRICtl: Enable- Reset-
PRISta: RF- UPRGI- Stopped+
Page Request Capacity: 00000020, Page Request Allocation: 00000000
Capabilities: [2d0 v1] Process Address Space ID (PASID)
PASIDCap: Exec+ Priv+, Max PASID Width: 10
PASIDCtl: Enable- Exec- Priv-
Capabilities: [320 v1] Latency Tolerance Reporting
Max snoop latency: 71680ns
Max no snoop latency: 71680ns
Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [370 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Kernel driver in use: amdgpu
Kernel modules: amdgpu
Here is what amdgpu's sysfs has to say:
[root@timur-xps ~]# cat /sys/class/drm/card1/device/pp_dpm_pcie
0: 2.5GT/s, x8
1: 8.0GT/s, x16 *
Reducing the bandwidth using echo 0 > /sys/class/drm/card1/device/pp_dpm_pcie did not seem to have any effect on the frame rate, but I'm honestly not sure how reliable pp_dpm_pcie is. Or maybe the TB3 bottleneck is more severe than the restriction it can impose.
Trying to determine the available bandwidth using clpeak is not possible due to the missing PCI-E atomics support:
kfd kfd: skipped device 1002:67df, PCI rejects atomics
However this egpu.io page states the following: Intel throttles 32Gbps-TB3 to 22Gbps which benchmarks as 20.40Gbps (AMD) or 18.91Gbps (Nvidia) in the important H2D direction.
I took a look at the hot spots for Nine in perf top and investigated a bit:
si_set_constant_buffer -> pipe_resource_reference -> pipe_reference_described -> p_atomic_dec_zero
which is a call to either __atomic_sub_fetch or __sync_sub_and_fetch (preferring the __atomic version if available).
Looking at the other items like NineDevice9_SetIndices and amdgpu_do_add_real_buffer, all of them wait on a similar synchronization primitive.
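For context, a minimal self-contained sketch (not the real u_inlines.h code) of the reference-counting pattern that call chain boils down to; the hot "lock subl $0x1" in the annotation above is the atomic decrement, which hits the buffer's refcount every time a constant buffer is (re)bound:
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
/* Toy stand-in for a pipe_resource: just a refcount. */
struct toy_resource {
    int32_t refcount;
};
static void toy_resource_destroy(struct toy_resource *res)
{
    free(res);
}
/* Sketch of the pipe_resource_reference() pattern: retain the new buffer,
 * release the old one. The atomic decrement is what shows up as
 * "lock subl $0x1,(%rsi)" in the perf annotation. */
static void toy_resource_reference(struct toy_resource **dst, struct toy_resource *src)
{
    struct toy_resource *old = *dst;
    if (old == src)
        return;
    if (src)
        __atomic_add_fetch(&src->refcount, 1, __ATOMIC_SEQ_CST);
    if (old && __atomic_sub_fetch(&old->refcount, 1, __ATOMIC_SEQ_CST) == 0)
        toy_resource_destroy(old);
    *dst = src;
}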
Cannot reproduce here either
- Xeon E3-1231 v3 @ 3.40GHz
- R9 380X
- Linux 4.19.13
- mesa master from ~2 weeks ago
- wine 4.0~rc6
- sc2 set to 1080p with ultra gfx settings
without vsync:
and with vsync:
@dhewg @axeldavy Which distro do you guys use? Just asking because I might just try running that distro as an experiment. Maybe there is some package that is better optimized there than here, and maybe that makes the difference. Or maybe mesa itself is compiled with different flags?
I'm on Debian, @axeldavy on Arch iirc.
I built mesa myself though, using CFLAGS="-march=native" CXXFLAGS="-march=native" meson ~/src/mesa --prefix /opt/andre/mesa -Ddri-drivers= -Dgallium-drivers=radeonsi,swrast -Dvulkan-drivers=amd -Dgallium-nine=true -Dosmesa=gallium
Okay, so @axeldavy suggested an approach that helps find out whether the constant data is the bottleneck or not:
08:20 mannerov: in nine_state.c
08:20 mannerov: prepare_vs_constants_userbuf
08:20 mannerov: when cb.buffer_size is set
08:21 mannerov: set it to the maximum value instead, that is sizeof(float[4]) * NINE_MAX_CONST_ALL
08:21 mannerov: if it doesn't decrease fps, it is not limiting
Here is how the patch file looks:
From debd5413b1d14a28f26ae40c1d907df621044c8a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Timur=20Krist=C3=B3f?= <[email protected]>
Date: Fri, 18 Jan 2019 14:38:49 +0100
Subject: [PATCH] Change prepare_vs_constants_userbuf to always use the maximum
size.
---
src/gallium/state_trackers/nine/nine_state.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/gallium/state_trackers/nine/nine_state.c b/src/gallium/state_trackers/nine/nine_state.c
index bf39ebc9b4..209518ac2e 100644
--- a/src/gallium/state_trackers/nine/nine_state.c
+++ b/src/gallium/state_trackers/nine/nine_state.c
@@ -431,7 +431,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
struct pipe_constant_buffer cb;
cb.buffer = NULL;
cb.buffer_offset = 0;
- cb.buffer_size = context->vs->const_used_size;
+ cb.buffer_size = sizeof(float[4]) * NINE_MAX_CONST_ALL;
cb.user_buffer = context->vs_const_f;
if (context->swvp) {
--
2.20.1
When I run the game with this patch:
- I get worse frame rate than before, it's about 10 fps lower on the scene where I took the two screenshots above.
- VRAM usage is hugely worse than before, nearing 3 GB.
- There is a new hot spot in perf top.
Now the perf top output looks like this:
17,21% d3dadapter9.so.1.0.0 [.] si_decompress_textures
12,59% d3dadapter9.so.1.0.0 [.] si_set_constant_buffer
4,72% d3dadapter9.so.1.0.0 [.] nine_context_draw_indexed_primitive_priv
1,58% d3dadapter9.so.1.0.0 [.] amdgpu_do_add_real_buffer
1,31% dsound.dll.so [.] 0x000000000002997a
1,06% d3dadapter9.so.1.0.0 [.] amdgpu_cs_add_buffer
0,74% anonmap.qaDqCx (deleted) [.] 0x000000000023f415
0,65% d3dadapter9.so.1.0.0 [.] si_set_active_descriptors_for_shader
0,55% dsound.dll.so [.] 0x00000000000298e1
0,55% libc-2.28.so [.] __memmove_avx_unaligned_erms
0,54% d3dadapter9.so.1.0.0 [.] amdgpu_add_fence_dependencies_bo_list
0,49% d3dadapter9.so.1.0.0 [.] si_bind_vs_shader
0,49% [unknown] [.] 0000000000000000
0,45% d3dadapter9.so.1.0.0 [.] nine_update_state
0,44% d3dadapter9.so.1.0.0 [.] amdgpu_lookup_buffer
0,42% dsound.dll.so [.] 0x0000000000029971
Interesting how si_decompress_textures now appears in there. I also annotated it:
│ si_blit.c:780 ▒
1,71 │ 18: push %r15 ▒
44,26 │ push %r14 ▒
7,00 │ push %r13 ▒
30,18 │ push %r12 ▒
16,11 │ push %rbp ▒
0,02 │ push %rbx ▒
│ mov %rdi,%rbx ▒
0,01 │ sub $0x28,%rsp
@dhewg Is there anything in here that stands out to you? https://src.fedoraproject.org/rpms/mesa/blob/f29/f/mesa.spec#_365
I also tried the other suggestion by @axeldavy - here are the results.
23:43 mannerov: Venemo_j: another thing you could try is in si_pipe.c in the const_uploader = u_upload_create call to replace DEFAULT by STREAM
Here is what I experience when I run the game with this:
- In the same scene, the fps is increased to ~80
- GTT usage increased to ~400 MB
- VRAM usage is about ~1.5 GB
- GPU usage is reported between 85% - 99%
- I haven't noticed any visual glitch or rendering error thus far.
- Mesa has disappeared from the hot spots in perf top; now it's just amdgpu_mm_rreg that is still there.
- Now if I run the game on 1080p with this patch, then I get ~110 fps on the same scene with the same settings. However, if I set everything to Ultra then it's hardly 60 fps.
This is the patch:
From 7025c93aca84713c25b3a73ff9e2db91493f217c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Timur=20Krist=C3=B3f?= <[email protected]>
Date: Fri, 18 Jan 2019 14:38:49 +0100
Subject: [PATCH] Change si_create_context to use STREAM for const_uploader.
---
src/gallium/drivers/radeonsi/si_pipe.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 6b36893698..ce77a4fd4c 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -409,7 +409,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
goto fail;
sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
- 0, PIPE_USAGE_DEFAULT,
+ 0, PIPE_USAGE_STREAM,
SI_RESOURCE_FLAG_32BIT |
(sscreen->cpdma_prefetch_writes_memory ?
0 : SI_RESOURCE_FLAG_READ_ONLY));
--
2.20.1
perf top result:
12,26% [amdgpu] [k] amdgpu_mm_rreg
1,98% dsound.dll.so [.] 0x000000000002997a
0,92% d3dadapter9.so.1.0.0 [.] amdgpu_cs_add_buffer
0,81% dsound.dll.so [.] 0x00000000000298e1
0,78% libc-2.28.so [.] __memmove_avx_unaligned_erms
0,65% [kernel] [k] update_blocked_averages
0,64% anonmap.8QZ4Mi (deleted) [.] 0x0000000001c16777
0,63% dsound.dll.so [.] 0x0000000000029971
0,60% dsound.dll.so [.] 0x00000000000298c5
0,60% d3dadapter9.so.1.0.0 [.] amdgpu_lookup_buffer
0,57% d3dadapter9.so.1.0.0 [.] nine_context_set_texture
0,54% [kernel] [k] psi_task_change
0,50% d3dadapter9.so.1.0.0 [.] NineDevice9_SetTexture
OOOPS, looks like I forgot to revert the other patch before applying this one.
Will do another test with just the si_pipe patch.
Cannot reproduce here either * Xeon E3-1231 v3 @ 3.40GHz * R9 380X * Linux 4.19.13 * mesa master from ~2 weeks ago * wine 4.0~rc6 * sc2 set to 1080p with ultra gfx settings
@Venemo's setup involves Thunderbolt to run the card outside the laptop. This limits it to PCIE x4 speeds.
To recreate his setup you should at least move your video card to a PCIE x4 slot. Most MBs have only 1-2 PCIE x16 slots, and the rest are x4, even though they use the full x16 connector.
You may try it if this is not going to void your PC warranty.
Rebuilt mesa with the correct patch and updated my last comment.
@Venemo: No idea about rpm spec stuff, but nothing stands out I guess. I just tried against mesa packages from debian, and it looks like the fps is roughly the same on sc2
While the si_pipe patch does help a lot, it's not perfect. Even though Gallium says that the GPU utilization is 99%, radeontop disagrees and says the utilization is ~50% (up from the previous ~30%).
@dhewg, to reproduce the issue you need to move the video card to a PCIE x4 slot.
As I said in my previous comment, a lot of MBs have full-sized slots that are actually x4.
@iiv3 yeah, I get that it's a bottleneck that I don't run into. Just confirming that it's not a general issue
I ported the change to r600_pipe_common.c::706 and ran a benchmark.
Without the change l4d2 got 83,55 and 83,72fps.
After the change it got 84,27 and 83,92fps.
I'd say that it is a consistent improvement, even though 0.5% is around the margin of error.
I get better performance with the constant compacting code, even though it's still not 100% usage.
With vsync disabled in-game and thread_submit=true as well as tearfree_discard=true, my GPU load is maxed out. That is with upstream mesa, so none of those new patches.
I built the latest master (as of today) from the ixit repo and ran SC2 with the mesa built from there. The results are definitely an improvement, but not as amazing as the radeonsi patch that I tried earlier.
- Frame rate is ~50 fps in the same scene.
- GTT usage is ~60-70 MB
- VRAM usage is ~2-2.5 GB
- GPU load is ~45-50%
I also tried combining this with the patch that replaces all PIPE_USAGE_STREAM in buffer9.c and nine_buffer_upload.c with PIPE_USAGE_DEFAULT.
- No noticeable difference in frame rate (maybe a little higher)
- GTT usage is ~40-50 MB (20 MB lower)
- No noticeable difference in VRAM usage
- GPU load is slightly higher, around 50%
Investigated a bit further. I'm not 100% sure that the bottleneck is VS constants. I made the following changes:
- In si_pipe.c, in si_create_context, added SI_RESOURCE_FLAG_32BIT to stream_uploader, which allows it to receive shader constants
- In nine_state.c, in prepare_vs_constants_userbuf, changed the u_upload_alloc and u_upload_unmap calls to use the stream_uploader
This is a huge hack (the stream uploader is not supposed to handle shader constants), but it yields a huge improvement over everything else we've tried so far. It now reaches ~95 fps in the same scene. Note that this is even better than the version where I set all constants to use the stream uploader (which was ~80 fps).
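For reference, a rough sketch of what those two changes look like; the surrounding code and some identifiers are approximated, so treat this as illustrative rather than an applicable patch:
/* 1) si_pipe.c, si_create_context(): create stream_uploader with
 *    SI_RESOURCE_FLAG_32BIT so the buffers it hands out can be bound as
 *    shader constants (mirroring the const_uploader creation quoted earlier). */
sctx->b.stream_uploader = u_upload_create(&sctx->b, 1024 * 1024,
                                          0, PIPE_USAGE_STREAM,
                                          SI_RESOURCE_FLAG_32BIT);

/* 2) nine_state.c, prepare_vs_constants_userbuf(): route the per-draw VS
 *    constant upload through stream_uploader instead of const_uploader.
 *    (Variable names here are approximate.) */
u_upload_alloc(pipe->stream_uploader, 0, cb.buffer_size,
               device->constbuf_alignment,
               &cb.buffer_offset, &cb.buffer, (void **)&dst);
memcpy(dst, context->vs_const_f, cb.buffer_size);
u_upload_unmap(pipe->stream_uploader);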
Conclusion: something about the vertex shader constants in SC2 causes a bottleneck in nine when the PCI-E bandwidth is limited.
Can you try to switch const_uploader from DEFAULT to STREAM? If that's not enough, can you increase the size from 128 * 1024 to 1024 * 1024?
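A sketch of the combined change being asked for, based on the si_create_context() call quoted in the earlier patch (only the size and usage arguments differ from what is shown there):
/* Sketch: const_uploader with STREAM usage and a 1 MiB default size. */
sctx->b.const_uploader = u_upload_create(&sctx->b, 1024 * 1024,      /* was 128 * 1024 */
                                         0, PIPE_USAGE_STREAM,       /* was PIPE_USAGE_DEFAULT */
                                         SI_RESOURCE_FLAG_32BIT |
                                         (sscreen->cpdma_prefetch_writes_memory ?
                                              0 : SI_RESOURCE_FLAG_READ_ONLY));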