Mesa-3D Low frame rates on Starcraft 2 while CPU and GPU utilization is low

Long story short, I observed a scenario where I get low frame rates even though the CPU and GPU utilization was low. This is a continuation of this issue where we started discussing a performance problem with Starcraft 2 when running on Gallium Nine standalone.

Hardware details: the system is a Dell XPS 13 9370 with a 4-core / 8-thread Intel 8550U, a Sapphire AMD Radeon RX 570 ITX card is connected through Thunderbolt 3 (which is roughly equivalent to a x4 PCI-E connection), and a 4K Dell U2718Q screen is connected to the RX 570's DisplayPort output.

I didn't measure the frame rates in a scientific way, but the difference is very noticable:

On the same setup I easily get 60-80 fps on all high settings on Windows 10
Running Gallium Nine standalone on Fedora 29's Wine 3.21 on medium settings results in about 30-40 fps ― the curlpit seems to be the shaders, if I turn it down to low shaders then it gets above 60 fps.
If I run Wine 3.13 with the PBA patch from the espionage724 repo, then I get about 50-60 fps (again on medium settings)

Some observations from the other thread that are worth to mention here:

The frame rate doesn't seem to be related to resolution. Setting the system to use 1080p (using Gnome settings) results in similar numbers when using Nine.
According to GALLIUM_HUD CPU utilization is around 30% and GPU utilization is around 50%
According to perf top the majority of time is spent in functions like si_set_constant_buffer amdgpu_mm_rreg and NineDevice9_SetIndices from which I got the impression that some buffers are copied to/from the GPU and that may be a problem.

@axeldavy

On the first campain, after the videos, I get about 140 fps on full hd, everything maxed out on my radeon rx 480. The GPU load is about 90%. My 4 cpu threads are all around 60-70%.. I used GALLIUM_HUD=GPU-load,fps,cpu0+cpu1+cpu2+cpu3,shader-clock+memory-clock+temperature tearfree_discard=true WINEDEBUG=-all

That does sound awesome, and makes me think that the problem may be with my setup, or that maybe my setup triggers some sort of corner case within Nine. After thinking about it more, I've got three ideas:

Idea 1. Maybe there's something wrong with my kernel command line? Bascially what I do is disable all the spectre / meltdown mitigations, I enable the PowerPlay features in amdgpu and blacklist the i915 module.

resume=UUID=0dc28d3d-cf9a-4a1d-b980-e7f78ad7aaee rd.luks.uuid=luks-98f2c2d3-77e1-444a-b3f2-d3396b53e16e rhgb quiet mem_sleep_default=deep pti=off spectre_v2=off l1tf=off nospec_store_bypass_disable no_stf_barrier amdgpu.ppfeaturemask=0xffffffff i915.enable_guc=0 module_blacklist=i915 3

Idea 2. Maybe somehow Nine performs more IO through PCI-E than wined3d and the Thunderbolt 3 port is the bottleneck. Is it a possibility that there are some operations within Nine that are not noticable when using a PCI-E x16 connection but become problematic when running over PCI-E x4? I don't know how to verify this theory, but maybe there is a way to check the PCI-E port usage.

Idea 3. It occours to me that I installed DirectX 9 from winetricks in this Wine prefix before I started using Nine. Is it possible that this interferes with Nine somehow?

@iiv3

do you compile your own kernel? If so, you might have some additional tracing enabled.

No, I've got 4.19 from Fedora: 4.19.13-300.fc29.x86_64

If not... Could you try to disable "iommu=off", in case you are hitting some "soft" emulation that involves page faults.

Good idea, I will try that.

If this doesn't help either, try entering the "amdgpu_mm_rreg" in perf top and see which instructions are getting most heat.

Sure, I'll take a look, though I'm afraid I'm not an expert in perf.

Sorry for the long post!

Jan 13 '19 14:01 Venemo

Actually most motherboards have very few PCIE x16 slots, usually one or two. While there are might be more full sized PCIE slots, they are actually working as x4 ones. So maybe @axeldavy could try and test his video card on x4 PCIE slot.

@Venemo, You could try radeon top program, that shows utilization of specific GPU subsystems. It doesn't sound like the bottleneck is there, but it might give some useful hint.

You could also take a look what options are available for you in R600_DEBUG, despite its name it works on latest radeons too. Do an R600_DEBUG=help glxgears and you will get all available options (some drivers have more). Try everything that is not about dumping stuff to the console. (Try it one by one.)

Thunderbold3 definitely complicates things. I would add testing all working iommu types, in case you do need a hardware support for it.

I had a similar slowdown caused by the fact that my distribution had i586 optimized glibc, that used a memcpy() with hack for original Pentium. It would read from the destination memory, before overwriting it with real data. When used over my PCIE x16 it would get something like 25% CPU in perf top and 50 fps instead of 70fps (benchmark avg).

This reminds me... You had two separate perf top logs that showed si_set_constant_buffer and amdgpu_mm_rreg as top usage functions... but I could not quite understand what have you done differently to get one or the other result. It's not compilation or inline-ing, as one is from mesa3d and the other is from the kernel.

If you are compiling your own mesa3d try to use -O1 or -Og, it might give some hints. Actually compiling with debug support would be very good idea. When you run perf top you can select function with arrow keys. Press enter once, then press it second time to enter into Annotate. In that mode it will show you how hot each instructions from that functions are. If you have debug support it might show you even the source.

Also, I've tried to make a line that would show all graphs on the screen, try it in case it provides some hint.

export GALLIUM_HUD=\
".dfps+cpu+GPU-load+temperature,.dGTT-usage+VRAM-usage,num-compilations+num-shaders-created;"\
"primitives-generated+draw-calls,samples-passed+ps-invocations+vs-invocations,buffer-wait-time;"\
"CS-thread-busy+gallium-thread-busy,dma-calls+cp-dma-calls,num-bytes-moved;"\
"num-vs-flushes+num-ps-flushes+num-cs-flushes+num-CB-cache-flushes+num-DB-cache-flushes"

You can check if you have an extra graphs with GALLIUM_HUD=help glxgears and add them too.

Jan 15 '19 01:01 iiv3

@iiv3 Thanks for the suggestions! I'll try them.

I had a similar slowdown caused by the fact that my distribution had i586 optimized glibc, that used a memcpy() with hack for original Pentium. It would read from the destination memory, before overwriting it with real data.

What was the solution to this one? How do I check if this is the case?

You had two separate perf top logs that showed si_set_constant_buffer and amdgpu_mm_rreg as top usage functions... but I could not quite understand what have you done differently to get one or the other result.

I honestly don't know, it is possible that I have messed up something in that WINEPREFIX. I will re-install the game in a clean prefix and use that for further testing.

Jan 15 '19 07:01 Venemo

Every frame, vertices must be sent to the graphic card for the draw calls. There's also the draw commands sent by the driver. Good games don't upload anything else.

As mentionned already, the vertices for SC2 are stored in a permanently mapped buffer. The allocation is done in GTT, ie graphic memory in cpu ram. At draw time, the graphic card fetches the data there. The thing is if that was the limiting factor, the GPU-load would be higher: indeed the GPU is running shaders when fetching these vertice data, and it should count as GPU runnning. Still, you can try use VRAM for these allocations (the size of the mappable VRAM is restricted to 64MB, but it should be enough there). Replace all PIPE_USAGE_STREAM in buffer9.c and nine_buffer_upload.c to PIPE_USAGE_DEFAULT. Possibly in your case it should be faster as the data would make it to the GPU sooner.

As for the draw commands, there's not much to do.

Maybe as iive suggested you have wrong flags somewhere which leads to unoptimal functions somewhere.

Jan 15 '19 07:01 axeldavy

Okay, so I upgraded to Fedora 29, kernel 5.0-rc1 and wine staging 4.0-rc1. Then I deleted the wine prefix and reinstalled the game in a clean prefix.

If this doesn't help either, try entering the "amdgpu_mm_rreg" in perf top and see which instructions are getting most heat.

There is a mov instruction which takes ~96% of the time spent in that function.

You could try radeon top program, that shows utilization of specific GPU subsystems. It doesn't sound like the bottleneck is there, but it might give some useful hint.

Here is the output. I couldn't copy-paste the bars from the terminal, but the percentages are clearly visible.

                         Graphics pipe  39,17% │
───────────────────────────────────────────────┼─
                 Event          Engine   0,00% │
                                               │
           Vertex Grouper + Tesselator  10,00% │
                                               │
                     Texture Addresser  24,17% │
                                               │
                         Shader Export  31,67% │
           Sequencer Instruction Cache   8,33% │
                   Shader Interpolator  33,33% │
                                               │
                        Scan Converter  34,17% │
                    Primitive Assembly  10,83% │
                                               │
                           Depth Block  30,83% │
                           Color Block  30,00% │
                                               │
                    2815M / 3995M VRAM  70,44% │
                       73M / 4093M GTT   1,79% │
            1,75G / 1,75G Memory Clock 100,00% │
            1,24G / 1,24G Shader Clock 100,00% │

You could also take a look what options are available for you in R600_DEBUG

I took a look, and tried a few options but some of them caused a crash, while others caused noticable performance decrease, so I didn't get too far with it. Is there a specific option you want me to try?

I would add testing all working iommu types, in case you do need a hardware support for it.

With iommu=off my wifi card is not recognized, which is a problem. With iommu=on intel_iommu=on I get roughly +5 fps, and the "gallium thread busy" line from GALLIUM_HUD goes to zero. So I guess that is a good thing at least.

If you are compiling your own mesa3d try to use -O1 or -Og, it might give some hints. Actually compiling with debug support would be very good idea.

I just use the mesa that comes with Fedora. I'll also try the version from the "che" repo which is supposed to be a newer version with more optimizations enabled, will report back if it helps.

Also, I've tried to make a line that would show all graphs on the screen, try it in case it provides some hint.

Here is a screenshot from the graphs with intel_iommu=on:

Here is a graph

Jan 15 '19 14:01 Venemo

Oops, I accidentally hit "close and comment", sorry. Reopening it now.

Jan 15 '19 15:01 Venemo

performance

Jan 15 '19 21:01 axeldavy

@axeldavy Thanks! How come that the samples passed and the ps/vs invocations are so low on your machine? Also the primitives generated seem much less than mine.

Jan 15 '19 23:01 Venemo

It depends on the scene and resolution.

Jan 16 '19 07:01 axeldavy

Okay, I tried the patch suggested by @axeldavy and replaced PIPE_USAGE_STREAM to PIPE_USAGE_DEFAULT in the specified files. It definitely helps, but does not solve the problem.

Here are the graphs from the patched nine:

Graphs from patched nine

Hotspots from perf top:

  13,66%  d3dadapter9.so.1.0.0      [.] si_set_constant_buffer
   4,88%  anonmap.WTz65d (deleted)  [.] 0x000000000023f415
   2,69%  d3dadapter9.so.1.0.0      [.] NineDevice9_SetIndices
   2,57%  d3dadapter9.so.1.0.0      [.] amdgpu_do_add_real_buffer
   2,19%  d3dadapter9.so.1.0.0      [.] amdgpu_cs_add_buffer
   1,30%  dsound.dll.so             [.] 0x000000000002997a
   1,07%  anonmap.WTz65d (deleted)  [.] 0x000000000023f403
   0,63%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   0,59%  anonmap.WTz65d (deleted)  [.] 0x0000000001272ed6
   0,58%  dsound.dll.so             [.] 0x00000000000298e1

Here is the hot spot from the annotated si_set_constant_buffer:

       │     si_descriptors.c:1212                                                                                                                                                                                  ▒
  0,01 │       mov    (%rsi),%rax                                                                                                                                                                                   ▒
  0,00 │       lea    (%rax,%r14,8),%rdx                                                                                                                                                                            ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:139                                                                                                                                                 ▒
  0,07 │       mov    (%rdx),%rsi                                                                                                                                                                                   ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:79                                                                                                                                                  ▒
  0,01 │       cmp    %rsi,%r8                                                                                                                                                                                      ▒
       │     ↓ je     8c                                                                                                                                                                                            ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:87                                                                                                                                                  ▒
  0,00 │       test   %rsi,%rsi                                                                                                                                                                                     ▒
       │     ↓ je     8c                                                                                                                                                                                            ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:89                                                                                                                                                  ▒
 98,08 │       lock   subl   $0x1,(%rsi)                                                                                                                                                                            ▒
  1,16 │     ↓ jne    8c                                                                                                                                                                                            ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:146                                                                                                                                                 ▒
  0,00 │ 4d:   mov    0x20(%rsi),%rax                                                                                                                                                                               ▒
       │     ../../../../src/gallium/auxiliary/util/u_inlines.h:148                                                                                                                                                 ▒
       │       mov    0x28(%rsi),%rcx                                                                                                                                                                               ▒
       │       mov    %rdx,0x8(%rsp)

I also examined the same graph while running wined3d with the PBA patch:

Graphs from wined3d PBA

Hotspots according to perf top:

  16,67%  radeonsi_dri.so           [.] u_upload_alloc                                                                                                                                                              ◆
   3,12%  radeonsi_dri.so           [.] cso_set_vertex_buffers                                                                                                                                                      ▒
   2,27%  radeonsi_dri.so           [.] tc_call_draw_vbo                                                                                                                                                            ▒
   1,84%  radeonsi_dri.so           [.] amdgpu_cs_add_buffer                                                                                                                                                        ▒
   1,68%  libc-2.28.so              [.] __memmove_avx_unaligned_erms                                                                                                                                                ▒
   1,53%  radeonsi_dri.so           [.] amdgpu_do_add_real_buffer                                                                                                                                                   ▒
   1,25%  wined3d.dll.so            [.] 0x000000000005a390                                                                                                                                                          ▒
   1,20%  wined3d.dll.so            [.] 0x000000000005b5e4                                                                                                                                                          ▒
   1,17%  wined3d.dll.so            [.] 0x000000000005b870                                                                                                                                                          ▒
   0,90%  libpthread-2.28.so        [.] __pthread_mutex_lock                                                                                                                                                        ▒
   0,79%  ntdll.dll.so              [.] RtlEnterCriticalSection                                                                                                                                                     ▒
   0,73%  wined3d.dll.so            [.] 0x000000000005a3af                                                                                                                                                          ▒
   0,66%  wined3d.dll.so            [.] 0x000000000005a39a                                                                                                                                                          ▒
   0,64%  wined3d.dll.so            [.] 0x000000000005b52d                                                                                                                                                          ▒
   0,63%  wined3d.dll.so            [.] 0x000000000005b5d0                                                                                                                                                          ▒
   0,63%  wined3d.dll.so            [.] 0x000000000005b670                                                                                                                                                          ▒
   0,61%  wined3d.dll.so            [.] 0x000000000005b4a6                                                                                                                                                          ▒
   0,61%  ntdll.dll.so              [.] RtlLeaveCriticalSection                                                                                                                                                     ▒
   0,59%  [amdgpu]                  [k] amdgpu_mm_rreg                                                                                                                                                              ▒
   0,57%  wined3d.dll.so            [.] 0x000000000005aad0                                                                                                                                                          ▒
   0,57%  libpthread-2.28.so        [.] __pthread_mutex_unlock_usercnt                                                                                                                                              ▒
   0,53%  wined3d.dll.so            [.] 0x000000000005b5ed                                                                                                                                                          ▒
   0,51%  radeonsi_dri.so           [.] _mesa_unmarshal_dispatch_cmd                                                                                                                                                ▒
   0,50%  wined3d.dll.so            [.] 0x000000000005aabd

Which means that wined3d+PBA still beats nine by a dozen or so FPS. What stands out to me is:

wined3d-pba uses 10 times as much GTT, but only half the VRAM
nine uses significantly less CPU, but wined3d-pba manages to load the GPU about 10% more than nine
the ps/vs invocations are in the same ballpark, though wined3d-pba has somewhat less samples passed

Further notes:

Both screenshots were taken from approx. the same point in the same replay on the same graphical settings.
For the sake of this experiment I downgraded to Wine 3.13, because that is the version I could easily get for Fedora with the PBA patch included. I recompiled Nine standalone for this Wine version.
mesa 18.2.8 was patched, the patch was added to mesa.spec and the whole x86_64 mesa package were rebuilt, but I only installed the mesa-d3d packages from the build output. (I assume the change doesn't affect the rest of the packages.)

Where do we go from here?

Jan 16 '19 16:01 Venemo

Here is what lspci -vvv has to say about the GPU:

08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef) (prog-if 00 [VGA controller])
	Subsystem: Sapphire Technology Limited Device e343
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 128 bytes
	Interrupt: pin A routed to IRQ 156
	Region 0: Memory at 60000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at 70000000 (64-bit, prefetchable) [size=2M]
	Region 4: I/O ports at 2000 [size=256]
	Region 5: Memory at ac000000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at ac040000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (downgraded)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee006b8  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [200 v1] Resizable BAR <?>
	Capabilities: [270 v1] Secondary PCI Express <?>
	Capabilities: [2b0 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable-, Smallest Translation Unit: 00
	Capabilities: [2c0 v1] Page Request Interface (PRI)
		PRICtl: Enable- Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00000020, Page Request Allocation: 00000000
	Capabilities: [2d0 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 10
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [320 v1] Latency Tolerance Reporting
		Max snoop latency: 71680ns
		Max no snoop latency: 71680ns
	Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [370 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

Here is what amdgpu's sysfs has to say:

[root@timur-xps ~]# cat /sys/class/drm/card1/device/pp_dpm_pcie 
0: 2.5GT/s, x8 
1: 8.0GT/s, x16 *

Reducing the bandwidth using echo 0 > /sys/class/drm/card1/device/pp_dpm_pcie did not seem to have any effect on the frame rate, but I'm honestly not sure how reliable pp_dpm_pcie is. Or maybe the TB3 bottleneck is more severe than the restriction that it can impose.

Trying to determine the available bandwidth using clpeak is not possible due to the missing PCI-E atomics support:

kfd kfd: skipped device 1002:67df, PCI rejects atomics

However this egpu.io page states the following: Intel throttles 32Gbps-TB3 to 22Gbps which benchmarks as 20.40Gbps (AMD) or 18.91Gbps (Nvidia) in the important H2D direction.

Jan 17 '19 17:01 Venemo

I took a look at the hot spots for Nine in perf top and investigated a bit:

si_set_constant_buffer -> pipe_resource_reference -> pipe_reference_described -> p_atomic_dec_zero which is a call to either __atomic_sub_fetch or __sync_sub_and_fetch (preferring the __atomic version if available)

Looking at the other items like NineDevice9_SetIndices and amdgpu_do_add_real_buffer, all of them wait for a similar synchronization primitive.

Jan 17 '19 18:01 Venemo

Cannot reproduce here either

Xeon E3-1231 v3 @ 3.40GHz
R9 380X
Linux 4.19.13
mesa master from ~2 weeks ago
wine 4.0~rc6
sc2 set to 1080p with ultra gfx settings

without vsync: screenshot_2019-01-18_13-10-30 and with vsync: screenshot_2019-01-18_13-56-08

Jan 18 '19 13:01 dhewg

@dhewg @axeldavy Which distro do you guys use? Just asking because I might just try running that distro as an experiment. Maybe there is some package that is better optimized there than here, and maybe that makes the difference. Or maybe mesa itself is compiled with different flags?

Jan 18 '19 13:01 Venemo

I'm on Debian, @axeldavy on Arch iirc. I built mesa myself though, using CFLAGS="-march=native" CXXFLAGS="-march=native" meson ~/src/mesa --prefix /opt/andre/mesa -Ddri-drivers= -Dgallium-drivers=radeonsi,swrast -Dvulkan-drivers=amd -Dgallium-nine=true -Dosmesa=gallium

Jan 18 '19 13:01 dhewg

Okay, so @axeldavy Suggested an approach that helps find whether the constant data is the bottleneck or not:

08:20 mannerov: in nine_state.c 08:20 mannerov: prepare_vs_constants_userbuf 08:20 mannerov: when cb.buffer_size is set 08:21 mannerov: set it to the maximum value instead, that is sizeof(float[4]) * NINE_MAX_CONST_ALL 08:21 mannerov: if it doesn't decrease fps, it is not limiting

Here is how the patch file looks:

From debd5413b1d14a28f26ae40c1d907df621044c8a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Timur=20Krist=C3=B3f?= <[email protected]>
Date: Fri, 18 Jan 2019 14:38:49 +0100
Subject: [PATCH] Change prepare_vs_constants_userbuf to always use the maximum
 size.

---
 src/gallium/state_trackers/nine/nine_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/gallium/state_trackers/nine/nine_state.c b/src/gallium/state_trackers/nine/nine_state.c
index bf39ebc9b4..209518ac2e 100644
--- a/src/gallium/state_trackers/nine/nine_state.c
+++ b/src/gallium/state_trackers/nine/nine_state.c
@@ -431,7 +431,7 @@ prepare_vs_constants_userbuf(struct NineDevice9 *device)
     struct pipe_constant_buffer cb;
     cb.buffer = NULL;
     cb.buffer_offset = 0;
-    cb.buffer_size = context->vs->const_used_size;
+    cb.buffer_size = sizeof(float[4]) * NINE_MAX_CONST_ALL;
     cb.user_buffer = context->vs_const_f;
 
     if (context->swvp) {
-- 
2.20.1

When I run the game with this patch:

I get worse frame rate than before, it's about 10 fps lower on the scene where I took the two screenshots above.
VRAM usage is hugely worse than before, nearing 3 GB.
There is a new hot spot in perf top.

Now the perf top looks like this:

  17,21%  d3dadapter9.so.1.0.0      [.] si_decompress_textures
  12,59%  d3dadapter9.so.1.0.0      [.] si_set_constant_buffer
   4,72%  d3dadapter9.so.1.0.0      [.] nine_context_draw_indexed_primitive_priv
   1,58%  d3dadapter9.so.1.0.0      [.] amdgpu_do_add_real_buffer
   1,31%  dsound.dll.so             [.] 0x000000000002997a
   1,06%  d3dadapter9.so.1.0.0      [.] amdgpu_cs_add_buffer
   0,74%  anonmap.qaDqCx (deleted)  [.] 0x000000000023f415
   0,65%  d3dadapter9.so.1.0.0      [.] si_set_active_descriptors_for_shader
   0,55%  dsound.dll.so             [.] 0x00000000000298e1
   0,55%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   0,54%  d3dadapter9.so.1.0.0      [.] amdgpu_add_fence_dependencies_bo_list
   0,49%  d3dadapter9.so.1.0.0      [.] si_bind_vs_shader
   0,49%  [unknown]                 [.] 0000000000000000
   0,45%  d3dadapter9.so.1.0.0      [.] nine_update_state
   0,44%  d3dadapter9.so.1.0.0      [.] amdgpu_lookup_buffer
   0,42%  dsound.dll.so             [.] 0x0000000000029971

Interesting how si_decompress_textures now appears in there. I also annotated it:

       │     si_blit.c:780                                                                                                                                                                                          ▒
  1,71 │ 18:   push   %r15                                                                                                                                                                                          ▒
 44,26 │       push   %r14                                                                                                                                                                                          ▒
  7,00 │       push   %r13                                                                                                                                                                                          ▒
 30,18 │       push   %r12                                                                                                                                                                                          ▒
 16,11 │       push   %rbp                                                                                                                                                                                          ▒
  0,02 │       push   %rbx                                                                                                                                                                                          ▒
       │       mov    %rdi,%rbx                                                                                                                                                                                     ▒
  0,01 │       sub    $0x28,%rsp

Jan 18 '19 14:01 Venemo

@dhewg Is there anything in here that stands out to you? https://src.fedoraproject.org/rpms/mesa/blob/f29/f/mesa.spec#_365

Jan 18 '19 14:01 Venemo

I also tried the the other suggestion by @axeldavy - here are the results.

23:43 mannerov: Venemo_j: another thing you could try is in si_pipe.c in the const_uploader = u_upload_create call to replace DEFAULT by STREAM

Here is what I experience when I run the game with this:

In the same scene, the fps is increased to ~80
GTT usage increased to ~400 MB
VRAM usage is about ~1.5 GB
GPU usage is reported between 85% - 99%
I haven't noticed any visual glitch or rendering error thus far.
Mesa has disappeared from the hot spot in perf top. Now it's just the amdgpu_mm_rreg that is still there.
Now if I run the game on 1080p with this patch, then I get ~110 fps on the same scene with the same settings. However if I set everything to Ultra then it's hardly 60 fps.

This is the patch:

From 7025c93aca84713c25b3a73ff9e2db91493f217c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Timur=20Krist=C3=B3f?= <[email protected]>
Date: Fri, 18 Jan 2019 14:38:49 +0100
Subject: [PATCH] Change si_create_context to use STREAM for const_uploader.

---
 src/gallium/drivers/radeonsi/si_pipe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 6b36893698..ce77a4fd4c 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -409,7 +409,7 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen,
 		goto fail;
 
 	sctx->b.const_uploader = u_upload_create(&sctx->b, 128 * 1024,
-						   0, PIPE_USAGE_DEFAULT,
+						   0, PIPE_USAGE_STREAM,
 						   SI_RESOURCE_FLAG_32BIT |
 						   (sscreen->cpdma_prefetch_writes_memory ?
 							    0 : SI_RESOURCE_FLAG_READ_ONLY));
-- 
2.20.1

perf top result:

  12,26%  [amdgpu]                  [k] amdgpu_mm_rreg
   1,98%  dsound.dll.so             [.] 0x000000000002997a
   0,92%  d3dadapter9.so.1.0.0      [.] amdgpu_cs_add_buffer
   0,81%  dsound.dll.so             [.] 0x00000000000298e1
   0,78%  libc-2.28.so              [.] __memmove_avx_unaligned_erms
   0,65%  [kernel]                  [k] update_blocked_averages
   0,64%  anonmap.8QZ4Mi (deleted)  [.] 0x0000000001c16777
   0,63%  dsound.dll.so             [.] 0x0000000000029971
   0,60%  dsound.dll.so             [.] 0x00000000000298c5
   0,60%  d3dadapter9.so.1.0.0      [.] amdgpu_lookup_buffer
   0,57%  d3dadapter9.so.1.0.0      [.] nine_context_set_texture
   0,54%  [kernel]                  [k] psi_task_change
   0,50%  d3dadapter9.so.1.0.0      [.] NineDevice9_SetTexture

Jan 18 '19 15:01 Venemo

OOOPS, looks like I forgot to revert the other patch before applying this one.

Will do another test with just the si_pipe patch.

Jan 18 '19 15:01 Venemo

Cannot reproduce here either * Xeon E3-1231 v3 @ 3.40GHz * R9 380X * Linux 4.19.13 * mesa master from ~2 weeks ago * wine 4.0~rc6 * sc2 set to 1080p with ultra gfx settings

The @Venemo, setup involves thunderbolt to run the card outside the laptop. This limits it to pcie x4 speeds.

To recreate his setup you should at least move your video card to pcie x4 slot. Most MB have only 1-2 PCIE x16 slots, and the rest are x4, even though they use the full x16 connector.

You may try it if this is not going to void your PC warranty.

Jan 18 '19 22:01 iiv3

Rebuilt mesa with the correct patch and updated my last comment.

Jan 18 '19 23:01 Venemo

@Venemo: No idea about rpm spec stuff, but nothing stands out I guess. I just tried against mesa packages from debian, and it looks like the fps is roughly the same on sc2

Jan 19 '19 11:01 dhewg

While the si_pipe patch does help a lot, it's not perfect. Even though Gallium says that the GPU utilization is 99%, radeontop disagrees, and says the utilization is ~50% (up from the previous ~30%).

Jan 19 '19 11:01 Venemo

@dhewg , to reproduce the issue you need to move the video card to PCIE x4 slot.

As I said in my previous comment to you, a lot of MB have full sized slots that are x4.

Jan 19 '19 12:01 iiv3

@iiv3 yeah, I get that it's a bottleneck that I don't run into. Just confirming that it's not a general issue

Jan 19 '19 12:01 dhewg

I ported the change to r600_pipe_common.c::706 and run a benchmark. Without the change l4d2 got 83,55 and 83,72fps. After the change it got 84,27 and 83,92fps.

I'd say that it is a consistent improvement, even thought 0.5% is around the margin of error.

Jan 19 '19 13:01 iiv3

newscreenshot I get better performance with the constant compacting code, even though it's still not 100% usage.

Jan 23 '19 22:01 axeldavy

With vsync disabled in-game and thread_submit=true as well as tearfree_discard=true my gpu load is maxed out. That is with upstream mesa, so none of those new patches.

Jan 24 '19 07:01 dhewg

I built the latest master (as of today) from the ixit repo and ran SC2 with the mesa built from there. The results are definitely an improvement, but not as amazing as the radeonsi patch that I tried earlier.

Frame rate is ~50 fps in the same scene.
GTT usage is ~60-70 MB
VRAM usage is ~2-2.5 GB
GPU load is ~45-50%

I also tried combining this with the patch that replaces all PIPE_USAGE_STREAM in buffer9.c and nine_buffer_upload.c with PIPE_USAGE_DEFAULT.

No noticable difference in frame rate (maybe a little higher)
GTT usage is ~40-50 MB (20 MB lower)
No noticable difference in VRAM usage
GPU load is slightly higher around 50%

Jan 28 '19 17:01 Venemo

Investigated a bit further. I'm not 100% sure that the bottleneck is VS constants. I made the following changes:

si_pipe.c in si_create_context added the SI_RESOURCE_FLAG_32BIT to stream_uploader which allows it to receive shader constants
nine_state.c in prepare_vs_constants_userbuf changed the u_upload_alloc and u_upload_unmap calls to use the stream_uploader

This is a huge hack (the stream uploader is not supposed to handle shader constants), but it yields a huge improvement over everything else we've tried so far. It now reaches ~95 fps in the same scene. Note that this is even better than the version where I set all constants to use the stream uploader (which was ~80 fps).

Conclusion: something about the vertex shader constants in SC2 causes a bottleneck in nine when the PCI-E bandwidth is limited.

Jan 29 '19 10:01 Venemo

Can you try to switch const_uploader from DEFAULT to STREAM? If that's not enough, can you increase the size from 128 * 1024 to 1024 * 1024?

Jan 29 '19 20:01 marekolsak

Mesa-3D Mesa-3D copied to clipboard

Low frame rates on Starcraft 2 while CPU and GPU utilization is low

Here are the graphs from the patched nine:

I also examined the same graph while running wined3d with the PBA patch:

Mesa-3D
Mesa-3D copied to clipboard