xemu
Z24S8 surface upload/download slow on mesa radeonsi
Profiling Azurik: Rise of Perathia under Linux with an AMD GPU shows a very large amount of time spent in the function _mesa_texstore_s8_z24, called when a Z24S8 surface is updated with glTexImage2D/glTexSubImage2D.
AMD GPUs don't support Z24S8 formats directly in hardware, so the OpenGL driver converts them on the CPU. This is causing slowdowns to 1-2 fps on my Ryzen 7 3700U.
Possible solutions:
- Be more aggressive in trying to skip Depth/Stencil download and uploads.
- Use some form of shader to convert Z24S8 to a hw-supported format instead of letting the gl driver convert the format on CPU
OR
- Copy a different format to RAM and hope for the best (probably won't work)
Other games affected
- Baldur's Gate: Dark Alliance II
- Legacy of Kain: Defiance
Notes on the possible shader solution:
CPU->GPU, upload as 32-bit uint texture then render to FBO with depthstencil attachment.
Writing stencil values: https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_stencil_export.txt
This extension is supported on mesa. Would need to add codepath to disable this shader if not supported.
Depth values write to gl_FragDepth
GPU->CPU, need to render to 32-bit uint texture then download.
Use depth texture + stencil view tex (seems to require OpenGL 4.4): https://stackoverflow.com/questions/27535727/opengl-create-a-depth-stencil-texture-for-reading
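The fallback codepath mentioned above (disabling the shader when stencil export isn't supported) could look roughly like the sketch below. It is written in Python for brevity and is not xemu code: `choose_upload_path` and the returned path names are hypothetical, and the extension list is assumed to have been queried from the driver already.

```python
# Hypothetical sketch: choose how to upload a Z24S8 surface based on the
# extensions the driver reports. Function and path names are illustrative only.
STENCIL_EXPORT_EXTS = (
    "GL_ARB_shader_stencil_export",
    "GL_AMD_shader_stencil_export",
)


def choose_upload_path(extensions):
    """Return "shader" if a stencil export extension is present, else "driver"."""
    available = set(extensions)
    if any(ext in available for ext in STENCIL_EXPORT_EXTS):
        # Upload as a 32-bit uint texture and convert in a fragment shader.
        return "shader"
    # No way to write stencil from a shader: let the GL driver convert.
    return "driver"


print(choose_upload_path(["GL_ARB_shader_stencil_export"]))
print(choose_upload_path(["GL_ARB_texture_view"]))
```

The "driver" path keeps the existing glTexImage2D/glTexSubImage2D behaviour, so hardware without the extension would be no worse off than today.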
@CallumDev This isn't only the case on AMD hardware; this storage format conversion also happens on Nvidia and others, and it is also very expensive. It definitely needs to be made faster, because in some cases synchronization cannot be avoided. It's been on my radar for a while, but if you would like to explore and work on this you are welcome to; otherwise I will get to it eventually.
More notes:
- For mesa, downloading the surface seems to be just as quick as downloading any other, upload is the problem.
I've had success uploading stencil data by setting the GL state so the stencil op is GL_REPLACE, GL_REPLACE, GL_REPLACE, disabling the color mask, and using this shader with a full screen quad to sample from an R32UI texture (min and mag filters must also be set to GL_NEAREST). gl_FragDepth is untested; that's just a guess at this point. As for integrating into xemu, I'm concerned about inadvertently trampling all the GL state.
GL_ARB_shader_stencil_export is also required to support this, and it only seems to be supported on AMD and Intel, not Nvidia. However, with reports of Azurik running OK on Nvidia, perhaps the performance hit is much less for them?
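#version 440
#extension GL_ARB_shader_stencil_export : require

in vec2 uv;
// R32UI texture holding the packed Z24S8 words (GL_NEAREST filtering)
uniform usampler2D depthstencil_tex;

void main(void) {
    uint sval = texture(depthstencil_tex, uv).r;
    // Low 8 bits: stencil value
    gl_FragStencilRefARB = int(sval & 0xFFu);
    // High 24 bits: depth, normalized to [0, 1] (untested)
    gl_FragDepth = float(sval >> 8) / 16777215.0;
}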
Using the shader to upload avoids the huge FPS drop (6 fps for one 1024x768 surface in my test case).
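For checking the shader's output against known values, here is a CPU-side reference of the same bit layout (stencil in the low 8 bits, depth in the high 24 bits). This is a minimal Python sketch and not part of xemu; `unpack_z24s8` and `pack_z24s8` are hypothetical helper names. Python uses double precision, whereas the shader works in 32-bit float, so the shader's depth may differ in the last bit.

```python
# Hypothetical CPU reference for the Z24S8 layout used by the shader above.
DEPTH_MAX = 0xFFFFFF  # 16777215, the largest 24-bit depth value


def unpack_z24s8(word):
    """Split a packed 32-bit word into (normalized depth, stencil)."""
    stencil = word & 0xFF
    depth = (word >> 8) / DEPTH_MAX
    return depth, stencil


def pack_z24s8(depth, stencil):
    """Inverse of unpack_z24s8, as needed for the GPU->CPU direction."""
    depth_bits = round(depth * DEPTH_MAX) & DEPTH_MAX
    return (depth_bits << 8) | (stencil & 0xFF)


depth, stencil = unpack_z24s8(0xFFFFFF80)
print(depth, stencil)
```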
Also affects: https://xemu.app/titles/43430002/#Steel-Battalion https://xemu.app/titles/43430009/#Steel-Battalion-Line-of-Contact https://xemu.app/titles/41560009/#Rally-Fusion-Race-of-Champions
EDIT: As for the Nvidia comment, I can confirm that there is definitely a negative performance impact on Nvidia; it's not nearly as bad as on AMD, but still bad. Azurik, Steel Battalion, and Rally Fusion run worse on AMD than on Nvidia, but they still don't run nearly as well as on real hardware with either vendor, on either Windows or Linux.
In the clear case, we can be smarter about not uploading if we are about to do a full surface clear
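A rough sketch of that check, in Python rather than xemu's own code: skip the upload when a pending clear covers the whole surface and clears both depth and stencil. `should_upload` and its parameters are hypothetical names used only for illustration.

```python
# Hypothetical sketch: decide whether a dirty depth/stencil surface needs to be
# uploaded before the next draw. Rects are (x, y, width, height).
def clear_covers_surface(clear_rect, surface_size):
    x, y, w, h = clear_rect
    surf_w, surf_h = surface_size
    return x <= 0 and y <= 0 and x + w >= surf_w and y + h >= surf_h


def should_upload(dirty, pending_clear_rect, surface_size,
                  clears_depth=True, clears_stencil=True):
    if not dirty:
        return False  # nothing changed in guest memory
    if pending_clear_rect is None:
        return True   # no clear queued, contents are needed
    full_clear = clear_covers_surface(pending_clear_rect, surface_size)
    # Only safe to skip when every texel of both aspects gets overwritten.
    return not (full_clear and clears_depth and clears_stencil)


print(should_upload(True, (0, 0, 1024, 768), (1024, 768)))
```

A partial clear, or one that only touches depth or only stencil, still needs the upload, since the remaining contents would otherwise be lost.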
Issue present in: https://xemu.app/titles/45410042/#007-Everything-or-Nothing
This issue may be resolved by https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18484?commit_id=4da147a02b541311e8dc231b30dd36fafea820ff
TODO: Test when this makes it into a stable mesa release (22.3)
Wonder if this might also help #777 ?
I suppose one of us could try building with this commit and see if it resolves it. I might mess with it Monday. It would definitely be great if it does, because the number of games affected by this issue has increased over time.
I can confirm with 22.3-devel that the performance of at least Azurik: Rise of Perathia is much improved (previously it was <1fps). In less complex areas it will hit 30fps.
CPU: 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz
OS_Version: Fedora Linux 36 (Workstation Edition)
GL_VENDOR: Intel
GL_RENDERER: Mesa Intel(R) Xe Graphics (TGL GT2)
GL_VERSION: 4.6 (Core Profile) Mesa 22.3.0-devel
GL_SHADING_LANGUAGE_VERSION: 4.60
On Mesa 22.3.0-devel, Halo 2 splitscreen is no longer 1 fps on my machine. However, the graphics are messed up, so split screen is not playable as yet. See #1237
Built mesa 22.0.5 with "speed up glTexImage" patch on ubuntu 22.04. I tried Panzer Dragoon Orta, which before ran below 10fps, and now it is almost locked 60fps in my xeon e5-1270 v3 system. Very impressive stuff.
I just tested Panzer Dragoon Orta with mesa 22.3.0-1 on arch and the speedup is crazy good, just as reported by @resadent. I suggest closing this and #777 as well.
This shouldn't be closed until mesa-22.3.1 is in release since that's considered the first stable release.
EDIT: It doesn't fix intel or nvidia's issues either.
Issue is still present in Steel Battalion: Line of Contact even with the latest mesa-22.3.2. I think this issue should remain open until all affected games are confirmed working. It's possible the issue with LoC is no longer related to this issue, but I don't have time to profile it right now.
EDIT: Also, as of now (Jan 21st) with the latest Mesa HEAD, the issue is still present, and Panzer Dragoon still has major slowdowns, dropping all the way down to 7 FPS.