BufferUtils: Optimize upload_untouched_skip_restart with AVX-512 paths

Open Whatcookie opened this issue 9 months ago • 6 comments

Uses vpcompress to vectorize this otherwise unvectorizable loop. The u16 path needs AVX-512-ICL because vpcompressw isn't included in Skylake-X level AVX-512. The u32 path is untested, as I couldn't find any games that hit it.

We use vpcompress register to register, rather than directly to memory, since there's a bug with vpcompress to memory on Zen 4 that makes it exceedingly slow. In the future, we could detect this and emit the optimal instructions in the JIT instead, but the code is already so fast that it might not be worth the effort.
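As a rough illustration of the register-to-register approach described above, here is a minimal sketch for the u32 case. It is not the code from this PR: the function names are hypothetical, it ignores byte swapping and min/max tracking, and it selects the AVX-512 path at compile time (`__AVX512F__`) rather than at runtime. The compress result stays in a register and is written out with an ordinary masked store instead of a compress-to-memory instruction.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical scalar reference: copy indices, dropping the restart index.
std::size_t copy_skip_restart_scalar(const std::uint32_t* src, std::size_t count,
                                     std::uint32_t restart, std::uint32_t* dst)
{
    std::size_t written = 0;
    for (std::size_t i = 0; i < count; i++)
    {
        if (src[i] != restart)
            dst[written++] = src[i];
    }
    return written;
}

// Hypothetical vectorized variant; falls back to the scalar loop when the
// translation unit is not built with AVX-512F enabled.
std::size_t copy_skip_restart(const std::uint32_t* src, std::size_t count,
                              std::uint32_t restart, std::uint32_t* dst)
{
    std::size_t written = 0;
    std::size_t i = 0;
#if defined(__AVX512F__)
    const __m512i restart_v = _mm512_set1_epi32(static_cast<int>(restart));
    for (; i + 16 <= count; i += 16)
    {
        const __m512i v = _mm512_loadu_si512(src + i);
        // Bit k is set when lane k is NOT the restart index.
        const __mmask16 keep = _mm512_cmpneq_epi32_mask(v, restart_v);
        // Register-to-register compress: kept lanes are packed to the front.
        const __m512i packed = _mm512_maskz_compress_epi32(keep, v);
        const unsigned n = static_cast<unsigned>(std::bitset<16>(keep).count());
        // Store only the n packed lanes with a plain masked store.
        const __mmask16 store_mask = static_cast<__mmask16>((1u << n) - 1u);
        _mm512_mask_storeu_epi32(dst + written, store_mask, packed);
        written += n;
    }
#endif
    // Scalar tail (or the whole range when AVX-512F is unavailable).
    for (; i < count; i++)
    {
        if (src[i] != restart)
            dst[written++] = src[i];
    }
    return written;
}
```

Both functions return the number of indices written, so the scalar version can double as a validation reference for the vector path.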

The code is overall nearly 10x faster than the scalar version on my zen4 machine.

[Images: two pairs of before/after screenshots]

Whatcookie avatar Mar 26 '25 22:03 Whatcookie

I tried NieR Replicant (Mailbox and Fountain), Minecraft (Menu and Tutorial), and Diva F 2nd (Menu), and saw no performance difference in any of these cases on a 9800X3D + 6800XT.

AniLeo avatar Mar 27 '25 00:03 AniLeo

All these paths have turned this module, which was meant to be a simple utils wrapper, into an unmaintainable mess. The function names are also getting messy as Intel adds more and more weird levels to the ISA.

Let's do this instead:

  1. Move the different feature levels to separate files, leaving the generic implementation here.
  2. We expose a dispatch table for each feature level and pick the one to use at startup using a static lambda initializer. One thing to watch out for: Arm64 is actually SSE4.2 compatible (including SSSE3) when using sse2neon. The generic path is only required for validation and as a reference for future architectures.
  3. When Intel releases avx9000 or whatever, we just create a file for that feature set and don't keep adding to this file.

It's a lot more work but that's just how it is when you need to maintain a project.
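The dispatch-table idea in the list above could look roughly like the following sketch. Every name here (`dispatch_table`, `active_table`, the `generic` and `sse42` namespaces, `cpu_has_sse42`) is hypothetical and not taken from the RPCS3 source. In the proposed layout each feature level would live in its own file; they are shown together only to keep the example self-contained, and the `sse42` entry simply forwards to the generic code as a placeholder.

```cpp
#include <cstddef>
#include <cstdint>

using copy_fn = std::size_t (*)(const std::uint32_t*, std::size_t,
                                std::uint32_t, std::uint32_t*);

// One table per feature level; more function pointers would be added as needed.
struct dispatch_table
{
    const char* name;
    copy_fn copy_skip_restart;
};

namespace generic // reference implementation, kept for validation
{
    std::size_t copy_skip_restart(const std::uint32_t* src, std::size_t count,
                                  std::uint32_t restart, std::uint32_t* dst)
    {
        std::size_t written = 0;
        for (std::size_t i = 0; i < count; i++)
        {
            if (src[i] != restart)
                dst[written++] = src[i];
        }
        return written;
    }

    const dispatch_table table{"generic", &copy_skip_restart};
}

namespace sse42 // would live in its own file; placeholder forwards to generic
{
    std::size_t copy_skip_restart(const std::uint32_t* src, std::size_t count,
                                  std::uint32_t restart, std::uint32_t* dst)
    {
        return generic::copy_skip_restart(src, count, restart, dst);
    }

    const dispatch_table table{"sse4.2", &copy_skip_restart};
}

// Assumption: GCC/Clang builtins on x86; every other target reports false.
bool cpu_has_sse42()
{
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.2") != 0;
#else
    return false;
#endif
}

// The table is picked once, on first use, by a static lambda initializer.
const dispatch_table& active_table()
{
    static const dispatch_table& table = []() -> const dispatch_table& {
        if (cpu_has_sse42())
            return sse42::table;
        return generic::table;
    }();
    return table;
}
```

Callers would then go through `active_table().copy_skip_restart(...)`, and supporting a new feature level means adding one file with its own table plus one branch in the initializer.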

> All these paths have turned this module, which was meant to be a simple utils wrapper, into an unmaintainable mess. The function names are also getting messy as Intel adds more and more weird levels to the ISA.
>
> Let's do this instead:
>
>   1. Move the different feature levels to separate files, leaving the generic implementation here.
>   2. We expose a dispatch table for each feature level and pick the one to use at startup using a static lambda initializer. One thing to watch out for: Arm64 is actually SSE4.2 compatible (including SSSE3) when using sse2neon. The generic path is only required for validation and as a reference for future architectures.
>   3. When Intel releases avx9000 or whatever, we just create a file for that feature set and don't keep adding to this file.
>
> It's a lot more work but that's just how it is when you need to maintain a project.

I think I need to recover the old SSE4.1 paths, since neko removed them in favor of emitting x86 instructions directly in the JIT, which won't work on ARM.

Whatcookie avatar Mar 29 '25 00:03 Whatcookie

The JIT asm backend emits (or tries to emit) different instructions based on the hardware. It is supposed to be platform agnostic.

kd-11 avatar Mar 29 '25 00:03 kd-11

> The JIT asm backend emits (or tries to emit) different instructions based on the hardware. It is supposed to be platform agnostic.

They're guarded by x86_64 ifdefs in this file, aren't they?

Whatcookie avatar Mar 29 '25 00:03 Whatcookie

@Whatcookie needs a rebase

Megamouse avatar May 12 '25 20:05 Megamouse

It seems this PR, which provides a good optimization, is blocked mainly by the requested code reorganization. Hopefully the requested changes can be applied soon so that this PR can be considered for merging.

digant73 avatar Nov 14 '25 14:11 digant73