Implement memexport.

Open benvanik opened this issue 9 years ago • 13 comments

Boom Boom Rocket sets up export.

benvanik avatar Feb 11 '15 05:02 benvanik

Lumines Live

benvanik avatar Feb 11 '15 18:02 benvanik

This is going to be difficult - as memexport writes straight to main memory. My guess is (at least some) games use memexport as transform feedback.

...
/*   84   */          mad eA, r0.xxxx, c5.xyxx, c255
/*   85   */          max eM0, r1.xxxx, r1.xxxx
/*   86   */          max eM1, r1.yyyy, r1.yyyy
... up to eM4

eA.x = main memory address >> 2 (uint32 aliased as float)
eA.yzw = ???

Writes to eA are restricted to mad only. Perhaps they're using some weird tricks?

DrChat avatar May 04 '18 19:05 DrChat

Just to keep this info somewhere in a more persistent place than Discord:

The constant multiplicand, according to the usage in Halo 3 and Banjo-Kazooie: Nuts & Bolts and to Advanced Screenspace Antialiasing, is (0.0, 1.0, 0.0, 0.0).

eA.x, as uint, is physical address in dwords | 0x40000000 — checked by comparing the register value and the index buffer pointer in draw calls using tessellation (the index buffer contains per-edge tessellation factors as float32 rather than indices in this case).

eA.y appears to be the offset in dwords | 0x4B000000 (unless there is some element stride and that's an element offset; I still don't know, but that's not very likely, especially considering that the slide from Advanced Screenspace Antialiasing calculates the offset from IntOffsets). Adding an integer converted to a float to 2.0^23 puts it in the low mantissa bits; that's how mad is used to write an integer using the floating-point ALU.

eA.z is something unknown, possibly some flags. In the Halo 3 tessellation edge factor calculation shader, it's 0x4B07E4FA, in Banjo-Kazooie: Nuts & Bolts for the shader with the same purpose, it's 0x4B07E46A.

eA.w is buffer size in dwords | 0x4B000000.
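A minimal sketch of the mad trick described above, emulating float32 with Python's struct module. The helper names (mad_ea, decode_ea) and the sample stream constant are made up for illustration; only the (0.0, 1.0, 0.0, 0.0) multiplicand and the 0x40000000 / 0x4B000000 tags come from the observations in this comment.

```python
import struct

ADDRESS_MASK = 0x3FFFFFFF  # eA.x with the 0x40000000 tag removed
LOW_23_BITS = 0x007FFFFF   # mantissa bits that carry the integer

def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def float_to_bits(value):
    """Round to float32 and return its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def mad_ea(index, stream_constant_bits):
    """Emulate `mad eA, r.xxxx, (0, 1, 0, 0), stream_constant`."""
    multiplicand = (0.0, 1.0, 0.0, 0.0)
    return tuple(
        float_to_bits(float(index) * m + bits_to_float(c))
        for m, c in zip(multiplicand, stream_constant_bits)
    )

def decode_ea(ea_bits):
    """Split eA into the fields guessed in this thread (hypothetical names)."""
    x, y, _z, w = ea_bits
    return {
        "base_dwords": x & ADDRESS_MASK,
        "offset": y & LOW_23_BITS,
        "size": w & LOW_23_BITS,
    }

# Made-up stream constant: base 0x1000 dwords, offset 0x10, size 0x100.
constant = (0x40001000, 0x4B000010, 0x4B072602, 0x4B000100)
ea = mad_ea(5, constant)
```

Only the Y component changes: the index is added to 2.0^23 plus the base offset, so it lands in the low mantissa bits, while X, Z and W pass through because they are multiplied by 0.0.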

Triang3l avatar Dec 17 '18 10:12 Triang3l

It does also pack things, and apparently not only to 32 bits, but to larger vectors also. And in this case, the size in W is in vectors, not in dwords (and the offset in Y probably too) — in the shader from Halo 3 menu, the scale of W for the same buffer depends on the format.

Here are stream constant Z values from some shader from the menu of Halo 3:

  • 0x4B072602 — 32_32_32_32_FLOAT
  • 0x4B072001 — 16_16_16_16_FLOAT
  • 0x4B001A01 — 16_16_16_16
  • 0x4B071F01 — 16_16_FLOAT
  • 0x4B001901 — 16_16
  • 0x4B010702 — 2_10_10_10 (signed)
  • 0x4B080602 — 8_8_8_8

What we can see about the bits:

  • 0:1 (or maybe 0:2, but maybe not) — endianness.
  • 8:13 — format.
  • 16 — signedness.
  • 17 or 18 — fractional/integer.

alloc export = 1 or = 2 depends on some "tile alignment" according to https://forum.beyond3d.com/threads/geometry-shader-whats-the-difference-from-vs.24072/

Triang3l avatar Dec 18 '18 21:12 Triang3l

According to the PDB from Call of Duty 4 alpha, Z of GPU_MEMEXPORT_STREAM_CONSTANT consists of:

  • 0:2 — GPUENDIAN128 EndianSwap
  • 8:13 — GPUCOLORFORMAT Format
  • 16:18 — GPUSURFACENUMBER NumericType (0 = UREPEAT, 1 = SREPEAT, 2 = UINTEGER, 3 = SINTEGER, 7 = FLOAT)
  • 19 — GPUSURFACESWAP ComponentSwap (0 = LOW_RED, 1 = LOW_BLUE)
  • 20:31 — 4B0

Doesn't explain, however, why there is some data in bits 3:7 and 14:15 in the shaders for edge tessellation factor calculation in Halo 3 and Banjo-Kazooie: Nuts & Bolts, possibly leftovers on the stack.
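As an illustration, the bit layout listed above can be applied to the Z values quoted earlier in the thread. This is only a sketch: the function and dictionary key names are made up, and the 0xC0F8 mask simply selects the unexplained bits 3:7 and 14:15.

```python
NUMERIC_TYPES = {
    0: "UREPEAT",
    1: "SREPEAT",
    2: "UINTEGER",
    3: "SINTEGER",
    7: "FLOAT",
}

def decode_stream_constant_z(z):
    """Decode Z of the stream constant using the field layout above."""
    return {
        "endian_swap": z & 0x7,                  # bits 0:2
        "format": (z >> 8) & 0x3F,               # bits 8:13
        "numeric_type": NUMERIC_TYPES.get((z >> 16) & 0x7, "unknown"),
        "component_swap": "LOW_BLUE" if (z >> 19) & 1 else "LOW_RED",
        "upper_bits": z >> 20,                   # bits 20:31, expected 0x4B0
        "unexplained_bits": z & 0xC0F8,          # bits 3:7 and 14:15
    }

menu_float4 = decode_stream_constant_z(0x4B072602)   # 32_32_32_32_FLOAT
menu_rgba8 = decode_stream_constant_z(0x4B080602)    # 8_8_8_8
halo3_tess = decode_stream_constant_z(0x4B07E4FA)
banjo_tess = decode_stream_constant_z(0x4B07E46A)
```

For the Halo 3 menu values the unexplained bits come out as zero, while both tessellation shaders have them set.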

Triang3l avatar Dec 19 '18 14:12 Triang3l

This is still a meme xport

RonDaBlue avatar Dec 19 '18 14:12 RonDaBlue

Differences between how the engines of both games use the 360 GPU resources to store and write data might be to blame for this behavior.

tetration avatar Dec 19 '18 15:12 tetration

Only two bits were different between Halo 3 and BKN&B though, 0x4B07E4FA in Halo 3 and 0x4B07E46A in BKN&B. But this isn't relevant anymore, I think. The info we have seems to be enough for implementing.

I'm not, however, sure what happens if you export not all components of a single vector, and whether the destination formats can include formats smaller than 32bpp, but I've never seen that happening so far. Those would have to be taken into consideration because in this case loading of the existing value in the RWByteAddressBuffer, shifting and masking would have to be done… ewww…

Triang3l avatar Dec 19 '18 15:12 Triang3l

Mostly done in the Direct3D 12 backend. Only seen Halo: Reach (IIRC) exporting data in the 8_8_8_8_A format that we don't support yet. If sub-32bpp formats are ever encountered, two R16 and four R8 UAVs (because of Nvidia's 128 megatexels limit for buffers), or just the latter, will need to be added.

Triang3l avatar Jun 01 '20 22:06 Triang3l

Split/Second uses memexport to k_8. For this purpose, due to Nvidia's 128M-texel limitation (I'm not yet sure whether it applies to UAVs or only to SRVs, will check, but overall it's better to have some safety measures), we'll need to bind 4 shared memory DXGI_FORMAT_R8_UINT texel UAVs in addition to the RWByteAddressBuffer (using 32-bit accesses with masking won't work as there will be a race between adjacent texels), and write 8-bit and 16-bit data through those views. On Vulkan, they will have to be storage texel buffers (as opposed to just storage buffers), and they'll count towards the per-stage limit of storage images (not storage buffers). There is also another limit, the maximum size of a storage texel buffer, which can be as low as 65536, so we'll need some lower bound (likely 128M) for allowing sub-32bpp memexport on Vulkan.

Triang3l avatar Jun 18 '21 20:06 Triang3l

RWByteAddressBuffer can likely be used with atomic_and to erase the byte/word to overwrite, and atomic_or to write the new data, instead of binding more UAVs (which is also problematic because on feature level 11_0 and resource binding tier 1 hardware, the number of UAVs is limited to 8 across all stages — while with 4 more UAVs the total count will be 10 if memexport is used in both VS and PS, or 11 with the EDRAM ROV).

Note that an imm_atomic_cmp_exch loop probably should not be used for this purpose: it may become infinite if different threads try to write different values to the same byte/word, or in case of out-of-bounds or unmapped tile access (in which case the previous value will likely always be 0).
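To make the and/or idea concrete, here is a small sketch of the mask arithmetic, with a Python list standing in for the RWByteAddressBuffer. The function name is hypothetical, the two in-place operations merely stand in for the atomic and/or, and little-endian byte order within a dword is assumed.

```python
def sub_dword_write(memory, byte_address, value, size_bytes):
    """Write an 8-bit or 16-bit value into a list of 32-bit dwords.

    Returns the (and_mask, or_value) pair that the two atomics would use.
    Assumes little-endian byte order within each dword.
    """
    assert size_bytes in (1, 2) and byte_address % size_bytes == 0
    shift = (byte_address & 3) * 8
    field = (1 << (size_bytes * 8)) - 1
    and_mask = ~(field << shift) & 0xFFFFFFFF
    or_value = (value & field) << shift
    dword_index = byte_address >> 2
    memory[dword_index] &= and_mask  # stands in for the atomic and (clear)
    memory[dword_index] |= or_value  # stands in for the atomic or (write)
    return and_mask, or_value

memory = [0x11223344, 0xAABBCCDD]
byte_masks = sub_dword_write(memory, 5, 0xEE, 1)
word_masks = sub_dword_write(memory, 2, 0x1234, 2)
```

Adjacent bytes in the same dword are never touched by the masks, which is what avoids the race between neighbouring texels. The and/or pair is not atomic as a whole, though, so two threads writing different values to the same byte could presumably leave the OR of both.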

Triang3l avatar Jun 07 '22 13:06 Triang3l

From the preliminary R400 sequencer specification version 2.11 from IPR2015-00325:

alloc-mem-export - proceeds any memory-address, memory-data exports. There can be multiple alloc-mem-export statements in either kind of shader. All exports for mem-exports must execute between the corresponding alloc-mem-export and the next yield point or resource change.

Also, when doing a pass thru export, the shader must still do either a position and PC export (if Vertex) or a color export (if Pixel). The pass thru export can occur anywhere in any shader program and thus can be used to debug. There can be any number of pass thru export blocks throughout the pixel or vertex shader or both.

It's clearly not correct that we're currently doing all those exports at the end of the shaders (so that any kill in the shader discards absolutely all the exports).

"Yield point", if I understand correctly, means serialize there.

But given that:

When exporting to more than EM0, one MUST write to EM4 also (the write may be predicated if you don’t need the export).

(note that this is outdated as it's from an early specification, games don't do that EM4 write in reality)

Since whether an export happens or not depends on predication, I think it may be safe to assume that the writes on the host may even be performed directly by the ALU instruction doing the export when emulating it? As long as eM# is never written before eA — is that the case in any game?

Another part that we aren't taking into account currently:

The SX will check for invalid writes and mask out the data so it won’t be written to memory. Invalid writes are:

  1. Index value >= Max Index value
  2. bit 31 != 0 (negative index)
  3. bits [30:23] != 23 + IEEE_EXP_BIAS (127) (meaning the index was too big to be represented using 23 bits)
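The three checks above can be sketched as follows. The function name is made up, and treating "Index value" as the low 23 mantissa bits of the float-encoded index, with the maximum passed in as a plain integer, is an assumption based on the earlier eA.y / eA.w observations.

```python
IEEE_EXP_BIAS = 127
EXPECTED_EXPONENT = 23 + IEEE_EXP_BIAS  # 150, the exponent of 2.0**23

def is_valid_export_index(index_bits, max_index):
    """Return True if the write would not be masked out by the checks above."""
    if index_bits >> 31:
        return False  # bit 31 != 0 (negative index)
    if ((index_bits >> 23) & 0xFF) != EXPECTED_EXPONENT:
        return False  # index not representable in 23 bits
    return (index_bits & 0x7FFFFF) < max_index  # index value < max index value
```

For example, 0x4B000005 with a maximum of 16 passes, while 0x4B000010 (index equal to the maximum), 0xCB000005 (sign bit set) and 0x4B800005 (exponent 151) are all rejected.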

Triang3l avatar Apr 17 '23 16:04 Triang3l

According to the shader validator from XNA:

  • "eA must be written like: mad eA, r*, Constant0100, AddressConst. No other vector op is allowed.", although there are incorrect mads in Halo 3.
  • eA must be written before eM#.
  • You must not write to eA more than once per alloc export.
  • You can write to different components of eM# from separate instructions, including not writing to some components at all (though no idea what happens if that's the case).
  • An ALU or fetch instruction with serialize ends the allocated export, though the shader is not required to have any serialize instructions between alloc exports.

So we obviously can't just export directly from the eM# write. Instead, I can see three locations where the current memexport sequence must be flushed:

  • alloc export.
  • serialize ALU/fetch instructions.
  • kill, possibly. You can place one between, for example, writes to some eM# and to other eM# of the same export. I don't know what should happen, but I think that must not result in the whole export being dropped (as it would if you discard before actually committing all those eM# writes to a UAV on the host); instead, probably only the post-kill eM# writes should be dropped, to make kill work somewhat like predicated eM# writes from this point of view. However, I have no idea what actually happens on real hardware in that case.

All this is pretty painful though when control flow (including loops) is involved. Specifically, ideally we shouldn't be inserting the huge code doing the memexport on the host at any serialize encountered, but rather, only at those that might have been preceded by an alloc export not closed by a serialize. But the wavefront can jump arbitrarily, have loops that aren't even structured (skip address not after the loop, continue address not after the beginning instruction). So before we implement memexport correctly, I think we need some construction of blocks with predecessors and successors so we can analyze whether (and which) eM# need to be flushed at any given location.

Note that ridiculous loop skip/continue targets are perfectly accepted by the validation, so we must not assume that loops are nicely structured in the input shaders.
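A rough sketch of the kind of analysis meant here: a forward "may have an open export" dataflow pass over a block graph, which makes no assumption about loops being structured. The block representation, the op names ("alloc", "serialize", anything else) and the function names are all hypothetical; this is not how the translator represents shaders, and kill and end-of-shader flushes are left out.

```python
def transfer(ops, is_open):
    """Apply a block's ops to the 'an export may be open' state."""
    for op in ops:
        if op == "alloc":
            is_open = True
        elif op == "serialize":
            is_open = False
    return is_open

def find_flush_points(blocks, successors, entry):
    """Return sorted (block, op index) pairs where a flush may be needed."""
    open_in = {name: False for name in blocks}
    reached = set()
    worklist = [entry]
    while worklist:
        name = worklist.pop()
        reached.add(name)
        open_out = transfer(blocks[name], open_in[name])
        for succ in successors.get(name, ()):
            changed = open_out and not open_in[succ]
            if changed:
                open_in[succ] = True  # state only ever goes False -> True
            if changed or succ not in reached:
                worklist.append(succ)
    flush_points = []
    for name in reached:
        is_open = open_in[name]
        for index, op in enumerate(blocks[name]):
            if op in ("alloc", "serialize") and is_open:
                flush_points.append((name, index))
            is_open = transfer([op], is_open)
    return sorted(flush_points)

blocks = {
    "start": ["other"],
    "a": ["alloc", "export"],
    "b": ["serialize", "serialize"],
    "c": ["serialize"],
}
# "b" jumps back to "start", forming a loop.
successors = {"start": ["a", "c"], "a": ["b"], "b": ["start"]}
flush_points = find_flush_points(blocks, successors, "start")
```

In this example only the first serialize in "b" can follow an unclosed alloc, so it is the only place where the export code would need to be emitted; the serialize in "c" and the second one in "b" are skipped.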

Triang3l avatar Apr 18 '23 16:04 Triang3l

53f98d1fe6e6d2b52b9a1f741f83ec5f6856e146 adds more flexibility and security to the implementation:

  • Exports can now be done any number of times by the control flow program, including in loops. They're flushed to the memory at alloc instructions now, at the end of the shader, and before killing a pixel.
  • 8-bit and 16-bit formats are now supported (via atomic and/or clearing and writing the bits into the dword).
  • Per-element bound checking is done.
  • Exports are now dropped if eA has incorrect upper bits (0b01 for the address, exponent 23 for the rest — the real hardware is supposed to do that for the index at least).
  • More correct format packing, with fixed-point format writes flushing NaN to 0 regardless of the signedness and fractional/integer.

The only remaining part until this issue can be closed is a Vulkan implementation.

Triang3l avatar May 06 '23 12:05 Triang3l

Note about interaction of memexport with sample-frequency shading on different host GPU APIs: https://github.com/xenia-project/xenia/issues/2028#issuecomment-1632291553

Triang3l avatar Jul 12 '23 11:07 Triang3l