Non-geometry-shader path for rectangle list topology
This is a follow-up to https://github.com/xenia-project/xenia/issues/596#issuecomment-416264190 . Usage of geometry shaders is one of the current blockers for running on Metal. It would be great to explore the alternatives. cc @Triang3l
That would be necessary to run Xenia with Vulkan on certain mobile devices. I'd probably use transform feedback for that: store the transformed vertices in a buffer, and then use that data in triangle lists where every 6th vertex (checked using gl_VertexIndex) would be extrapolated. That would probably require running the vertex shader with VK_PRIMITIVE_TOPOLOGY_POINT_LIST and fetching the index buffer manually (since we need the original connectivity).
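A minimal sketch of that capture pass, with the buffer write shown as a plain storage buffer store (which on Vulkan needs vertexPipelineStoresAndAtomics rather than fixed-function transform feedback) and made-up binding names:

#version 450
// Hypothetical bindings for the sketch only.
layout(std430, set = 0, binding = 0) readonly buffer XeGuestIndices { uint xe_guest_indices[]; };
layout(std430, set = 0, binding = 1) writeonly buffer XeCapturedPositions { vec4 xe_captured_positions[]; };

// Stand-in for the translated guest vertex shader.
vec4 XeRunGuestVertexShader(uint guest_index) {
  return vec4(float(guest_index), 0.0, 0.0, 1.0);
}

void main() {
  // Drawn as VK_PRIMITIVE_TOPOLOGY_POINT_LIST, one invocation per guest index.
  // The guest index buffer is fetched manually to keep the original connectivity.
  uint guest_index = xe_guest_indices[gl_VertexIndex];
  xe_captured_positions[gl_VertexIndex] = XeRunGuestVertexShader(guest_index);
  // The rasterized points themselves are unused.
  gl_Position = vec4(0.0, 0.0, 0.0, 1.0);
}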
Could you use compute shaders instead? What exactly is the use case for the geometry shaders in Xenia?
Compute shaders can be used for the same purpose as transform feedback; we just don't really need work groups there. The primary use cases are constructing the fourth vertex of rectangles defined by three vertices, and expanding points into sprites.
I was just surprised to see your suggestion to use transform feedback instead of geometry shaders, since TF is not a part of Vulkan. Compute shaders are.
I was referring to the overall concept. If you can write from vertex shaders to buffers, both vertex and compute shaders will work. But yes, maybe compute shaders will be faster for this task, since they would involve fewer parts of the pipeline; the only difference will be an if (invocationIndex >= vertexCount) return; at the beginning. As long as we don't hit some invocation count limit, if there is one.
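A minimal sketch of that compute variant with the guard, again with made-up names and bindings:

#version 450
layout(local_size_x = 64) in;
layout(std430, set = 0, binding = 0) readonly buffer XeGuestIndices { uint xe_guest_indices[]; };
layout(std430, set = 0, binding = 1) writeonly buffer XeCapturedPositions { vec4 xe_captured_positions[]; };
layout(push_constant) uniform XePushConstants { uint xe_vertex_count; };

// Stand-in for the translated guest vertex shader.
vec4 XeRunGuestVertexShader(uint guest_index) {
  return vec4(float(guest_index), 0.0, 0.0, 1.0);
}

void main() {
  uint invocation_index = gl_GlobalInvocationID.x;
  // The guard mentioned above: skip the padding invocations of the last work group.
  if (invocation_index >= xe_vertex_count) {
    return;
  }
  uint guest_index = xe_guest_indices[invocation_index];
  xe_captured_positions[invocation_index] = XeRunGuestVertexShader(guest_index);
}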
I don't think we'll be going the compute shader route as that will result in lots of render pass breaking, and, with tiling, potentially tile loads and stores if the driver decides to actually trigger them. It seems that it's pretty trivial to "merge" the logic of the vertex shader and the geometry shader at the cost of longer vertex shader execution (one shader invocation will be processing 3 vertices).
For that to work, while also not executing the host vertex shader multiple times for the same vertices, we'll need to use a triangle strip with a built-in index buffer laid out like "0, 1, 2, 3, restart, 4, 5, 6, 7, restart…" up to the 65535 vertex count limit on the Xenos. Alternatively, we can use instanced triangle strips with a fixed number of 4 vertices per instance, though for instancing, investigation is needed regarding whether vertices of different instances can be put in the same wavefront on all (primarily mobile) hardware. This will also be needed for emulating point sprites without geometry shaders, considering all 4 invocations for the same point index will output the same data; for points, we just need to make sure memexport is done only once for all four host vertices. This will still ruin the case of memexport writing to the same buffer that vfetch reads from in the shader, however, as we can't establish any execution or memory ordering between different host vertices. That can be handled only by moving the memexport for points to a compute dispatch after the draw, similar to what we'll be doing on Vulkan implementations not supporting memory writes from vertex shaders, at least in case of overlap between the vfetch and memexport memory ranges.
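To make the addressing concrete, a sketch of how the two variants recover the rectangle from the host indices (gl_VertexIndex is the value fetched from the index buffer, and restart indices never reach the vertex shader, so the built-in layout gives each rectangle a contiguous range of four values):

// Variant 1: triangle strip with primitive restart and the built-in
// "0, 1, 2, 3, restart, 4, 5, 6, 7, restart…" index buffer.
uint rectangle_index = uint(gl_VertexIndex) >> 2u;
uint rectangle_invocation = uint(gl_VertexIndex) & 3u;
// Variant 2: instanced 4-vertex triangle strips, one instance per rectangle
// (a non-indexed draw, so gl_VertexIndex is simply 0 through 3).
uint rectangle_index_instanced = uint(gl_InstanceIndex);
uint rectangle_invocation_instanced = uint(gl_VertexIndex);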
Because the index buffer will be fixed, and since multiple vertices will be processed by one invocation, the index buffer will have to be fetched manually in the shader, just like the vertex buffers, from the shared memory buffer binding. This is also needed for points, as I said earlier, as well as for endian swapping of 32-bit indices on Vulkan implementations supporting only 24-bit indices.
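One possible shape of that manual fetch, used as XeFetchVertexIndex in the code below (assuming the shared memory is bound as a uint storage buffer and the guest indices are 32-bit big-endian; the binding and constant names are made up):

layout(std430, set = 0, binding = 0) readonly buffer XeSharedMemory { uint xe_shared_memory[]; };
layout(push_constant) uniform XeVertexPushConstants {
  uint xe_index_base_dwords;  // Start of the guest index buffer, in dwords.
};

uint XeFetchVertexIndex(uint vertex_index) {
  uint index = xe_shared_memory[xe_index_base_dwords + vertex_index];
  // 32-bit endian swap of the big-endian guest index: swap the bytes within
  // the halfwords, then swap the halfwords.
  index = ((index & 0x00FF00FFu) << 8u) | ((index & 0xFF00FF00u) >> 8u);
  return (index << 16u) | (index >> 16u);
}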
Taking into account that:
0---1
|  /|
| / |  - 12 is the longest edge, strip 0123 (most commonly used)
|/  |    v3 = v0 + (v1 - v0) + (v2 - v0), or v3 = -v0 + v1 + v2
2--[3]

1---2
|  /|
| / |  - 20 is the longest edge, strip 1203
|/  |
0--[3]

2---0
|  /|
| / |  - 01 is the longest edge, strip 2013
|/  |
1--[3]
the logic of the vertex shader will be:
uint rectangle_index = gl_VertexIndex >> 2u;
uint rectangle_invocation = gl_VertexIndex & 3u;
// No v2_* because gl_Position, xe_interpolators and xe_kill can be used for v2.
vec4 v0_position, v1_position;
vec4 v0_interpolators[INTERPOLATOR_COUNT], v1_interpolators[INTERPOLATOR_COUNT];
float v0_kill, v1_kill;
uint triangle_vertex_index;
[[unroll]] for (triangle_vertex_index = 0u; triangle_vertex_index < 3u; ++triangle_vertex_index) {
  r0.x = XeFetchVertexIndex(rectangle_index * 3u + triangle_vertex_index);
  // Here, run the guest vertex shader writing to gl_Position, xe_interpolators, xe_kill.
  // If memexport is used, eA/eM must be written only if triangle_vertex_index == rectangle_invocation.
  // This will make sure memexport is done only once per triangle vertex, and not done for the 4th vertex at all.
  if (!vtx_w0_fmt) {
    gl_Position.w = 1.0 / gl_Position.w;
  }
  if (triangle_vertex_index == 0u) {
    v0_position = gl_Position;
    v0_interpolators = xe_interpolators;
    v0_kill = xe_kill;
  } else if (triangle_vertex_index == 1u) {
    v1_position = gl_Position;
    v1_interpolators = xe_interpolators;
    v1_kill = xe_kill;
  }
}
// Find the longest edge and get the order of the vertices in the strip.
vec2 edge_12 = gl_Position.xy - v1_position.xy;
vec2 edge_20 = v0_position.xy - gl_Position.xy;
vec2 edge_01 = v1_position.xy - v0_position.xy;
vec3 edge_squares = vec3(dot(edge_12, edge_12), dot(edge_20, edge_20), dot(edge_01, edge_01));
uint strip_start_vertex_index;
if (edge_squares.x > edge_squares.y && edge_squares.x > edge_squares.z) {
  // 0123 - mirror across 12.
  strip_start_vertex_index = 0u;
} else if (edge_squares.y > edge_squares.z) {
  // 1203 - mirror across 20.
  strip_start_vertex_index = 1u;
} else {
  // 2013 - mirror across 01.
  strip_start_vertex_index = 2u;
}
if (rectangle_invocation < 3u) {
  // This vertex belongs to the triangle consisting of the original vertices.
  // Choose which vertex in the strip this invocation belongs to.
  uint strip_vertex_index = strip_start_vertex_index + rectangle_invocation;
  if (strip_vertex_index >= 3u) {
    // Modulo 3.
    strip_vertex_index -= 3u;
  }
  if (strip_vertex_index == 0u) {
    gl_Position = v0_position;
    xe_interpolators = v0_interpolators;
  } else if (strip_vertex_index == 1u) {
    gl_Position = v1_position;
    xe_interpolators = v1_interpolators;
  } else {
    // v2 - the final variables already contain the values for v2 from the last loop iteration.
  }
} else {
  // This vertex is the fourth vertex, constructed by mirroring.
  // Take the vertex to mirror (the "origin" of the resulting parallelogram) with the minus
  // sign, the rest with the plus sign.
  vec3 signs = mix(vec3(1.0), vec3(-1.0),
                   equal(uvec3(strip_start_vertex_index), uvec3(0u, 1u, 2u)));
  gl_Position = v0_position * signs.x + v1_position * signs.y + gl_Position * signs.z;
  // (In actual GLSL this would be a per-interpolator loop, as arrays can't be multiplied directly.)
  xe_interpolators = v0_interpolators * signs.x + v1_interpolators * signs.y + xe_interpolators * signs.z;
}
if (vtx_xy_fmt) {
  gl_Position.xy *= gl_Position.w;
}
if (vtx_z_fmt) {
  gl_Position.z *= gl_Position.w;
}
// Here, apply user clip planes with the final position.
// Here, handle primitive kill as vertex kill flag AND or OR (depending on PA_CL_CLIP_CNTL) by writing NaN to gl_Position.
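For that last comment, a minimal sketch of what the primitive kill could look like, assuming the kill flags are 0.0/1.0 floats, that v2's flag is still in xe_kill at this point, and a hypothetical vtx_kill_or flag decoded from PA_CL_CLIP_CNTL:

// OR mode kills the primitive if any vertex sets the flag, AND mode only if
// all three do.
float primitive_kill = vtx_kill_or ? max(max(v0_kill, v1_kill), xe_kill)
                                   : min(min(v0_kill, v1_kill), xe_kill);
if (primitive_kill != 0.0) {
  // A NaN position makes the host cull the whole primitive.
  gl_Position = vec4(uintBitsToFloat(0x7FC00000u));
}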
The [[unroll]] in the shader above means spv::LoopControlUnrollMask. This can be used to suggest that the driver unroll the loop and eliminate the conditional copying for the first two vertices, without us actually unrolling the loop ourselves (which would require reworking the architecture of the shader translator for it to be able to translate the guest shader multiple times within the host shader). Alternatively, a function call may be used to invoke the translated guest shader multiple times, but that would probably make variable management (especially related to memexport) less convenient.
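A rough illustration of that function call alternative, entirely hypothetical and only to show the shape of the problem (all outputs have to be bundled into a struct instead of being written to globals, and memexport predication becomes a parameter):

const uint INTERPOLATOR_COUNT = 16u;  // Assumption for the sketch.

struct XeGuestVertexOutput {
  vec4 position;
  vec4 interpolators[INTERPOLATOR_COUNT];
  float kill;
};

XeGuestVertexOutput XeRunTranslatedGuestShader(uint guest_vertex_index, bool do_memexport) {
  XeGuestVertexOutput result;
  result.position = vec4(0.0, 0.0, 0.0, 1.0);
  for (uint i = 0u; i < INTERPOLATOR_COUNT; ++i) {
    result.interpolators[i] = vec4(0.0);
  }
  result.kill = 0.0;
  // Here, the translated guest instructions would write to result, with eA/eM
  // writes predicated on do_memexport.
  return result;
}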