RayTracingInVulkan

Performance investigations on RDNA2 cards.

Open PMunkes opened this issue 3 years ago • 23 comments

I ran the ray tracer through AMD's GPU profiler to check how it runs on my RX 6800. (I will upload the results in a different issue.) It reported that the RT shader is limited to half occupancy by its LDS usage of 4 KB, and that it uses 80 vector registers. Decreasing LDS usage to 3072 B could increase occupancy to 12 parallel wavefronts (warps on NVIDIA hardware), the maximum for 80 vector registers. This should improve performance, as less time is spent idle.

Reducing LDS usage to 2048 B would allow further optimizations to VGPR (Vector General Purpose Register) usage. Reducing VGPRs to 64 would allow full occupancy and presumably maximum performance.

Edit: I believe LDS is AMD's name for workgroup memory.
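
As a rough back-of-the-envelope check of those numbers (a sketch based on my assumptions of 1024 VGPRs per SIMD and at most 16 resident wavefronts per SIMD32 on RDNA2; these limits are not something the profiler reported):

```cpp
#include <algorithm>
#include <cstdio>

// Estimate how many wavefronts can be resident per SIMD given the VGPR
// count of one wave. Assumed RDNA2 limits (my assumption): 1024 VGPRs per
// SIMD and at most 16 wave slots per SIMD32.
int wavesLimitedByVgprs(int vgprsPerWave)
{
    const int vgprsPerSimd = 1024;
    const int maxWavesPerSimd = 16;
    return std::min(maxWavesPerSimd, vgprsPerSimd / vgprsPerWave);
}

int main()
{
    std::printf("80 VGPRs -> %d waves\n", wavesLimitedByVgprs(80)); // 12, as above
    std::printf("64 VGPRs -> %d waves\n", wavesLimitedByVgprs(64)); // 16, full occupancy
}
```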

PMunkes avatar Feb 06 '21 15:02 PMunkes

This is way more complicated than the original comment stated. I need to study this more.

PMunkes avatar Feb 06 '21 19:02 PMunkes

Yes, I think it might be specific to AMD as well, since they use the normal shader resources for BVH traversals. Register usage might be different on other cards.

I tried to use NVIDIA Nsight Graphics 2021.1.0 to profile the Vulkan ray tracing shaders in Scene 1 last night. It hinted that the main bottlenecks were L1TEX (I assume the L1 cache) and MIO (memory IO), but the feature is in beta and light on information. It doesn't yet seem to correlate with the source code (or even the SPIR-V bytecode) in all but the simplest cases, so it's hard to identify the actual location of bottlenecks in the shaders.

GPSnoopy avatar Feb 07 '21 14:02 GPSnoopy

Yeah, I tried to use AMD's static analyzer (RGA) yesterday, but they have not released an update with the latest SPIR-V tools.

AMD's Vulkan traversal code seems under-optimized at the moment, especially compared with the DX12 path. I have attached some traces. Metro Exodus seems to be pretty much the most optimized RT title on AMD that I have found so far, and even Control (which runs awfully) seems to use half the LDS of the Vulkan examples.

Traces (screenshots attached):

- This project
- Quake2 RTX
- Metro Exodus (DX12)
- Control ("Corridor of DOOM", as Digital Foundry puts it); RT passes: BVH, reflections, diffuse reflections, GI, contact shadows

PMunkes avatar Feb 07 '21 16:02 PMunkes

I think I figured out the performance difference. Take a look at the attached heatmap (RT-Heatmap): it is relatively coarse, specifically batches of 8x8 pixels, or 64 pixels. This means the wavefront runs until the last ray in the batch has terminated. Combined with the currently relatively low occupancy, this could explain the odd results.

PMunkes avatar Feb 09 '21 17:02 PMunkes

It could probably be faster to trace a single sample per shader invocation and dispatch a bunch of samples per pixel at the same time. No idea if that's possible, though.
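
For what it's worth, on the host side that idea could look something like this (just a sketch with placeholder names, not this project's code; the raygen shader would then read its sample index from gl_LaunchIDEXT.z instead of looping over samples):

```cpp
#include <vulkan/vulkan.h>

// Sketch: instead of looping over N samples inside the raygen shader,
// launch one invocation per sample via the depth dimension of the trace
// call and accumulate the results per pixel. All names are placeholders.
void TraceOneSamplePerInvocation(
    VkCommandBuffer cmd,
    const VkStridedDeviceAddressRegionKHR& raygenSbt,
    const VkStridedDeviceAddressRegionKHR& missSbt,
    const VkStridedDeviceAddressRegionKHR& hitSbt,
    const VkStridedDeviceAddressRegionKHR& callableSbt,
    uint32_t width, uint32_t height, uint32_t samplesPerPixel)
{
    // Current approach: vkCmdTraceRaysKHR(cmd, ..., width, height, 1) with an
    // N-sample loop in the shader. Alternative: one ray per invocation.
    vkCmdTraceRaysKHR(cmd, &raygenSbt, &missSbt, &hitSbt, &callableSbt,
                      width, height, samplesPerPixel);
}
```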

PMunkes avatar Feb 09 '21 17:02 PMunkes

This hypothesis seems convincing. I ran the ray tracer at 5120x2880 and at 1280x720 (screenshots attached). The 64-pixel blocks are very visible in 720p and cover a lot of the central statue, while at 2880p the hot blocks are less widely distributed. This looks much closer to the NVIDIA heatmap in README.md and corresponds more to the naively expected behaviour.

To test whether this improves performance, I ran the benchmark at different resolutions but with an identical ray count per frame (2880p has 16x the pixels of 720p, so 1280x720x16 samples = 5120x2880x1 sample = 14,745,600 rays), 60 s per scene:

| Resolution \ Scene | Rays per frame | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 |
| --- | --- | --- | --- | --- | --- | --- |
| 720p, 16 samples | 14,745,600 | 70.02 | 69.37 | 30.60 | 55.71 | 20.23 |
| 2880p, 1 sample | 14,745,600 | 71.67 | 70.76 | 35.51 | 54.79 | 20.33 |
| Performance of higher res | 100% | 102.36% | 102% | 116.05% | 98.35% | 100.5% |

The performance is highly scene dependent, with most scenes being more or less within the margin of error. Scene 3 with the Lucy statues is improved by quite a lot, just as expected from the heatmaps.

PMunkes avatar Feb 09 '21 22:02 PMunkes

The bad performance in the Cornell box scenes makes a lot of sense, since only very few triangles (36, if I count correctly) make up the scene. Ampere doubled ray-triangle intersection throughput, and ray-triangle intersection is basically all that happens in the Cornell box. On real models like Lucy, the BVH hierarchy should be much deeper. If we assume a BVH4 (as the RDNA2 ISA guide implies), then the 448K triangles of each Lucy statue should sit in a BVH more than 10 levels deep, while the Cornell box should fit in a BVH about 5 levels deep in total. The Cornell box is essentially a micro-benchmark of ray-triangle performance, so it's no wonder that Ampere runs so fast here.
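
Spelling out the depth estimate (a naive sketch assuming a perfectly balanced BVH4 with one triangle per leaf; real BVHs pack several triangles per leaf and sit below a TLAS/instance level, so actual depths will differ somewhat):

```cpp
#include <cmath>
#include <cstdio>

// Naive BVH4 depth estimate: depth ~= ceil(log4(triangleCount)).
int estimatedBvh4Depth(double triangleCount)
{
    return static_cast<int>(std::ceil(std::log(triangleCount) / std::log(4.0)));
}

int main()
{
    std::printf("Lucy (~448K tris): ~%d levels\n", estimatedBvh4Depth(448000.0)); // ~10
    std::printf("Cornell box (36 tris): ~%d levels\n", estimatedBvh4Depth(36.0)); // ~3 (before TLAS/leaf overhead)
}
```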

PMunkes avatar Feb 09 '21 23:02 PMunkes

I did some more testing and found a way to force wave32 execution on the RT shaders on RDNA2, which improved performance by ~6%. Moving line 101 directly behind the load in line 85 reduced the number of vector registers used and increased performance further. FPS table:

| Execution mode | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 |
| --- | --- | --- | --- | --- | --- |
| wave32 optimized | 48.78 | 48.14 | 23.12 | 36.26 | 13.83 |
| wave32 | 43.67 | 43.09 | 20.13 | 33.09 | 12.04 |
| wave64 (base) | 40.99 | 40.66 | 18.94 | 31.59 | 11.29 |

Improvements over base:

| Execution mode | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 |
| --- | --- | --- | --- | --- | --- |
| wave32 optimized | 19.00% | 18.40% | 22.07% | 14.78% | 22.50% |
| wave32 | 6.54% | 5.98% | 6.28% | 4.75% | 6.64% |
| wave64 (base) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |

Unfortunately there is currently no way to indicate a preference for wave32, outside the compiler heuristics.

I will open a pull request later.

PMunkes avatar Mar 08 '21 18:03 PMunkes

I recently found AMD's video on GPUOpen that covers some performance tips for DXR 1.1: https://gpuopen.com/videos/amd-rdna2-directx-raytracing/

At least in DirectX 12, AMD recommends moving TraceRay to the compute queue and dispatching in 8x4 tiles. This dispatch size appears to line up with your findings, namely the optimality of wave32 and the LDS pressure.

CasperTheCat avatar Mar 19 '21 07:03 CasperTheCat

It does feel like it's on AMD to do these optimisations automatically in their JIT compiler. Keep in mind that NVIDIA's Vulkan ray tracing performance has pretty much doubled since it was introduced two years ago, purely thanks to driver improvements. I'm hoping AMD can address some of the low-hanging fruit relatively quickly.

GPSnoopy avatar Mar 19 '21 10:03 GPSnoopy

Well, the newest driver automatically defaults to wave32, so that's good. That makes the first part of my pull request unnecessary.

PMunkes avatar Mar 20 '21 00:03 PMunkes

The second part also no longer does anything, so they are indeed working on it.

PMunkes avatar Mar 20 '21 00:03 PMunkes

I think I have an avenue for some improvements (Page 21): http://www.cs.uu.nl/docs/vakken/magr/2016-2017/slides/lecture%2003%20-%20the%20perfect%20BVH.pdf

PMunkes avatar May 10 '21 09:05 PMunkes

Split box-quads into smaller polygons?

GPSnoopy avatar May 10 '21 12:05 GPSnoopy

That was my idea.

PMunkes avatar May 10 '21 15:05 PMunkes

I'm currently implementing a (very) primitive function that splits all triangles given to it in half. It does this recursively.

PMunkes avatar May 10 '21 16:05 PMunkes

This was a dead end, unfortunately. I only got performance degradation using this "tessellation" function:

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>
// Vertex, vec3 and distance() come from the project's asset/glm headers.

// Recursively splits every triangle along its longest edge, doubling the
// triangle count on each pass (2^depth in total).
std::function<void(std::vector<Vertex>&, std::vector<uint32_t>&, int)> divideTriangles =
	[&](std::vector<Vertex>& Vertices, std::vector<uint32_t>& indices, int depth)
{
	if (depth <= 0) return;
	// Step by 6: each original triangle becomes two (the second one is
	// inserted directly behind it below), and both are skipped this pass.
	for (size_t i = 0; i < indices.size(); i += 6)
	{
		// Find the longest edge of the triangle.
		const std::array<std::pair<uint32_t, uint32_t>, 3> edges{ {{0, 1}, {0, 2}, {2, 1}} };
		double length = 0.0;
		size_t pair = 0;
		for (size_t j = 0; j < 3; j++)
		{
			const auto edge = edges.at(j);
			const vec3 pos1 = Vertices.at(indices.at(i + edge.first)).Position;
			const vec3 pos2 = Vertices.at(indices.at(i + edge.second)).Position;
			const auto dist = distance(pos1, pos2);
			if (length < dist)
			{
				length = dist;
				pair = j;
			}
		}

		// Create a new vertex at the midpoint of the longest edge.
		const auto edge = edges.at(pair);
		const vec3 pos1 = Vertices.at(indices.at(i + edge.first)).Position;
		const vec3 pos2 = Vertices.at(indices.at(i + edge.second)).Position;
		vec3 difference = pos1 - pos2;
		difference /= 2.0;
		Vertex newVertex = Vertices.at(indices.at(i + edge.second));
		newVertex.Position += difference;
		const uint32_t VertexIndex = static_cast<uint32_t>(Vertices.size());
		Vertices.push_back(newVertex);

		// Prepare the two child triangles sharing the new midpoint vertex.
		std::vector<uint32_t> newindices;
		switch (pair) {
		case 0:
			newindices = { indices.at(i), indices.at(i + 2), VertexIndex, indices.at(i + 1), indices.at(i + 2), VertexIndex };
			break;
		case 1:
			newindices = { indices.at(i), indices.at(i + 1), VertexIndex, indices.at(i + 1), indices.at(i + 2), VertexIndex };
			break;
		case 2:
			newindices = { indices.at(i), indices.at(i + 1), VertexIndex, indices.at(i), indices.at(i + 2), VertexIndex };
			break;
		}

		// Overwrite the original triangle with the first child and insert the
		// second child right behind it (at i + 3 rather than a fixed offset).
		for (size_t j = 0; j < 3; j++)
		{
			indices.at(i + j) = newindices.at(j);
		}
		indices.insert(indices.begin() + i + 3, newindices.begin() + 3, newindices.end());
	}
	divideTriangles(Vertices, indices, depth - 1);
};
```
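
For context, a hypothetical call site (placeholder vectors, not the project's actual loading code); each recursion pass doubles the triangle count:

```cpp
// Hypothetical usage sketch (placeholder data, not the project's actual code).
std::vector<Vertex> vertices;   // filled from a loaded model
std::vector<uint32_t> indices;  // 3 indices per triangle
divideTriangles(vertices, indices, 6); // 2^6 = 64x the original triangle count
```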

PMunkes avatar May 10 '21 18:05 PMunkes

On the plus side, I now have an excellent tool to test the importance of the L3 cache for RT on RDNA2.

PMunkes avatar May 10 '21 19:05 PMunkes

Not too surprising. Internally you would expect the drivers to do this if it was beneficial.

GPSnoopy avatar May 10 '21 20:05 GPSnoopy

It was an experiment. I was fascinated by how the performance dropped when increasing the "tessellation factor". Scene 5 was completely unaffected (probably due to the complex Lucy model), while Scene 4 dropped by ~30% when I increased the polygon count to 64x.

PMunkes avatar May 10 '21 21:05 PMunkes

I would love to see how Ampere would fare with this. The only thing that should change with the increased number of triangles is the depth and size of the BVH; the number of ray-triangle intersections should stay exactly the same.

PMunkes avatar May 11 '21 12:05 PMunkes

I'm just passing through!

> Unfortunately there is currently no way to indicate a preference for wave32, outside the compiler heuristics.

Does VK_EXT_subgroup_size_control work for RayTracing PSO? That extension lets you explicitly control wave32/64 mode.
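
For reference, a sketch of what forcing wave32 through that extension could look like at pipeline-creation time (placeholder names; whether ray tracing stages honour it depends on the driver reporting them in VkPhysicalDeviceSubgroupSizeControlPropertiesEXT::requiredSubgroupSizeStages):

```cpp
#include <vulkan/vulkan.h>

// Sketch: request a subgroup size of 32 for a ray generation stage via
// VK_EXT_subgroup_size_control. requiredSize must outlive pipeline creation.
void FillWave32RaygenStage(
    VkShaderModule raygenModule, // placeholder: compiled raygen shader
    VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT& requiredSize,
    VkPipelineShaderStageCreateInfo& stage)
{
    requiredSize = {};
    requiredSize.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
    requiredSize.requiredSubgroupSize = 32; // wave32

    stage = {};
    stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage.pNext = &requiredSize; // chain the required-size struct
    stage.stage = VK_SHADER_STAGE_RAYGEN_BIT_KHR;
    stage.module = raygenModule;
    stage.pName = "main";
    // The stage then goes into VkRayTracingPipelineCreateInfoKHR::pStages.
}
```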

darksylinc avatar Jul 12 '21 02:07 darksylinc

Also passing through. I'm currently researching the relative performance between Vulkan Ray Tracing and DXR. Do you guys have an impression of this? Has it more or less reached parity? I'm only interested in open APIs like Vulkan, but it would be good to be aware of any shortcomings (if any) that it might currently have compared to DX12.

Also hoping to see Vulkan Ray Tracing supported efficiently on Metal via MoltenVK. Wonder how far off that'll be.

unphased avatar Jan 29 '22 04:01 unphased