
General questions

Open Pjbomb2 opened this issue 2 years ago • 25 comments

Don't know where else to ask this, but this project is super impressive and I was wondering where you learned how to implement all of these things, as it's something I'm interested in myself.

Also, for building the CWBVH, does it use the SBVH or the standard binary BVH as its base? If it uses the standard binary BVH, where did you learn how to implement that? I'm trying to myself (to work up to the CWBVH), but keep having issues.

Also, while I don't see anything in the code for it, is there a way for me to visually inspect the BVH itself? Again, I'm trying to implement it in my own project to learn, but since I don't know what the original looks like, it's hard to see where I'm going wrong. Thanks, and sorry for the trouble!

Pjbomb2 avatar Oct 18 '21 17:10 Pjbomb2

This project started out as an assignment for a university course on graphics, so that's where I got the basics behind SAH BVHs etc. The stuff after that is mostly from the papers mentioned in the readme.

CWBVH uses the standard BVH as its starting point. The SBVH optimizes its spatial splits for a binary (2-way) tree, which becomes somewhat pointless if you then collapse it into an 8-way tree. IIRC the original paper mentions spatial splits as possible future work, but I don't think there was any follow-up on that.

There is no BVH visualization support inside the engine, but a nice trick I have personally used while debugging BVH problems is to write AABBs into OBJ files (see https://github.com/jan-van-bergen/GPU-Pathtracer/blob/4fc474d50e13376cd59f22ff7affb7e4b3c25cd0/Src/Math/AABB.h#L56). That way you can visualize the BVH inside Blender for example. Bit of a hack, but simple and effective.

jan-van-bergen avatar Oct 18 '21 21:10 jan-van-bergen

Thanks! Where do I call that debug bit to create an OBJ from the whole BVH? I might be able to put it in Unity and use the overdraw view to see the depths better. By the way, I've been trying to convert this into C#, but the BVHs I've built so far keep missing triangles (I don't know what the original one looks like, as it seems that data is encrypted, or at least not easily readable by default text programs). Also, what structure do you pass to the compute shader for storing the triangles and the BVH? I've been trying to dissect your code, as you have done a lot of things I aimed to do but couldn't find much on, but I've had trouble deciphering how each bit works (due to my own inability to read large interconnected code). Thank you for your time!

Pjbomb2 avatar Oct 22 '21 15:10 Pjbomb2

You can use this to export an entire BVH:


void export_aabb(const AABB & aabb, FILE * file, int index) {
	Vector3 vertices[8] = {
		Vector3(aabb.min.x, aabb.min.y, aabb.min.z),
		Vector3(aabb.min.x, aabb.min.y, aabb.max.z),
		Vector3(aabb.max.x, aabb.min.y, aabb.max.z),
		Vector3(aabb.max.x, aabb.min.y, aabb.min.z),
		Vector3(aabb.min.x, aabb.max.y, aabb.min.z),
		Vector3(aabb.min.x, aabb.max.y, aabb.max.z),
		Vector3(aabb.max.x, aabb.max.y, aabb.max.z),
		Vector3(aabb.max.x, aabb.max.y, aabb.min.z)
	};

	int faces[36] = {
		1, 2, 3, 1, 3, 4,
		1, 2, 6, 1, 6, 5,
		1, 5, 8, 1, 8, 4,
		4, 8, 7, 4, 7, 3,
		3, 7, 6, 3, 6, 2,
		5, 6, 7, 5, 7, 8
	};

	for (int v = 0; v < 8; v++) {
		fprintf_s(file, "v %f %f %f\n", vertices[v].x, vertices[v].y, vertices[v].z);
	}

	for (int f = 0; f < 36; f += 3) {
		fprintf_s(file, "f %i %i %i\n", 8*index + faces[f], 8*index + faces[f+1], 8*index + faces[f+2]);
	}
}

void export_node(const BVH & bvh, FILE * file, int node_index, int & aabb_index) {
	const BVHNode2 & node = bvh.nodes_2[node_index];

	fprintf(file, "o Node_%i\n", node_index);
	export_aabb(node.aabb, file, aabb_index++);

	if (node.is_leaf()) return;

	export_node(bvh, file, node.left,     aabb_index);
	export_node(bvh, file, node.left + 1, aabb_index);
}

void export_bvh(const BVH & bvh) {
	FILE * file; fopen_s(&file, "bvh.obj", "wb");

	int aabb_index = 0;
	export_node(bvh, file, 0, aabb_index);

	fclose(file);
}

The .bvh files that I think you are referring to are a custom binary format. Reading and writing is done in https://github.com/jan-van-bergen/GPU-Pathtracer/blob/master/Src/Assets/BVHLoader.cpp. Essentially, the format consists of a header, followed by the triangles, followed by the BVH nodes, followed by the indices. For example, reading the header in C# would look something like this:


using System.IO;

struct BVHHeader
{
	public uint filetype_identifier;
	public byte filetype_version;

	public byte underlying_bvh_type;
	public byte bvh_is_optimized;
	public byte pad;
	public float sah_cost_node;
	public float sah_cost_leaf;

	public int num_triangles;
	public int num_nodes;
	public int num_indices;
}

static class Ext
{
	public static BVHHeader ReadBVHHeader(this BinaryReader reader)
	{
		return new BVHHeader
		{
			filetype_identifier = reader.ReadUInt32(),
			filetype_version = reader.ReadByte(),
			underlying_bvh_type = reader.ReadByte(),
			bvh_is_optimized = reader.ReadByte(),
			pad = reader.ReadByte(),
			sah_cost_node = reader.ReadSingle(),
			sah_cost_leaf = reader.ReadSingle(),
			num_triangles = reader.ReadInt32(),
			num_nodes = reader.ReadInt32(),
			num_indices = reader.ReadInt32(),
		};
	}
}

class Program
{
	static void Main(string[] args)
	{
		using BinaryReader reader = new BinaryReader(File.Open("Path/To/File.bvh", FileMode.Open));
		BVHHeader header = reader.ReadBVHHeader();
	}
}

Then afterwards you can read num_triangles Triangles, num_nodes BVHNodes, and num_indices ints.

As far as uploading to the GPU goes, that happens here: https://github.com/jan-van-bergen/GPU-Pathtracer/blob/33896a93c3772b8f81719a9b4441f44f87a4a50e/Src/Pathtracer/Pathtracer.cpp#L199 The Triangles and BVHNodes of all Meshes in the Scene are each put into one giant aggregated array. The triangle array is permuted according to the indices array, so the indices don't need to be uploaded to the GPU. The triangle array is copied to this CUDA pointer: https://github.com/jan-van-bergen/GPU-Pathtracer/blob/33896a93c3772b8f81719a9b4441f44f87a4a50e/CUDA_Source/Raytracing/Triangle.h#L13 The BVHNodes are copied to this CUDA pointer (in the case of the CWBVH): https://github.com/jan-van-bergen/GPU-Pathtracer/blob/33896a93c3772b8f81719a9b4441f44f87a4a50e/CUDA_Source/Raytracing/CWBVH.h#L19
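A minimal sketch of that permutation step (the names here are illustrative, not the actual ones from Pathtracer.cpp):

#include <vector>

// Reorder the triangles so the BVH's index indirection is baked in:
// afterwards the BVH leaves reference triangles by position directly,
// and the separate index array never has to be uploaded to the GPU.
template <typename Triangle>
std::vector<Triangle> permute_triangles(const std::vector<Triangle> & triangles, const std::vector<int> & indices) {
	std::vector<Triangle> permuted(indices.size());
	for (size_t i = 0; i < indices.size(); i++) {
		permuted[i] = triangles[indices[i]];
	}
	return permuted;
}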

jan-van-bergen avatar Oct 30 '21 14:10 jan-van-bergen

Thank you so much! This has been really helpful, though there is at least one aspect of your basic BVH that confuses me and that I cannot find an explanation for. Why is the node count of your SAH BVH so low (the node count variable from the bvh struct within the mesh_data struct)? For a model I tried with roughly 950 triangles, the BVH says it has something like 350 nodes. How does this work? If it splits down to 1 triangle per node, how can it have fewer nodes than there are triangles (and drastically so)? I'm still trying to convert your base BVH to C#, and while I have gotten a lot closer, it's still nowhere near the same: for the same model, using your code above to export it to an OBJ, your binary SAH BVH produces an OBJ with 4044 triangles, whereas my C# version produces one with 23220 triangles.

Thank you so much for your time and responses!

Edit: OK, I managed to figure out that it's because you are collapsing the BVH. However, my output BVH still doesn't come close to matching yours, so more investigation is needed. Sorry for the trouble.

Pjbomb2 avatar Nov 02 '21 18:11 Pjbomb2

So, thank you so much for your help thus far, I'm getting close, but there is one thing now that really confuses me. How and why do you send the BVH node information to the GPU via a char pointer? I've never seen something like that done before, and I have no idea how it works, nor how it gets converted back from chars into the relevant information needed in the trace functions. How and why do you do this instead of sending it via a structured type like you do with the triangle arrays? Thank you!

Pjbomb2 avatar Nov 08 '21 16:11 Pjbomb2

I use a char pointer to just treat the memory as a sequence of bytes (each char being one byte). I do this since the BVH node array may be BVHNode2, BVHNode4, or BVHNode8 depending on the config used.

For example here the char array is reinterpreted as a specific BVH Node type depending on the config: https://github.com/jan-van-bergen/GPU-Pathtracer/blob/30fd78637b28247c9334510a691a926011b47d47/Src/BVH/BVH.cpp#L3

The char array is then copied over to the GPU (specifically to one of bvh_nodes, qbvh_nodes, or cwbvh_nodes pointers) where it is again reinterpreted as a specific BVH Node type.

The way to think about this is that memory is just a sequence of bytes, and a type tells you how to interpret those bytes. For example, you could have a CUDA-side variable declaration like this: __device__ float number;. But if we then copy the unsigned value 0x3f800000 from the CPU to the address of number, you will see that number gets the floating-point value 1.0f, since 0x3f800000 is the byte representation of a floating-point number with value 1.
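The same reinterpretation can be demonstrated entirely on the CPU; this is just a plain C++ sketch, not code from the repository:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
	uint32_t bits = 0x3f800000;

	// Reinterpret the same 4 bytes as a float. memcpy is the well-defined
	// way to do this in C++ (no strict-aliasing violations).
	float number;
	std::memcpy(&number, &bits, sizeof(number));

	printf("%f\n", number); // prints 1.000000
	return 0;
}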

So in this case the char pointer is just a workaround to be able to represent different BVH Node types using the same pointer. Now that I look at this code again, I don't think it's a clean solution so I may end up trying to refactor this a bit.

jan-van-bergen avatar Nov 08 '21 18:11 jan-van-bergen

Thank you so much for your help thus far, it's been really useful and I'm getting really close. However, I have some more questions about things I used to know but now can't find. I seem to recall that the standard BVH builder calls the BVH collapser by default, including when preparing to build the CWBVH, but I can't find that code anymore. Am I correct that it does this? If so, doesn't the CWBVH use single-triangle leaf nodes, while the BVH collapser returns a BVH whose leaf nodes can contain more than one triangle? How does this all fit together? Thank you for your help, I'm learning a huge amount here!

Edit: Sorry, never mind, I figured out that if the desired BVH type is CWBVH, it doesn't collapse the BVH. Sorry for the trouble. I did manage to get both the uncollapsed and collapsed BVHs working, though interestingly enough, with the same performance for both.

Pjbomb2 avatar Nov 13 '21 21:11 Pjbomb2

OK, so I'm finally on to trying to convert the CWBVH traversal code, but oh my, does it confuse me. I was wondering whether it's even possible to run this in a monolithic pathtracer instead of a wavefront pathtracer, as that's what I'm trying right now? Sorry, bitwise things like this tend to really confuse me, but I've gotten this far. Thank you!

Pjbomb2 avatar Nov 21 '21 19:11 Pjbomb2

Yes it would certainly still work, although it may be slightly less efficient.

Currently the CWBVH traversal kernel looks roughly like this:

while (true) {
	if (inactive) {
		// ... initialize ray (load ray from memory)
		// ... calc inverse direction, ray octant etc
	}
	
	do {
		// Actual work goes here
		// https://github.com/jan-van-bergen/GPU-Pathtracer/blob/ae6eeaa93d9ee52d8c048aa2c12c8719d5255026/CUDA_Source/Raytracing/CWBVH.h#L150
		uint2 triangle_group;
		
		// ... etc
	} while (iterations_lost < threshold)
}

Within a megakernel a CWBVH function would look like this:

// ... initialize ray (no need to load from memory, pass as argument)
// ... calc inverse direction, ray octant etc

while (true) {
	// Actual work goes here, see:
	// https://github.com/jan-van-bergen/GPU-Pathtracer/blob/ae6eeaa93d9ee52d8c048aa2c12c8719d5255026/CUDA_Source/Raytracing/CWBVH.h#L150
	uint2 triangle_group;
	
	// ... etc
}

The main difference is that the original has the additional do-while loop. Based on a heuristic (the iterations_lost counter) it detects that not enough threads are currently active within the warp, and then loads new rays for those threads, which reduces divergence. With the megakernel approach you can't really do this, so you can just ignore the iterations_lost logic completely.

jan-van-bergen avatar Nov 22 '21 10:11 jan-van-bergen

OK, I'm understanding this all a lot better now. A few more things I have hit issues with, however. First, does the bvh.aggregate function transform the CWBVH node layout produced by the CWBVH builder into the layout of the cwbvh_nodes used by the shaders, or does that happen somewhere else? Also, what does sign_extend_s8x4 (in CWBVH.h) do? I see you use assembly for this, which, to be honest, makes it a complete mystery to me. Also, all the unsigned used in CWBVH.h: am I misunderstanding how unsigned works, or are they unsigned ints, floats, or something else? Thank you for your time and help with this, it's really appreciated.

Pjbomb2 avatar Nov 22 '21 18:11 Pjbomb2

No, bvh.aggregate does not convert the layout; the CWBVH builder directly outputs the correct layout. The only thing the bvh.aggregate function does is deal with the Top Level Acceleration Structure (TLAS). The TLAS may contain several Bottom Level (BLAS) BVHs, one for each Mesh. What bvh.aggregate specifically does is convert local indices to global indices. For example, each BVH references its first Triangle as Triangle 0, but once the Triangles of all Meshes are put into a single aggregated array that is no longer true, so an offset is added to each Triangle index to correct for it. If you are not dealing with TLAS/BLAS stuff you can safely ignore this.
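A minimal sketch of that local-to-global fixup, with illustrative names rather than the actual aggregate code:

#include <vector>

struct NodeExample {              // simplified stand-in for a BVH node
	int  first_triangle_or_child; // leaf: first triangle index, internal: left child index
	int  count;                   // number of triangles (0 for internal nodes)
	bool is_leaf() const { return count > 0; }
};

// When appending one Mesh's BLAS into the global arrays, shift its local
// indices by the offsets at which this Mesh's nodes/triangles start.
void aggregate_blas(std::vector<NodeExample> & global_nodes, const std::vector<NodeExample> & local_nodes, int node_offset, int triangle_offset) {
	for (NodeExample node : local_nodes) {
		node.first_triangle_or_child += node.is_leaf() ? triangle_offset : node_offset;
		global_nodes.push_back(node);
	}
}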

The sign_extend_s8x4 is explained in the paper by Ylitie et al. as "implemented using the PRMT instruction, sign extends each byte in a 32-bit word individually with a single assembly instruction, producing a byte mask for the internal nodes". What this means is that, for every byte in a 4-byte word, it checks whether the most significant bit is set; if so, the whole byte becomes 0xff (all ones), otherwise the byte becomes zero. If you are using something like GLSL, which does not directly expose PTX assembly, you can instead implement this behaviour with something like this:

uint sign_extend_s8x4(uint x) {
	// For each byte:
	// 1) Shift most significant bit to least significant bit
	// 2) Mask least significant bit
	// 3) Multiply by 0xff to convert least significant bit into byte mask
	return ((x >> 7) & 0x01010101) * 0xff;
}

As far as unsigned goes, they are just unsigned ints.

jan-van-bergen avatar Nov 22 '21 19:11 jan-van-bergen

OK! Thank you so much! I can't find where the BVHNode8 is converted into the 5 float4s that are used instead of the BVHNode8 struct? Also, I am using HLSL for the shader code. And is there any good way you know of to test whether my CWBVH builder works before spending the next couple of hours trying to port the shader code? Thank you!

Pjbomb2 avatar Nov 22 '21 19:11 Pjbomb2

Ah I see, the float4 thing is the same as what we discussed before. When uploading a BVHNode8 to the GPU we are essentially just uploading 80 bytes per node. Note that this is equivalent to 5 * 16 bytes, i.e. 5 float4s (I actually statically assert this here: https://github.com/jan-van-bergen/GPU-Pathtracer/blob/ae6eeaa93d9ee52d8c048aa2c12c8719d5255026/Src/BVH/BVH.h#L79).

The reason for using float4s here is that they allow us to make full use of the memory bandwidth the GPU provides.

I use __ldg to load the CWBVH nodes from global memory. This means the memory is loaded via the texture cache, which results in a significant performance improvement. If you want to replicate this in HLSL you should store the CWBVH nodes in a texture. That automatically implies you will need to load the nodes as 5 float4s (RGBA) at a time as well.
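On the CUDA side, the per-node loads end up looking roughly like this; the variable and function names below are assumptions for illustration, not the exact ones from CWBVH.h:

// cwbvh_nodes is assumed to be a device pointer to float4, with 5 float4s per node.
__device__ const float4 * cwbvh_nodes;

__device__ void load_cwbvh_node(int node_index, float4 & node_0, float4 & node_1, float4 & node_2, float4 & node_3, float4 & node_4) {
	const float4 * node = cwbvh_nodes + node_index * 5;

	// __ldg routes the loads through the read-only (texture) cache.
	node_0 = __ldg(node + 0);
	node_1 = __ldg(node + 1);
	node_2 = __ldg(node + 2);
	node_3 = __ldg(node + 3);
	node_4 = __ldg(node + 4);
}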

jan-van-bergen avatar Nov 22 '21 20:11 jan-van-bergen

OK, but how does that get converted from this struct:

struct BVHNode8 {
	Vector3 p;
	byte e[3];
	byte imask;

	unsigned base_index_child;
	unsigned base_index_triangle;

	byte meta[8] = { };

	byte quantized_min_x[8] = { }, quantized_max_x[8] = { };
	byte quantized_min_y[8] = { }, quantized_max_y[8] = { };
	byte quantized_min_z[8] = { }, quantized_max_z[8] = { };

	inline bool is_leaf(int child_index) {
		return (meta[child_index] & 0b00011111) < 24;
	}
};

to node_0 through node_4? Sorry for my confusion.

Pjbomb2 avatar Nov 22 '21 20:11 Pjbomb2

There is no explicit conversion. It's just a reinterpretation of the same bag of 80 bytes, in this case for the purpose of more efficient memory loads.

As you can see here, the data in node_0 to node_4 has not changed at all and is equivalent to the data in the BVHNode8 struct: https://github.com/jan-van-bergen/GPU-Pathtracer/blob/ae6eeaa93d9ee52d8c048aa2c12c8719d5255026/CUDA_Source/Raytracing/CWBVH.h#L27 node_0 contains three floats in its x, y, z components (this is the Vector3 p from BVHNode8); its w component is reinterpreted as a uint whose first three bytes contain e[3] and whose last byte contains imask.

Perhaps an annotated version of the struct is more clear:

struct CWBVHNode {
	float4 node_0; // xyz contains p, w contains e[3] and imask
	float4 node_1; // x contains base_index_child, y contains base_index_triangle, zw contains meta[8]
	float4 node_2; // xy contains quantized_min_x[8], zw contains quantized_max_x[8]
	float4 node_3; // xy contains quantized_min_y[8], zw contains quantized_max_y[8]
	float4 node_4; // xy contains quantized_min_z[8], zw contains quantized_max_z[8]
};
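For example, unpacking e[3] and imask from node_0's w component could look roughly like this. This is a CUDA-style sketch with illustrative names, assuming the usual little-endian byte order; in HLSL, asuint plays the role of __float_as_uint:

__device__ void unpack_node_0_w(float4 node_0, unsigned e[3], unsigned & imask) {
	unsigned w = __float_as_uint(node_0.w);

	e[0]  = (w >>  0) & 0xff; // per-axis exponent used to dequantize the child AABBs (x)
	e[1]  = (w >>  8) & 0xff; // (y)
	e[2]  = (w >> 16) & 0xff; // (z)
	imask = (w >> 24) & 0xff; // bitmask marking which of the 8 children are internal nodes
}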

jan-van-bergen avatar Nov 23 '21 10:11 jan-van-bergen

That is a lot clearer, thank you! Sorry, I don't really know how to work with byte-level things like this, so a non-explicit conversion was confusing to me, but this clears it up a lot. Thank you!

Pjbomb2 avatar Nov 23 '21 18:11 Pjbomb2

OK, I'm finally at the point of testing this, thank you so much! Though may I ask how you went about debugging this? Currently it traverses infinitely, and I have no concept of how to even begin to debug it or find what's wrong, due to the extensive use of byte and bit functions (I don't even know whether I'm converting the data between structs properly, or whether my BVH is being built properly), so I'm wondering how you went about it. Thank you!

Pjbomb2 avatar Nov 23 '21 19:11 Pjbomb2

Debugging the CWBVH was actually a bit of a pain, since my Maxwell GPU does not support Nsight. If your GPU supports it, I highly recommend you use it.

With stuff like this I recommend you start simple and expand iteratively. So you could start by testing a single triangle, for example. Also, you should ignore stuff like triangle postponing at first, since it's just an optimization and you can add it later once everything works.

I think the way I went about it was to identify some simple failure case (such as a simple mesh with 8 triangles or whatever) and then compare the observed ray traversal to the expected ray traversal.

I used a function like this to print out the tree, so you can more easily visualize its structure on the CPU:

void print_cwbvh(const BVH & bvh, int node_index = 0) {
	const BVHNode8 & node = bvh.nodes_8[node_index];

	for (int i = 0; i < 8; i++) {
		bool node_is_leaf = (node.meta[i] & 0b11111) < 24;
		if (node_is_leaf) {
			int first_triangle = node.meta[i] & 0b11111;

			for (int j = 0; j < __popc(node.meta[i] >> 5); j++) {
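				// __popc is just a population count (number of set bits); on the CPU this
				// can be std::popcount or a compiler builtin such as __builtin_popcount.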
				printf("Node %i - Triangle: %i\n", node_index, node.base_index_triangle + first_triangle + j);
			}
		} else {
			int child_offset = node.meta[i] & 0b11111;
			int child_index  = node.base_index_child + child_offset - 24;

			printf("Node %i - Child %i:\n", node_index, child_index);

			print_cwbvh(bvh, child_index);
		}
	}

	printf("\n");
}

PS: one last random thing, make sure you use byte instead of char in C# (if you use it at all). The size of a char is actually not a single byte in C# (it is two bytes, a UTF-16 code unit); that's something I've been bitten by before when uploading from the CPU to the GPU.

jan-van-bergen avatar Nov 24 '21 19:11 jan-van-bergen

Ah, OK, thank you! I ended up replacing all the bytes/chars with uints in the HLSL code, and with just bytes in C#. Also, while my PC does support Nvidia Nsight, I never found out how to really use it other than for basic performance things, and I couldn't get it to run on your program (at least on a slightly older version of it). Thank you though, I'll try these!

Pjbomb2 avatar Nov 24 '21 23:11 Pjbomb2

OK, trying to run your code for printing out the CWBVH: where would I run it? The popc would require CUDA, but can CUDA output stuff to the console? I've been trying to run everything in the main function thus far, but clearly that wouldn't work with popc. Thanks! Also, since I don't know how the traversal works (bitwise things are confusing for me, as I've said), I don't know what the expected traversal is or how it should be traversing.

Pjbomb2 avatar Nov 25 '21 21:11 Pjbomb2

No, that code is meant for the CPU; popc is just a bit counting function. In C# it looks like it's available as BitOperations.PopCount.

With expected traversal I just mean firing a specific ray at a specific triangle, for example.

jan-van-bergen avatar Nov 25 '21 21:11 jan-van-bergen

Ah, OK, thank you! The issue I was having was with integrating it into your code (so I can compare BVHs). Also, I think I got the output you provided working in my project, but I don't know what any of the output information means, or what it should look like.

Pjbomb2 avatar Nov 25 '21 21:11 Pjbomb2

I have related questions.

  1. How do you actually perform the intersection with a two-level BVH? I understand how it's done on a CPU (i.e., you just intersect the primitives contained in the box, in this case a bottom-level BVH), but how do you do it here?

  2. If you were to add other primitives (spheres, etc., something other than meshes), how would the design change?

ib00 avatar Feb 16 '22 09:02 ib00

Hey, so here is roughly how the two-level BVH traversal is implemented:

  • A variable tlas_stack_size is used to indicate whether we are in the TLAS or not.
  • If a leaf node is hit we check the value of tlas_stack_size. https://github.com/jan-van-bergen/GPU-Raytracer/blob/1247418f168534249f46daf2f22b93d4af951d3f/Src/CUDA/Raytracing/BVH8.h#L200
    • If tlas_stack_size is INVALID we are in the TLAS. This means the "triangle" index is used as a mesh index instead. The ray is transformed according to the mesh transform and traversal is continued at the root of the Mesh's BLAS. The value of tlas_stack_size is now set to the current size of the traversal stack.
    • If tlas_stack_size is any other value we are in the BLAS. This performs the usual triangle intersection.
  • Finally, the value of tlas_stack_size is used to determine when traversal of a BLAS is finished and we should revert to TLAS traversal (see the rough sketch after this list). https://github.com/jan-van-bergen/GPU-Raytracer/blob/1247418f168534249f46daf2f22b93d4af951d3f/Src/CUDA/Raytracing/BVH8.h#L254
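In pseudocode (the same loose style as the traversal sketches above, so names like stack_size and world_space_ray are simplifications, not the actual BVH8.h variables), the switch looks roughly like this:

int tlas_stack_size = INVALID; // INVALID means: currently traversing the TLAS

while (/* traversal not finished */) {
	if (/* current node is a leaf */) {
		if (tlas_stack_size == INVALID) {
			// TLAS leaf: the "triangle" index is really a Mesh index. Remember where
			// the BLAS part of the stack begins, transform the ray into the Mesh's
			// local space, and continue traversal at the root of the Mesh's BLAS.
			tlas_stack_size = stack_size;
			ray = transform_ray(ray, mesh.transform_inv);
			current_node = mesh.blas_root;
			continue;
		} else {
			// BLAS leaf: the usual ray/triangle intersection.
		}
	}

	// When the stack shrinks back down to tlas_stack_size, the BLAS is finished:
	// restore the world-space ray and resume TLAS traversal.
	if (stack_size == tlas_stack_size) {
		tlas_stack_size = INVALID;
		ray = world_space_ray;
	}
}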

As for question 2, the design wouldn't have to change much on the traversal side. Each mesh would need to store which primitive type it is, and based on that you can index into the triangle array vs. the sphere array. I don't think light sampling would necessarily need to change, although it would probably be a good idea to sample sphere lights differently, since more efficient sampling methods exist than simply area-based sampling.
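A minimal sketch of what that per-mesh dispatch could look like; this is purely illustrative, with hypothetical names, and not part of the actual codebase:

enum struct PrimitiveType { Triangles, Spheres };

struct MeshInfo {
	PrimitiveType primitive_type;
	int           primitive_offset; // where this Mesh's primitives start in the global array
};

struct Ray; // stand-in for the real ray type

void intersect_triangles(const Ray & ray, int first, int count); // assumed to exist
void intersect_spheres  (const Ray & ray, int first, int count); // assumed to exist

// Inside BLAS leaf handling, dispatch on the Mesh's primitive type.
void intersect_blas_leaf(const MeshInfo & mesh, const Ray & ray, int leaf_first, int leaf_count) {
	switch (mesh.primitive_type) {
		case PrimitiveType::Triangles: intersect_triangles(ray, mesh.primitive_offset + leaf_first, leaf_count); break;
		case PrimitiveType::Spheres:   intersect_spheres  (ray, mesh.primitive_offset + leaf_first, leaf_count); break;
	}
}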

jan-van-bergen avatar Feb 17 '22 09:02 jan-van-bergen

Thanks. The TLAS/BLAS traversal makes sense.

Support for arbitrary primitives seems much harder. You can probably add 1-2 primitive types, but to be more general (like what pbrt-v4 does), it seems much more complex. Vulkan/OptiX/DX12 do it through shader binding tables, which is complex.

Not sure if it would be cleaner or easier in your framework. At the moment you call 'triangle_intersect', but that would have to change based on the primitive type you have (sphere, bilinear patch, curve, etc.).

ib00 avatar Feb 17 '22 10:02 ib00