
Multithreading and Optimization Discussion

Open TheCherno opened this issue 2 years ago • 50 comments

Hey all! For those following along with the series on YouTube, I hope you've been enjoying it thus far! In the latest episode (Ep 11), we introduced multithreading into our code base by using std::for_each with the parallel execution policy. I mentioned that if the community has any other suggestions, and wants to do some testing to see if we can multithread in a more efficient way, I'd open a GitHub issue for this discussion - and here we are!

I figured crowd-sourcing this would be a good idea since your mileage may vary - certain techniques might be better or worse on certain hardware and architectures. Feel free to fork this repository and implement something, and then drop it in a comment here so we can test. If your method is faster than our std::for_each method, make sure to include some profiling data and your hardware specifications.

Thanks all! ❤️

  • Cherno

TheCherno avatar Dec 10 '22 10:12 TheCherno

@TheCherno Hi, I don't know whether you already know it, but you should definitely check out and play around with OpenMP (it would be a great topic for a future video too).

Although my personal experience with it is very limited, it is widely used in the industry and it makes it easy to parallelize (and also vectorize) code using just one or more #pragmas above for loops and other parts of code.

It's really surprising how magical and simple this API is, considering how verbose things in C++ usually are compared to other languages.
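
For anyone who hasn't used it, here's a minimal standalone sketch of that (illustrative only, not this repo's code). Note that #pragma omp simd needs OpenMP 4.0+, so on MSVC it needs /openmp:experimental rather than the plain /openmp (2.0) switch:

// Build with -fopenmp (GCC/Clang) or /openmp (MSVC); without OpenMP the pragmas are simply ignored.
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const int width = 1600, height = 900;
    std::vector<float> image((size_t)width * height);

    #pragma omp parallel for            // split the rows across threads
    for (int y = 0; y < height; y++)
    {
        #pragma omp simd                // ask the compiler to vectorize the inner loop
        for (int x = 0; x < width; x++)
            image[(size_t)y * width + x] = (float)(x * y) * 0.001f; // placeholder per-pixel work
    }

    std::printf("%f\n", image.back());
    return 0;
}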

cvb941 avatar Dec 10 '22 11:12 cvb941

@TheCherno I checked out the newest commit and tried to run it on x64 Linux with clang as the compiler. Aside from some difficulties compiling under Linux, std::for_each() only runs on 1 thread by default. Apparently, for multithreading to work with libstdc++, one needs to install Intel TBB and link against it with -ltbb (Docs). With TBB installed and linked, it still runs fairly slowly (Ryzen 7 3800X, default camera angle, 1600x900 -> 40-50 ms). CPU usage hovers around 90%, so not perfect. I'll try to implement a manual thread-pool-style approach to see if that is any better.
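
For anyone else building on Linux, here's a self-contained repro of the pattern (not the repo's code) showing the libstdc++/TBB requirement mentioned above:

// g++ -std=c++17 -O2 par_rows.cpp -ltbb   (clang++ with libstdc++ needs -ltbb as well)
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main()
{
    const uint32_t width = 1600, height = 900;

    std::vector<uint32_t> rows(height);
    std::iota(rows.begin(), rows.end(), 0u);        // 0, 1, ..., height-1

    std::vector<float> rowSum(height, 0.0f);
    std::for_each(std::execution::par, rows.begin(), rows.end(), [&](uint32_t y) {
        for (uint32_t x = 0; x < width; x++)        // stand-in for the per-pixel work
            rowSum[y] += (float)((x * 31u + y) & 0xFF);
    });

    std::printf("%f\n", rowSum[0]);
    return 0;
}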

scorpion-26 avatar Dec 10 '22 11:12 scorpion-26

After enabling OpenMP in Project properties > C/C++ > Language, I added the #pragma omp parallel for directive to the outer loop:

#pragma omp parallel for
for (int y = 0; y < m_FinalImage->GetHeight(); y++)
{
    for (uint32_t x = 0; x < m_FinalImage->GetWidth(); x++)
    {
        glm::vec4 color = PerPixel(x, y);
        m_AccumulationData[x + y * m_FinalImage->GetWidth()] += color;
        ...
    }
}

This gets me from 65 ms single-threaded down to 35 ms. However, the for_each implementation is still faster, at 28 ms.

The difference comes from OpenMP's default static schedule, which splits the loop into equal chunks, one per CPU thread, so the threads that render the sky finish faster than the ones rendering the ball. Adding schedule(dynamic, 1) distributes the work row by row instead, making it behave similarly to for_each and bringing the render time down to 28 ms.

#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < m_FinalImage->GetHeight(); y++)
{
    for (uint32_t x = 0; x < m_FinalImage->GetWidth(); x++)
    {
        glm::vec4 color = PerPixel(x, y);
        m_AccumulationData[x + y * m_FinalImage->GetWidth()] += color;
        ...
    }
}

Code is in my fork at https://github.com/cvb941/RayTracing

cvb941 avatar Dec 10 '22 12:12 cvb941

A speedup from 65 ms to 28 ms isn't much on an 8-core machine. The profiler shows a lot of time being spent in the Random::Vec3 function, where the random number generator probably synchronizes across threads, causing a bottleneck.

Just using a simple rand() in Random::Float() reduces the frame time from 28 ms to 10 ms. In theory this can be reduced further with an even faster random number generator (returning a constant achieves 7 ms).

static float Float()
{
    return static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
    // return (float)s_Distribution(s_RandomEngine) / (float)std::numeric_limits<uint32_t>::max();
}

cvb941 avatar Dec 10 '22 14:12 cvb941

oof

i mean, do i need to say anything? :D

was pretty confused about 8 physical cores only getting a 2x speedup. (i'm getting around 3x on my i7-8750h -- 180ms -> 65ms -- both of which seem high, but that's beside the point)

for this kind of simple parallelization of a compute-heavy task, i'd expect a speedup of roughly 90% * #physical-cores. (hyperthreading doesn't tend to help much when you're compute bound, and there are always some losses, hence the 90%)

but yeah, get some thread local RNGs going :D

leddoo avatar Dec 10 '22 14:12 leddoo

For fast non-cryptographically-secure PRNGs I usually go with the xoshiro family of generators, since they are robust enough for my applications and very fast. I've often implemented one myself, but a wrapper like Xoshiro-cpp seems quite practical.
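
In case it saves someone a lookup, here's a hedged sketch of a thread-local xoshiro128** (the algorithm itself is public domain, by Blackman and Vigna); the wrapper names below are made up, and seeding from std::random_device is kept deliberately simple:

#include <cstdint>
#include <random>

// Illustrative generator, not the Walnut API.
struct Xoshiro128StarStar
{
    uint32_t s[4];

    Xoshiro128StarStar()
    {
        std::random_device rd;
        for (uint32_t& w : s) w = rd();
        if ((s[0] | s[1] | s[2] | s[3]) == 0) s[0] = 1;   // state must never be all zero
    }

    static uint32_t Rotl(uint32_t x, int k) { return (x << k) | (x >> (32 - k)); }

    uint32_t Next()
    {
        const uint32_t result = Rotl(s[1] * 5, 7) * 9;    // the ** output scrambler
        const uint32_t t = s[1] << 9;
        s[2] ^= s[0];
        s[3] ^= s[1];
        s[1] ^= s[2];
        s[0] ^= s[3];
        s[2] ^= t;
        s[3] = Rotl(s[3], 11);
        return result;
    }

    float NextFloat() { return (Next() >> 8) * (1.0f / 16777216.0f); }   // uniform in [0, 1)
};

// One generator per thread, so nothing is shared on the hot path.
inline float FastRandomFloat()
{
    thread_local Xoshiro128StarStar rng;
    return rng.NextFloat();
}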

HildingLinden avatar Dec 10 '22 14:12 HildingLinden

Adding the thread_local keyword to s_RandomEngine and s_Distribution in Walnut::Random in both the header and cpp files significantly improved the performance on my machine. The last render time went from 25-26 ms to 8-9 ms on my laptop with an AMD 5800H. I tried to push a new branch so I could make a pull request in Walnut, but I didn't have permission.
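
For anyone making the same change, here's a standalone approximation of what it ends up looking like (this mirrors the shape of Walnut::Random but is not the actual Walnut source; note that thread_local has to appear on both the in-class declaration and the out-of-class definition):

#include <cstdint>
#include <limits>
#include <random>

class Random
{
public:
    static float Float()
    {
        return (float)s_Distribution(s_RandomEngine) /
               (float)std::numeric_limits<uint32_t>::max();
    }

private:
    // thread_local: each thread gets its own engine and distribution, so no shared state.
    static thread_local std::mt19937 s_RandomEngine;
    static thread_local std::uniform_int_distribution<std::mt19937::result_type> s_Distribution;
};

// Definitions (normally in the .cpp) must repeat thread_local; each thread's engine
// is seeded on first use here, which is a simplification.
thread_local std::mt19937 Random::s_RandomEngine{ std::random_device{}() };
thread_local std::uniform_int_distribution<std::mt19937::result_type> Random::s_Distribution;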

Update: @leddoo already covered it.

athieman avatar Dec 10 '22 16:12 athieman

With this code I get around 6-7 ms on my i5-12400F. I also use thread_local and xoshiro128** for the random class.

std::vector<std::future<void>> threads;
threads.reserve(m_ImageVerticalIter.size());

for (const uint32_t& y : m_ImageVerticalIter)
{
    auto async_thread = [this, y]() -> void
    {
        for (const uint32_t& x : m_ImageHorizontalIter)
        {
            glm::vec4 pixel = PerPixel(x, y);
            m_AccumulationData[x + y * m_FinalImage->GetWidth()] += pixel;

            glm::vec4 accumulatedColor = (m_AccumulationData[x + y * m_FinalImage->GetWidth()]) / (float)m_FrameIndex;

            accumulatedColor = glm::clamp(accumulatedColor, glm::vec4(0.0f), glm::vec4(1.0f));

            m_ImageData[x + y * m_FinalImage->GetWidth()] = Utils::ConvertToRGBA(accumulatedColor);
        }
    };

    threads.emplace_back(std::async(std::launch::async, async_thread));
}

#define MT 0
#if MT
std::for_each(std::execution::par, threads.begin(), threads.end(), [](const std::future<void>& thread) {
    thread.wait();
});
#else
for (const std::future<void>& thread : threads)
    thread.wait();
#endif

xoshiro128** is around 1 ms faster than std::mt19937, and running the threads with std::async is around 0.5 ms faster than std::for_each. https://github.com/xLoge/RayTracing

xLoge avatar Dec 11 '22 15:12 xLoge

I tried making the RNG thread_local as @athieman described, and it worked quite well. My render times were consistently around 5-6 ms, and CPU usage was well below 100%, closer to 10%. It seems that once the RNG is thread-local, the bottleneck is no longer the CPU, though I don't know how to verify that or figure out what the new bottleneck is.

LilyIsTrans avatar Dec 16 '22 08:12 LilyIsTrans

@Rilazy The new "bottleneck" is not really a bottleneck - it's vsync, which is enabled by default in the ImGui renderer. After your CPU finishes calculating the new frame, it waits for the GPU vsync event. To test that, you can increase the viewport size (on my CPU the current scene renders slower than 60 fps in a 4K viewport), or disable vsync in ImGui by uncommenting the IMGUI_UNLIMITED_FRAME_RATE define in Application.cpp.
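
To make that easier to find: Walnut inherits this from the Dear ImGui Vulkan example, where the define ships commented out near the top of Application.cpp and, when enabled, lets the swapchain pick a mailbox/immediate present mode instead of the vsynced FIFO one (exact details may differ between versions):

// Application.cpp -- uncomment to stop the renderer from waiting on vsync
#define IMGUI_UNLIMITED_FRAME_RATE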

LeoMash avatar Dec 24 '22 00:12 LeoMash

Interesting, thanks!

LilyIsTrans avatar Dec 24 '22 01:12 LilyIsTrans

So what are you guys using for profiling? I set up Tracy, but it's not really usable due to the large number of threads being created and destroyed with the current architecture.
I am going to look into creating a thread pool with the number of threads matching the available cores. Does that make sense?
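
For reference, a generic sketch of that kind of fixed-size pool (illustrative only, not the code from any fork here: a job queue guarded by a mutex and condition variable, one worker per hardware thread). You'd still need something like a pending-row counter to know when a frame is finished before presenting it:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool
{
public:
    explicit ThreadPool(unsigned threadCount = std::thread::hardware_concurrency())
    {
        for (unsigned i = 0; i < threadCount; i++)
            m_Workers.emplace_back([this] { WorkerLoop(); });
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            m_Stop = true;
        }
        m_Cv.notify_all();
        for (std::thread& w : m_Workers)
            w.join();
    }

    void Submit(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            m_Jobs.push(std::move(job));
        }
        m_Cv.notify_one();
    }

private:
    void WorkerLoop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_Mutex);
                m_Cv.wait(lock, [this] { return m_Stop || !m_Jobs.empty(); });
                if (m_Stop && m_Jobs.empty())
                    return;
                job = std::move(m_Jobs.front());
                m_Jobs.pop();
            }
            job();   // run outside the lock
        }
    }

    std::vector<std::thread> m_Workers;
    std::queue<std::function<void()>> m_Jobs;
    std::mutex m_Mutex;
    std::condition_variable m_Cv;
    bool m_Stop = false;
};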

soerenfi avatar Dec 29 '22 19:12 soerenfi

I've just been using Visual Studio's built-in profiler when I need proper profiling, but mostly I'm looking at the frame times and seeing what they hover around.

LilyIsTrans avatar Dec 29 '22 21:12 LilyIsTrans

Alright, it's done here. I added a thread pool and a queue for the rendering jobs. Every line of the image is submitted to the queue individually. Here is how it looks in Tracy: (screenshot: image_2022-12-29_232710033)

Performance is worse than the std::for_each method, though. I was expecting some improvement from removing the overhead of thread creation and destruction.

soerenfi avatar Dec 29 '22 23:12 soerenfi

Have you implemented the optimization making random number generation thread-local? If not, it's possible your thread pool just isn't handling synchronization as well as the std::for_each method.

LilyIsTrans avatar Dec 29 '22 23:12 LilyIsTrans

@soerenfi Disregard my previous comment for now; it looks like it's the same thing I ran into with vsync, since your framerate looks to be locked at pretty much exactly 60 fps. What's happening is that ImGui is set up by default to wait for the next monitor refresh before it starts drawing the new image. LeoMash explained it (and ways to get around it) here when I ran into the same thing.

LilyIsTrans avatar Dec 30 '22 01:12 LilyIsTrans

@Rilazy Yeah, I saw that. For now I just compared CPU load between the approaches and varied the number of threads in the pool.

The profiler shows that the workers are idling some of the time, so that's probably the difference. I didn't expect to write a better scheduler than the OS's in one evening.

No point in optimising this further since the ray tracing algorithm itself is so inefficient right now. We could do the PerPixel() intersection calculations on the GPU using CUDA though, just as an exercise..

soerenfi avatar Dec 30 '22 09:12 soerenfi

Another option for moving work to the GPU would be the Vulkan ray tracing pipeline itself. I'm not super familiar with GPU programming, but reading up on the Vulkan API, it looks like we could still use implicit functions for our spheres by passing AABBs that enclose them, then doing the more expensive intersection testing in an intersection shader. From there, I think most of our existing code could be ported pretty much directly to shader code.
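
As a hedged sketch of just the first step of that idea (the Sphere struct below is simplified, and the acceleration-structure build and the intersection shader are omitted entirely):

#include <vector>
#include <vulkan/vulkan.h>
#include <glm/glm.hpp>

struct Sphere
{
    glm::vec3 Position{ 0.0f };
    float Radius = 0.5f;
};

// One axis-aligned bounding box per sphere; the exact hit is resolved later in the
// intersection shader.
std::vector<VkAabbPositionsKHR> BuildSphereAabbs(const std::vector<Sphere>& spheres)
{
    std::vector<VkAabbPositionsKHR> aabbs;
    aabbs.reserve(spheres.size());
    for (const Sphere& s : spheres)
    {
        VkAabbPositionsKHR box{};
        box.minX = s.Position.x - s.Radius;
        box.minY = s.Position.y - s.Radius;
        box.minZ = s.Position.z - s.Radius;
        box.maxX = s.Position.x + s.Radius;
        box.maxY = s.Position.y + s.Radius;
        box.maxZ = s.Position.z + s.Radius;
        aabbs.push_back(box);
    }
    return aabbs;
}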

I'd have tested it myself already if it weren't for the fact that I'm not familiar enough with graphics programming to have the first clue how to actually set up Vulkan to draw to the framebuffer of our viewport; I've only ever done the "Draw a triangle" Vulkan tutorial. I've been reading the docs, but there are a lot of docs.

Of course, that all does depend on a graphics driver that supports the Vulkan Ray Tracing Extensions.

LilyIsTrans avatar Dec 30 '22 21:12 LilyIsTrans

I believe that is what @TheCherno is aiming to do, and if so, I am looking forward to it very much. Since Vulkan takes care of building BVHs and (HW-)accelerating ray-triangle intersections, there is not much point in spending time to implement them on the CPU (still worth understanding them though).

Meanwhile you can check out the NVIDIA guide to Vulkan ray tracing (which uses NVIDIA HW-specific support libraries) and this amazing implementation of Peter Shirley's Ray Tracing in One Weekend using Vulkan ray tracing.

soerenfi avatar Dec 30 '22 22:12 soerenfi

Neat! I'll have to check that out.

LilyIsTrans avatar Dec 30 '22 22:12 LilyIsTrans

I'll also add this simpler path tracing example that I'm about to dive into.

soerenfi avatar Dec 31 '22 00:12 soerenfi

I stumbled onto your video, and I think there is a better way to use iterators with std::for_each. Basically, you could make your own iterator with all the appropriate operators (following the LegacyForwardIterator requirements) that simply increments an integer. This way you're not initializing a whole vector with N elements going from X to Y; instead you have two iterators, one representing the moving index and a second one representing the end, and thanks to the iterator's != operator, std::for_each knows when to stop.

template <typename T>
class ForEachIterator
{
public:
  ForEachIterator(const T &p_value) : _value(p_value) {}

  ForEachIterator<T> &operator++()
  {
    ++this->_value;
    return *this;
  }

  T &operator*()
  {
    return this->_value;
  }

  bool operator!=(const ForEachIterator<T> &p_it)
  {
    return this->_value != p_it._value;
  }

private:
  T _value;
};

And you can use it this way:

std::for_each(ForEachIterator<uint32_t>(0), ForEachIterator<uint32_t>(width), [](const uint32_t &i) {
  ...
});

mmerabet42 avatar Jan 11 '23 16:01 mmerabet42

@xLoge SHISHUA is faster than xoshiro256+x8.

WhoseTheNerd avatar Jan 20 '23 13:01 WhoseTheNerd

Has anyone tried Intel oneAPI TBB for the thread parallelization?

Paulo-D2000 avatar Jan 21 '23 06:01 Paulo-D2000

Initializing a vector of random unit normals gave me another 30-40% performance improvement on top of making the random number generator thread_local, as @TheCherno described in the corresponding YouTube episode.
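
In case it helps anyone reproduce this, a standalone sketch of the idea (names are illustrative): fill a table of rejection-sampled unit vectors once, then index into it instead of generating a fresh direction per bounce:

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>
#include <glm/glm.hpp>

class CachedUnitVectors
{
public:
    explicit CachedUnitVectors(size_t count = 1'000'000)
    {
        std::mt19937 rng{ std::random_device{}() };
        std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
        m_Directions.reserve(count);
        while (m_Directions.size() < count)
        {
            glm::vec3 v(dist(rng), dist(rng), dist(rng));   // rejection-sample the unit ball
            float len2 = glm::dot(v, v);
            if (len2 > 1e-4f && len2 <= 1.0f)
                m_Directions.push_back(v / std::sqrt(len2));
        }
    }

    // Each thread walks the table with its own cursor; a random per-thread start
    // offset would reduce correlation between threads.
    const glm::vec3& Next() const
    {
        thread_local uint32_t index = 0;
        index = (index + 1) % (uint32_t)m_Directions.size();
        return m_Directions[index];
    }

private:
    std::vector<glm::vec3> m_Directions;
};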

denizdiktas avatar Jan 31 '23 11:01 denizdiktas

OK, here is what I have done since: I added a timer that measures just the code that renders the image (basically a timer surrounding the block guarded by #define MT) and reports the average render time (accumulate the elapsed time and divide by the total number of frames so far). By caching the normals and measuring the performance this way, my render time dropped from about 30 ms to 24 ms on average. There are about 1 million cached random directions, so the image quality more or less remains the same (at least visually).
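
A minimal sketch of that kind of averaging timer (a std::chrono scope timer accumulated over frames; the names are illustrative):

#include <chrono>
#include <cstdint>

struct AverageFrameTimer
{
    double TotalMs = 0.0;
    uint64_t Frames = 0;

    template <typename F>
    void Measure(F&& renderFrame)
    {
        const auto start = std::chrono::high_resolution_clock::now();
        renderFrame();                                     // e.g. the block guarded by #define MT
        const auto end = std::chrono::high_resolution_clock::now();
        TotalMs += std::chrono::duration<double, std::milli>(end - start).count();
        Frames++;
    }

    double AverageMs() const { return Frames ? TotalMs / (double)Frames : 0.0; }
};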

denizdiktas avatar Feb 03 '23 07:02 denizdiktas

Regarding the different threading and parallelization methods: I think we should first measure the performance of the rendering code scheduled inside each thread, and of the overall threading method together with the rendering code. The difference will then tell us how much room there is for improvement. For this I did the following: leaving the code as it is, I measured the time taken for rendering only (basically, the code inside the lambda) and associated each measurement with its corresponding thread. A natural way to do this would be an associative container (std::map or std::unordered_map), but I didn't want to add the overhead of a lock around that map, so instead I used a preallocated array of float values to keep track of the total rendering time inside each thread. This eliminates the locking overhead, so it has minimal impact on the measurements.
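
A hedged sketch of that bookkeeping (each worker claims one array slot via an atomic counter the first time it runs, then accumulates into its own element, so the hot path needs no lock; the names and the thread cap are assumptions):

#include <array>
#include <atomic>
#include <chrono>
#include <cstddef>

constexpr std::size_t kMaxThreads = 64;                 // assumed upper bound on workers
std::array<double, kMaxThreads> g_ThreadRenderMs{};     // each slot is touched by one thread only

inline std::size_t ThisThreadSlot()
{
    static std::atomic<std::size_t> s_NextSlot{ 0 };
    thread_local std::size_t slot = s_NextSlot.fetch_add(1) % kMaxThreads;
    return slot;
}

// Wrap the per-row (or per-tile) render call with this:
template <typename F>
void TimeRenderWork(F&& renderWork)
{
    const auto start = std::chrono::high_resolution_clock::now();
    renderWork();
    const auto end = std::chrono::high_resolution_clock::now();
    g_ThreadRenderMs[ThisThreadSlot()] +=
        std::chrono::duration<double, std::milli>(end - start).count();
}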

Now why am I talking about threads here? @TheCherno uses for_each(std::execution::par, ...) anyway, right? The std implementation must be using some kind of thread pool in the background. I tried this in a separate project: I submitted a bunch of trivial jobs and looked at the unique thread ids. It turns out that the number of actual worker threads is less than the number of submitted jobs, and if some jobs take more time than others, the number of threads tends to increase. Applying this method to the ray tracing code, I found that there are about 8-9 threads (I have 8 cores) and there is very little difference between the time it takes just for rendering and the overall time for rendering + scheduling. Of course the average time in each thread varies, but in my case the variation was about 1 ms out of 22 ms. Also, the difference between the per-thread rendering times and the overall render time was about 1-2 ms (overall was about 23-24 ms). So I am not sure this is a significant overhead when using for_each(std::execution::par).

To be honest, I first implemented my own thread pool and quickly realized there are lots of improvements that could be made to it. Then I wondered whether it was worth it for this particular rendering problem (it is always a good idea to think about potential improvements regardless of the use case, of course; I am just talking in the context of this problem). That's when and how I thought of a method for finding out how much room there is for improvement. I would love to hear your thoughts, and let me know if something is not clear.

denizdiktas avatar Feb 06 '23 07:02 denizdiktas

@mmerabet42 Hi Mohammed, I saw your post on this page. It works for for_each() with 2 arguments but when I tried to use it with std::execution::par, it failed to compile. Would you happen to know how to solve it?

denizdiktas avatar Feb 07 '23 13:02 denizdiktas

I'm not sure, but I think it probably has to do with it being a forward iterator as opposed to a random-access iterator. Because the iterator doesn't provide random access, it must be iterated sequentially, since knowing the index of each pixel/column technically requires knowing the index of the previous one, as computed by a mutable, non-shareable object.
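
I haven't tried it against this exact codebase, but a hedged sketch of a counting iterator that advertises itself through the standard member typedefs (here as random access, which also lets the parallel overloads partition the range cheaply) would look roughly like this; strictly speaking an iterator over generated values bends the legacy requirements, but this is the shape counting iterators such as Boost's use in practice:

#include <cstddef>
#include <cstdint>
#include <iterator>

class IndexIterator
{
public:
    using iterator_category = std::random_access_iterator_tag;
    using value_type        = uint32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const uint32_t*;
    using reference         = const uint32_t&;

    IndexIterator() = default;
    explicit IndexIterator(uint32_t value) : m_Value(value) {}

    reference  operator*() const { return m_Value; }
    value_type operator[](difference_type n) const { return m_Value + (uint32_t)n; }

    IndexIterator& operator++()    { ++m_Value; return *this; }
    IndexIterator  operator++(int) { IndexIterator tmp = *this; ++m_Value; return tmp; }
    IndexIterator& operator--()    { --m_Value; return *this; }
    IndexIterator  operator--(int) { IndexIterator tmp = *this; --m_Value; return tmp; }

    IndexIterator& operator+=(difference_type n) { m_Value += (uint32_t)n; return *this; }
    IndexIterator& operator-=(difference_type n) { m_Value -= (uint32_t)n; return *this; }

    friend IndexIterator operator+(IndexIterator it, difference_type n) { return it += n; }
    friend IndexIterator operator+(difference_type n, IndexIterator it) { return it += n; }
    friend IndexIterator operator-(IndexIterator it, difference_type n) { return it -= n; }
    friend difference_type operator-(const IndexIterator& a, const IndexIterator& b)
    {
        return (difference_type)a.m_Value - (difference_type)b.m_Value;
    }

    friend bool operator==(const IndexIterator& a, const IndexIterator& b) { return a.m_Value == b.m_Value; }
    friend bool operator!=(const IndexIterator& a, const IndexIterator& b) { return !(a == b); }
    friend bool operator< (const IndexIterator& a, const IndexIterator& b) { return a.m_Value < b.m_Value; }
    friend bool operator> (const IndexIterator& a, const IndexIterator& b) { return b < a; }
    friend bool operator<=(const IndexIterator& a, const IndexIterator& b) { return !(b < a); }
    friend bool operator>=(const IndexIterator& a, const IndexIterator& b) { return !(a < b); }

private:
    uint32_t m_Value = 0;
};

Usage then stays the same as in the earlier snippet: std::for_each(std::execution::par, IndexIterator(0), IndexIterator(width), [](const uint32_t& i) { ... });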

LilyIsTrans avatar Feb 13 '23 18:02 LilyIsTrans

Hi @Rilazy, OK thanks, I'll look into it. I was trying to do a tiled-rendering version and didn't want to change the usage of for_each(execution::par), so I had to implement my own tile iterators; otherwise I would have to define a long list of tiles, and for_each() would need to touch each memory location, which seems inefficient. It is also a good exercise to implement such an iterator compatible with for_each, just for its own sake 🙂

denizdiktas avatar Feb 14 '23 04:02 denizdiktas