godot icon indicating copy to clipboard operation
godot copied to clipboard

Enhance SpinLock

Open RandomShaper opened this issue 2 years ago • 4 comments

As advised by https://rigtorp.se/spinlock/ and other resources, a good spinlock should do the following at least (and the following points are what this PR adds):

  • Use the CPU pause instruction while looping, to inform the hardware we are spinning so it does the right thing. Depending on architectures, this may improve power efficiency, for instance (better usage of phone/laptop battery). Not only that, but hyper-threaded cores can also work better since one of the logical hardware threads knows the other isn't doing anything particularly useful while the latter is just spinning around.
  • If the first attempt to lock fails, instead of blindlessly retrying, loop with relaxed memory ordering until it seems the lock is available again and, only then, retry with acquire semantics. This avoids wasteful sync operations that may improve performanve.

As a disclaimer, I haven't benchmarked this for either performance nor power usage. Maybe the Production team can lend a hand.

RandomShaper avatar Nov 21 '23 09:11 RandomShaper

Use the CPU pause instruction while looping, to inform the hardware we are spinning so it does the right thing. Depending on architectures, this may improve power efficiency, for instance (better usage of phone/laptop battery). Not only that, but hyper-threaded cores can also work better since one of the logical hardware threads knows the other isn't doing anything particularly useful while the latter is just spinning around.

Is the AMD Zen MWAITX instruction worth using in this case? It's been reported to result in significant power savings which can benefit the Steam Deck. This would need detecting whether the instruction is available though.

Note that if this is implemented, this will need to be tested by someone else as I don't have the hardware to test it.

Calinou avatar Nov 21 '23 15:11 Calinou

Is the AMD Zen MWAITX instruction worth using in this case?

I didn't know about it, to be honest. However, judging by the spec, it seems that MONITORX-MWAITX make for an extremely efficient hardware-supported spinlock. I asked an AMD engineer about our spinlock, by the way, and, weirdly enough, he wouldn't mention these instructions.

Now I'm wondering if having to branch to the MWAITX-based implementation or the classic one, plus having both inlined would kill the benefits somehow. Hard to tell without measuring.

RandomShaper avatar Nov 21 '23 15:11 RandomShaper

I'd like to add my two cents saying that I'm eager to see this class being improved as the current version can really be dangerous for performance. I suggest this excellent read that goes a bit deeper into Spinlock implementations and shows how the one used by Godot can easily make threads freeze for a few ms: https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/

Note that despite the improvements, Mutex is still a better choice on modern hardware and systems in most cases, even for short locks.

CedNaru avatar Apr 16 '24 17:04 CedNaru

~I've removed the arch-dependant pieces and used std::this_thread::yield instead. That should bring the appropriate semantics to all the supported platforms already.~

RandomShaper avatar Oct 21 '24 11:10 RandomShaper

I'm bumping this because I've added benchmark results to the description.

RandomShaper avatar Oct 23 '24 13:10 RandomShaper

Going back to the CPU pause approach, once realized thread yield is not appropriate for this. I've also applied @bruvzg's suggestions.

In the future we may want to get rid of spinlocks altogether, but for now let's at least have a better one.

Other CPU-specific instructions such as the mwaitxx mentioned by @Calinou or ARM's wfe could make for a subsequent iteration of this.

RandomShaper avatar Nov 04 '24 10:11 RandomShaper

Thanks!

Repiteo avatar Nov 10 '24 18:11 Repiteo

This changeset causes Godot to crash on startup on Linux when using builds compiled with AVX2 ISAs enabled (AMD Excavator, Intel Haswell or newer)

git checkout 01ad56da38fa10949ff7878d4d6a5d6c195eff39
scons target=editor ccflags="-march=x86-64-v3"
./bin/godot.linuxbsd.editor.x86_64

This occurred on an AMD Ryzen 9 7900X.

Edit 2: This seems to be a cacheline issue. Adding alignas(CACHE_LINE_SIZE) to every usage of spinline makes it segfault in a different place

Stacktrace 1 (stock)

* thread #1, name = 'godot.linuxbsd.editor.x86_64', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
  * frame #0: RendererSceneCull::RendererSceneCull(this=<unavailable>) at self_list.h:165:25
    frame #1: RenderingServerDefault::_init(this=<unavailable>) at rendering_server_default.cpp:215:26
    frame #2: Main::setup2(p_show_boot_logo=<unavailable>) at main.cpp:3106:25
    frame #3: Main::setup(execpath=<unavailable>, argc=0, argv=<unavailable>, p_second_phase=true) at main.cpp:2648:20
    frame #4: main(argc=1, argv=<unavailable>) at godot_linuxbsd.cpp:74:25

Stacktrace 2 (inline everywhere)

* thread #1, name = 'godot.linuxbsd.editor.x86_64', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
  * frame #0: WorkerThreadPool::WorkerThreadPool() [inlined] SelfList<WorkerThreadPool::Task>::List::List(this=<unavailable>) at self_list.h:165:25
    frame #1: WorkerThreadPool::WorkerThreadPool(this=<unavailable>) at worker_thread_pool.cpp:835:36
    frame #2: register_core_types() at register_core_types.cpp:301:23
    frame #3: Main::setup(execpath=<unavailable>, argc=0, argv=<unavailable>, p_second_phase=true) at main.cpp:952:21
    frame #4: main(argc=1, argv=<unavailable>) at godot_linuxbsd.cpp:74:25

The segfault occurs when zeroing the atomic lock data which it does by moving a zero from the ymm register.

neptuwunium avatar Nov 11 '24 05:11 neptuwunium

@RandomShaper AMD stated in their optimization guide that the compiler can automatically compile std_mutex into the mwaitx hardware instructions. https://gpuopen.com/gdc-presentations/2023/GDC-2023-AMD-Ryzen-Processor-Software-Optimization.pdf

jams3223 avatar Nov 11 '24 11:11 jams3223

As a note, that's not guaranteed. On POSIX it will be delegated to pthread_mutex_{lock,unlock}, which calls lock cmpxchg instead of mwaitx, however from what I understand depending on architecture microcode it may end up being identical over a longer period of time. (Especially with kernel-level thread schedulers.)

Edit: Looking over the assembly, this seems to do lock cmpxchg anyway due to the atomic set/clear. It might just be simpler and safer to use a mutex.

neptuwunium avatar Nov 11 '24 16:11 neptuwunium

Removing the alignas(Thread::CACHE_LINE_BYTES) line fixes the crashes on AVX2 compile targets.

__attribute__((aligned(Thread::CACHE_LINE_BYTES))) strangely, does as well. MSVC equivalent would be __declspec(align(Thread::CACHE_LINE_BYTES))

Either this attribute is ignored or alignas does not behave as expected.

Edit: Reading up on the alignas keyword, it apparently cannot exceed alignof(max_align_t) which is 16. Cache lines are defined to be 64 bytes on most x86 systems. So any alignas falls back to 16 (which is standard for x86-64-v1) however needs 32 and 64 for AVX2 and AVX512 respectively. __attribute__((aligned())) and __declspec(align()) function differently as they enforce alignment regardless of any other factors.

Edit 2: 😅 now it's segfaulting in AVX-512 CPU targets, resolved by bumping the alignment to 128. The particular sensitivity to cache line size is frustrating.

Edit 3: It still crashes, just at random points now. This is very unstable on x86-64-v3 and v4.

neptuwunium avatar Nov 11 '24 17:11 neptuwunium

@yretenai. thank you for your findings. Can you give me stack traces of the crashes? I'm wondering if there are cases where the SpinLock ends up misaligned for some reason. This constructor may help:

	SpinLock() {
		CRASH_COND(uintptr_t(this) % Thread::CACHE_LINE_BYTES != 0);
	}

RandomShaper avatar Nov 12 '24 12:11 RandomShaper

Thread::CACHE_LINE_BYTES = 64

Built with scons target=editor debug_symbols=on -j24

ERROR: FATAL: Condition "uintptr_t(this) % Thread::CACHE_LINE_BYTES != 0" is true.
   at: SpinLock (./core/os/spin_lock.h:93)

  * frame #0: 0x000055555a3cc8fd CallQueue::CallQueue(PagedAllocator<CallQueue::Page, true, 4096u>*, unsigned int, String const&) [inlined] SpinLock::SpinLock(this=0x000055555d7951b0) at spin_lock.h:93:3
    frame #1: 0x000055555a3cc8d3 CallQueue::CallQueue(PagedAllocator<CallQueue::Page, true, 4096u>*, unsigned int, String const&) [inlined] SpinLock::SpinLock(this=0x000055555d7951b0) at spin_lock.h:92:2
    frame #2: 0x000055555a3cc8d3 CallQueue::CallQueue(PagedAllocator<CallQueue::Page, true, 4096u>*, unsigned int, String const&) [inlined] PagedAllocator<CallQueue::Page, true, 4096u>::PagedAllocator(p_page_size=16, this=0x000055555d795170) at paged_allocator.h:159:59
    frame #3: 0x000055555a3cc8c0 CallQueue::CallQueue(this=0x000055555d794dc0, p_custom_allocator=<unavailable>, p_max_pages=8192, p_error_text=0x00007fffffffd310) at message_queue.cpp:468:15
    frame #4: 0x000055555a3ccad1 MessageQueue::MessageQueue(this=0x000055555d794dc0) at message_queue.cpp:509:113
    frame #5: 0x0000555555caba3e Main::setup(execpath=<unavailable>, argc=0, argv=0x00007fffffffdc80, p_second_phase=true) at main.cpp:2634:18
    frame #6: 0x0000555555bc83ff main(argc=1, argv=0x00007fffffffdc78) at godot_linuxbsd.cpp:74:25

Built with scons target=editor ccflags="-march=x86-64-v3" debug_symbols=on -j24

ERROR: FATAL: Condition "uintptr_t(this) % Thread::CACHE_LINE_BYTES != 0" is true.
   at: SpinLock (./core/os/spin_lock.h:93)

  * frame #0: 0x000055555a48d795 WorkerThreadPool::WorkerThreadPool() [inlined] SpinLock::SpinLock(this=0x000055555d689160) at spin_lock.h:93:3
    frame #1: 0x000055555a48d76b WorkerThreadPool::WorkerThreadPool() [inlined] SpinLock::SpinLock(this=<unavailable>) at spin_lock.h:92:2
    frame #2: 0x000055555a48d76b WorkerThreadPool::WorkerThreadPool() [inlined] PagedAllocator<WorkerThreadPool::Task, false, 1024u>::PagedAllocator(p_page_size=<unavailable>, this=<unavailable>) at paged_allocator.h:159:59
    frame #3: 0x000055555a48d76b WorkerThreadPool::WorkerThreadPool(this=0x000055555d688fa0) at worker_thread_pool.cpp:835:36
    frame #4: 0x0000555559df39ec register_core_types() at register_core_types.cpp:301:23
    frame #5: 0x0000555555c9db0c Main::setup(execpath=<unavailable>, argc=0, argv=0x00007fffffffdc80, p_second_phase=true) at main.cpp:952:21
    frame #6: 0x0000555555bc724f main(argc=1, argv=0x00007fffffffdc78) at godot_linuxbsd.cpp:74:25

Reminder: the code works without alignment. The compiler will attempt to ensure std::atomic is aligned to the required alignment by introducing padding structures (which it seems to fail to do when forced alignment is introduced,) but this may not per-se be aligned to the cache line as demonstrated here:

ERROR: FATAL: Condition "uintptr_t(this) % Thread::CACHE_LINE_BYTES != 0" is true.
   at: SpinLock (./core/os/spin_lock.h:93)

  * frame #0: 0x0000555555bf666d _GLOBAL__sub_I__ZN7Variant5Pools13_bucket_smallE() [inlined] SpinLock::SpinLock(this=0x0000000007cb1844) at spin_lock.h:93:3
    frame #1: 0x0000555555bf6663 _GLOBAL__sub_I__ZN7Variant5Pools13_bucket_smallE() [inlined] SpinLock::SpinLock(this=0x0000000007cb1844) at spin_lock.h:92:2
    frame #2: 0x0000555555bf665c _GLOBAL__sub_I__ZN7Variant5Pools13_bucket_smallE() [inlined] PagedAllocator<Variant::Pools::BucketSmall, true, 4096u>::PagedAllocator(p_page_size=4096, this=0x0000000007cb1820) at paged_allocator.h:159:59
    frame #3: 0x0000555555bf663e _GLOBAL__sub_I__ZN7Variant5Pools13_bucket_smallE() [inlined] __static_initialization_and_destruction_0 at variant.cpp:41:67
    frame #4: 0x0000555555bf663e _GLOBAL__sub_I__ZN7Variant5Pools13_bucket_smallE() at variant.cpp:3698:1

Also: Enforcing alignment with __attribute__((aligned(Thread::CACHE_LINE_BYTES))) does not stop the assertion. Having run it a bunch of times now, due to ASLR sometimes it magically aligns everything and runs fine.

neptuwunium avatar Nov 12 '24 19:11 neptuwunium