Add API for non-caching-load

Open ranapratap55 opened this issue 1 year ago • 1 comments

RCCL provides "low latency" protocols for communication between agents, where the entire message, consisting of data and flags, is packed into a single L2 cache line. This is usually accomplished using relaxed atomic instructions in LLVM. However, the 128-byte version of this protocol (LL-128) requires 128-bit load and store instructions that bypass the cache and are not split into multiple smaller instructions. The nontemporal builtin is not always suitable for this use case.
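To make the "data and flags in one indivisible access" idea concrete, here is a minimal host-side sketch, assuming a simplified 8-byte slot that holds 32 bits of data and a 32-bit flag. The names `LLSlot`, `ll_store`, and `ll_try_load` are hypothetical, and the layout is illustrative only; it is not RCCL's actual LL or LL-128 line format.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical simplified slot: low 32 bits carry data, high 32 bits carry the flag.
// Assumption for illustration only; RCCL's real line layout differs.
struct LLSlot {
    std::atomic<std::uint64_t> word{0};
};

// Writer: publish data and flag together with a single relaxed 64-bit store,
// so a reader can never observe the flag without the matching data.
inline void ll_store(LLSlot& slot, std::uint32_t data, std::uint32_t flag) {
    assert(flag != 0 && "flag 0 means 'empty' in this sketch");
    const std::uint64_t packed = (static_cast<std::uint64_t>(flag) << 32) | data;
    slot.word.store(packed, std::memory_order_relaxed);
}

// Reader: a single relaxed 64-bit load. The data is accepted only when the
// flag half matches the value the reader is waiting for.
inline bool ll_try_load(const LLSlot& slot, std::uint32_t expected_flag,
                        std::uint32_t& data_out) {
    const std::uint64_t packed = slot.word.load(std::memory_order_relaxed);
    if (static_cast<std::uint32_t>(packed >> 32) != expected_flag) {
        return false;  // message not yet written (or belongs to another step)
    }
    data_out = static_cast<std::uint32_t>(packed);
    return true;
}
```

The sketch only works because the 64-bit access is never split. The same property is what LL-128 needs from its 128-bit loads and stores, which is why an instruction that may be broken up is not acceptable there.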

The proposed approach is to provide a C++ function template that encapsulates an inline assembly call. This asm is intended to use the appropriate load/store parameters for each combination of data size and architecture.
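The shape of such a template could look roughly like the sketch below. Everything here is an assumption for illustration: the name `non_caching_load`, the set of accepted sizes, and the helper type `U128` are hypothetical, and the per-architecture inline assembly is deliberately omitted because the correct instruction for each size and ISA is exactly what needs to be verified. The fallback shown is an ordinary load and does not bypass any cache.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical 16-byte payload type, standing in for a 128-bit LL-128 access.
struct alignas(16) U128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Hypothetical wrapper: one entry point, constrained at compile time to the
// access widths the proposal talks about (1, 2, 4, 8, or 16 bytes).
template <typename T>
inline T non_caching_load(const T* src) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "payload must be trivially copyable");
    static_assert(sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4 ||
                      sizeof(T) == 8 || sizeof(T) == 16,
                  "unsupported access width");
    assert(reinterpret_cast<std::uintptr_t>(src) % alignof(T) == 0);

    // A device build would select inline asm here, per data size and per
    // target architecture. That code is intentionally not shown.

    // Fallback used in this sketch: a plain (cached) load of sizeof(T) bytes.
    T out;
    std::memcpy(&out, src, sizeof(T));
    return out;
}
```

A caller would write `U128 v = non_caching_load(&line);` and never see the architecture-specific instruction. Keeping the size check in a `static_assert` means an unsupported width fails at compile time rather than silently falling back to a split access.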

ranapratap55 Aug 09 '24 09:08

Have you actually verified that the byte and two-byte load instructions you are using exist on the ISAs where you expect them to exist? If you have, perhaps you want to check again, carefully, whether they exist on GFX9 and GFX10? Has this been tested at all? Should it not have some unit tests in tow?

AlexVlx Aug 09 '24 14:08

Updated the patch with byte, 2-byte load and added test cases.

ranapratap55 Oct 08 '24 05:10

ping.

ranapratap55 Oct 15 '24 06:10

Can you try to integrate this into https://github.com/ROCm/rccl/blob/develop/tools/p2p-latency-test/ll_latency_test.cpp?

wenkaidu Oct 15 '24 15:10