
Allreduce cpu example fails with CCL_WORKER_COUNT > 1

Open piotrchmiel opened this issue 1 year ago • 3 comments

I started experimenting with the allreduce example from the main repository: https://github.com/oneapi-src/oneCCL/blob/master/examples/cpu/cpu_allreduce_test.cpp

I modified it slightly by increasing the buffer size 100 times:

diff --git a/examples/cpu/cpu_allreduce_test.cpp b/examples/cpu/cpu_allreduce_test.cpp
index 6e9ac4d..5dfe2d9 100644
--- a/examples/cpu/cpu_allreduce_test.cpp
+++ b/examples/cpu/cpu_allreduce_test.cpp
@@ -22,7 +22,7 @@
 using namespace std;

 int main() {
-    const size_t count = 4096;
+    const size_t count = 4096*100;

     size_t i = 0;

When I run it with the CCL_WORKER_COUNT environment variable set to a value greater than 1, it fails with the following errors:

piotrc@machine:~/ws/oneCCL/build$ CCL_WORKER_COUNT=2 mpirun -np 2 examples/cpu/cpu_allreduce_test
[1705415958.879795729] machine:rank1.cpu_allreduce_test: Reading from remote process' memory failed. Disabling CMA support
[1705415958.879801821] machine:rank1.cpu_allreduce_test: Reading from remote process' memory failed. Disabling CMA support
machine:rank1: Assertion failure at psm3/ptl_am/ptl.c:196: nbytes == req->req_data.recv_msglen
machine:rank1: Assertion failure at psm3/ptl_am/ptl.c:196: nbytes == req->req_data.recv_msglen

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 559315 RUNNING AT gbnwp-pod023-1
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 559316 RUNNING AT gbnwp-pod023-1
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================

With CCL_WORKER_COUNT=1 it works perfectly.

piotrc@machine:~/ws/oneCCL/build$ mpirun -np 2 examples/cpu/cpu_allreduce_test
PASSED

What am I doing wrong? Why does it fail? Should I use specific flags when compiling, set a specific environment variable, or pass a specific option to mpirun? It is worth mentioning that with a smaller buffer size (for example 4096 * 10) everything works fine, even with CCL_WORKER_COUNT set to a value > 1.
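To put the working and failing buffer sizes in perspective, here is a rough calculation of the size of each buffer in bytes. It is only a sketch: it assumes 4-byte elements (an assumption about the example's buffer type, not something stated above), and buffer_bytes is a hypothetical helper name.

```shell
# Hypothetical helper: size in bytes of a buffer with <count> elements.
# Assumption: 4-byte elements unless a second argument says otherwise.
buffer_bytes() {
    count=$1
    elem_size=${2:-4}
    echo $((count * elem_size))
}

echo "count=4096      : $(buffer_bytes 4096) bytes"             # original example
echo "count=4096*10   : $(buffer_bytes $((4096 * 10))) bytes"   # reported as working
echo "count=4096*100  : $(buffer_bytes $((4096 * 100))) bytes"  # reported as failing
```

Under that assumption the failing case moves about 1.6 MB per buffer and the working case about 160 KB, which would be consistent with the failure only showing up above some message-size threshold.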

Attached CCL_LOG_LEVEL=info logs.txt
Attached CCL_LOG_LEVEL=debug logs_debug.txt

piotrchmiel avatar Jan 16 '24 14:01 piotrchmiel

Possible workaround:

FI_PROVIDER=verbs CCL_WORKER_COUNT=2 ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test
PASSED

FI_PROVIDER=tcp CCL_WORKER_COUNT=2 ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test
PASSED
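The provider comparison above can be repeated in one pass with a small loop. This is a minimal sketch: try_providers is a hypothetical helper, and the provider list (psm3, verbs, tcp) is simply the set of names mentioned in this thread, not a complete list of what libfabric may offer on a given machine.

```shell
# Hypothetical helper: run the given command once per provider with
# CCL_WORKER_COUNT=2 and report whether it exited successfully.
try_providers() {
    for p in psm3 verbs tcp; do
        if FI_PROVIDER=$p CCL_WORKER_COUNT=2 "$@" >/dev/null 2>&1; then
            echo "$p: ok"
        else
            echo "$p: failed"
        fi
    done
}

# Example invocation (paths as used in this thread; adjust to your build):
# try_providers ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test
```

The command output is discarded, so rerun a failing provider by hand to see the actual error messages.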

piotrchmiel avatar Jan 23 '24 15:01 piotrchmiel

@piotrchmiel Hi. fi_info should report that psm3 is available on your system. Could you run it and check whether you see that? https://github.com/oneapi-src/oneCCL/tree/master/deps/ofi/bin Also, could you describe how you compile oneCCL?

nikitaxgusev avatar Jan 24 '24 10:01 nikitaxgusev

@piotrchmiel, you can try this: echo 0 > /proc/sys/kernel/yama/ptrace_scope
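Before changing the setting, it may help to check its current value. The sketch below reads the Yama ptrace_scope file and reports whether cross-process memory access is restricted, which is presumably what the "Reading from remote process' memory failed. Disabling CMA support" message in the log relates to. check_ptrace_scope is a hypothetical helper, and the optional path argument exists only so it can be pointed at a different file.

```shell
# Hypothetical helper: report the Yama ptrace_scope setting.
# 0 means classic (unrestricted) ptrace permissions; higher values restrict
# which processes may attach to or read the memory of another process.
check_ptrace_scope() {
    file=${1:-/proc/sys/kernel/yama/ptrace_scope}
    if [ ! -r "$file" ]; then
        echo "unknown"          # Yama not enabled or file not readable
        return 0
    fi
    value=$(cat "$file")
    if [ "$value" = "0" ]; then
        echo "unrestricted"
    else
        echo "restricted ($value)"
    fi
}

check_ptrace_scope
```

Writing 0 to the file requires root privileges and lasts only until the next reboot.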

yao-matrix avatar Apr 28 '24 07:04 yao-matrix