
Memory issue when running Caffe across nodes

Open yuezhu1 opened this issue 7 years ago • 11 comments

Hi,

I ran into a problem when trying to run Caffe across distributed nodes; it works fine when I run multiple processes on the same node.

The errors are shown below:

node292: Unable to allocate shared memory for intra-node messaging.
node292: Delete stale shared memory files in /dev/shm.
node293: Unable to allocate shared memory for intra-node messaging.
node293: Delete stale shared memory files in /dev/shm.
node291: Unable to allocate shared memory for intra-node messaging.
node291: Delete stale shared memory files in /dev/shm.
node294: Unable to allocate shared memory for intra-node messaging.
node294: Delete stale shared memory files in /dev/shm.

Since I am on a cluster, I used mpirun to source the environment script on all nodes:

mpirun -n 4 -ppn 1 -f /path/to/caffe/hostfile bash /path/to/caffe/external/mlsl/l_mlsl_2018.0.003/intel64/bin/mlslvars.sh

export MLSL_ROOT=/path/to/caffe/external/mlsl/l_mlsl_2018.0.003

# training on 4 nodes, 1 process per node
mpirun -n 4 -ppn 1 -f /path/to/caffe/hostfile /path/to/caffe/build/tools/caffe train --solver /path/to/caffe/examples/cifar10/cifar10_quick_solver.prototxt
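
The hostfile passed via -f simply lists the target nodes, one hostname per line. A sketch, using the node names from the error messages above (adjust to your own cluster):

node291
node292
node293
node294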

MPI version: Intel(R) MPI Library for Linux* OS, Version 2018. Copyright (C) 2003-2017, Intel Corporation. All rights reserved.

I wonder if anyone has any ideas about how to fix this.

Thank you in advance, Yue

yuezhu1 avatar Jun 27 '18 00:06 yuezhu1

Could you check whether the shared memory folder /dev/shm exists on your nodes (291-294)? And what do you mean by running multiple processes on the same node? Something like "mpirun -n 4 -ppn 4 ..."?

Also, you don't need to set environment variables via mpirun. The mpirun command passes the environment of the node it is executed on to all of the nodes.
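
For example, something along these lines (a sketch using the paths from your post) should be enough: source the MLSL script on the launch node, then call mpirun directly:

source /path/to/caffe/external/mlsl/l_mlsl_2018.0.003/intel64/bin/mlslvars.sh
export MLSL_ROOT=/path/to/caffe/external/mlsl/l_mlsl_2018.0.003
# mpirun propagates this environment to the remote ranks
mpirun -n 4 -ppn 1 -f /path/to/caffe/hostfile /path/to/caffe/build/tools/caffe train --solver /path/to/caffe/examples/cifar10/cifar10_quick_solver.prototxt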

fzou1 avatar Jun 27 '18 00:06 fzou1

Thanks for your quick reply.

I checked the shared memory folder on the nodes. They do have the folder.

By multiple processes on the same node, I mean multiple ranks on the same node, just as you mentioned.

Thanks for pointing out the environment variable issue.

Thanks again, Yue

yuezhu1 avatar Jun 27 '18 15:06 yuezhu1

Could you also check whether you can create and delete files in /dev/shm on these nodes?

fzou1 avatar Jun 27 '18 16:06 fzou1

Thanks for your reply.

Below are the commands I used for creating and deleting files in /dev/shm:

srun -N 4 -n 4 touch /dev/shm/test.txt
srun -N 4 -n 4 echo "Test File!" > /dev/shm/test.txt
srun -N 4 -n 4 cat /dev/shm/test.txt
srun -N 4 -n 4 rm /dev/shm/test.txt

This is the output:

Test File!
Test File!
Test File!
Test File!

Therefore, I believe I do have write permission to /dev/shm.
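
As an extra sanity check (just a rough sketch, assuming mpirun reaches the same nodes listed in the hostfile; the file name mpi_write_test is only an example), the same write test could be run through mpirun instead of srun:

mpirun -n 4 -ppn 1 -f /path/to/caffe/hostfile touch /dev/shm/mpi_write_test
mpirun -n 4 -ppn 1 -f /path/to/caffe/hostfile df -h /dev/shm
mpirun -n 4 -ppn 1 -f /path/to/caffe/hostfile rm /dev/shm/mpi_write_test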

yuezhu1 avatar Jun 27 '18 17:06 yuezhu1

Did you run the srun commands above on nodes 291-294, or on other nodes in the cluster? Please check the /dev/shm folder on nodes 291-294. What is the size of the /dev/shm folder? We recommend allocating >=40GB for most topologies. Is there any file that cannot be deleted? Could you provide the full error log if possible?
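
For example, on each of the nodes (plain df and ls are enough; this is just an illustrative check):

df -h /dev/shm    # size of the tmpfs mount; >=40GB recommended
ls -l /dev/shm    # look for stale shared memory files left over from earlier runs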

fzou1 avatar Jun 27 '18 23:06 fzou1

Thanks for your reply.

Yes. All nodes have the /dev/shm folder.

The following is the information for the /dev/shm folder:

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           126G  6.6M  126G   1% /dev/shm

I compiled Caffe with DEBUG mode on, but there was no output from DLOG at first. So I added a few debug messages (via DLOG), starting from the first line of main() in caffe.cpp and also in init() in src/caffe/multinode/mlsl.cpp. I found that the program gets stuck at

MLSL::Environment::GetEnv().Init(argc, argv);

which is line 59.

So I enabled MPI debug mode and tried to gather more information. The MPI log is attached at the bottom.
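
In case it helps, this is roughly how I enabled it; the value matches the I_MPI_DEBUG=6 line that shows up in the log below:

export I_MPI_DEBUG=6    # Intel MPI debug verbosity; produces the startup/provider/pinning output below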

Command for running Caffe:

mpirun -n 2 -ppn 1 -f /path/to/caffe/hostfile /path/to/caffe/build/tools/caffe train --solver /path/to/caffe/examples/cifar10/cifar10_quick_solver.prototxt

Please feel free to let me know if more information is needed.

Thank you again, Yue

MPI log info:

WARNING: Logging before InitGoogleLogging() is written to STDERR I0627 18:07:25.807943 199966 caffe.cpp:725] main test 1 I0627 18:07:25.808219 199966 caffe.cpp:740] main test 2 I0627 18:07:25.809576 199966 caffe.cpp:742] main test 3 I0627 18:07:25.811481 199966 caffe.cpp:744] main test 4 I0627 18:07:25.811489 199966 caffe.cpp:750] main test 5 I0627 18:07:25.811492 199966 caffe.cpp:753] main test 5.1 I0627 18:07:25.811496 199966 mlsl.cpp:58] mn 1 WARNING: Logging before InitGoogleLogging() is written to STDERR I0627 18:07:25.978885 60548 caffe.cpp:725] main test 1 I0627 18:07:25.979142 60548 caffe.cpp:740] main test 2 I0627 18:07:25.980442 60548 caffe.cpp:742] main test 3 I0627 18:07:25.982179 60548 caffe.cpp:744] main test 4 I0627 18:07:25.982200 60548 caffe.cpp:750] main test 5 I0627 18:07:25.982204 60548 caffe.cpp:753] main test 5.1 I0627 18:07:25.982209 60548 mlsl.cpp:58] mn 1 [0] MPI startup(): Intel(R) MPI Library, Version 2018 Update 2 Build 20180125 (id: 18157) [0] MPI startup(): Copyright (C) 2003-2018 Intel Corporation. All rights reserved. [0] MPI startup(): Multi-threaded optimized library [0] MPID_nem_ofi_dump_providers(): Dumping Providers(first=0x55555586c3d0 (psm)): [0] MPID_nem_ofi_dump_providers(): psm [0] MPID_nem_ofi_dump_providers(): verbs [0] MPID_nem_ofi_dump_providers(): verbs [0] MPID_nem_ofi_dump_providers(): verbs;ofi_rxm [0] MPID_nem_ofi_dump_providers(): verbs;ofi_rxm [0] MPID_nem_ofi_dump_providers(): sockets [0] MPID_nem_ofi_dump_providers(): sockets [0] MPID_nem_ofi_dump_providers(): sockets [0] MPID_nem_ofi_dump_providers(): sockets [0] MPID_nem_ofi_init(): used OFI provider: psm [0] MPID_nem_ofi_init(): max_buffered_send 64 [0] MPID_nem_ofi_init(): max_msg_size 64 [0] MPID_nem_ofi_init(): rcd switchover 32768 [0] MPID_nem_ofi_init(): cq entries count 8 [0] MPID_nem_ofi_init(): MPID_REQUEST_PREALLOC 128 [1] MPI startup(): shm and ofi data transfer modes [0] MPI startup(): shm and ofi data transfer modes [0] MPI startup(): Device_reset_idx=11 [0] MPI startup(): Allgather: 3: 1-4 & 0-2 [0] MPI startup(): Allgather: 1: 5-2048 & 0-2 [0] MPI startup(): Allgather: 3: 0-2147483647 & 0-2 [0] MPI startup(): Allgather: 1: 0-3733 & 3-4 [0] MPI startup(): Allgather: 3: 3734-4194 & 3-4 [0] MPI startup(): Allgather: 1: 4195-25927 & 3-4 [0] MPI startup(): Allgather: 3: 25928-46799 & 3-4 [0] MPI startup(): Allgather: 1: 0-2147483647 & 3-4 [0] MPI startup(): Allgather: 2: 1-8 & 5-8 [0] MPI startup(): Allgather: 1: 9-32 & 5-8 [0] MPI startup(): Allgather: 2: 33-256 & 5-8 [0] MPI startup(): Allgather: 1: 0-2147483647 & 5-8 [0] MPI startup(): Allgather: 2: 1-8 & 9-16 [0] MPI startup(): Allgather: 1: 9-16 & 9-16 [0] MPI startup(): Allgather: 2: 17-64 & 9-16 [0] MPI startup(): Allgather: 1: 65-128 & 9-16 [0] MPI startup(): Allgather: 2: 129-512 & 9-16 [0] MPI startup(): Allgather: 1: 0-2147483647 & 9-16 [0] MPI startup(): Allgather: 2: 0-64 & 17-2147483647 [0] MPI startup(): Allgather: 1: 0-2147483647 & 17-2147483647 [0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-2 [0] MPI startup(): Allgatherv: 1: 0-1 & 3-4 [0] MPI startup(): Allgatherv: 2: 2-4 & 3-4 [0] MPI startup(): Allgatherv: 1: 5-8 & 3-4 [0] MPI startup(): Allgatherv: 2: 9-16 & 3-4 [0] MPI startup(): Allgatherv: 1: 17-32 & 3-4 [0] MPI startup(): Allgatherv: 2: 33-2048 & 3-4 [0] MPI startup(): Allgatherv: 3: 2049-4096 & 3-4 [0] MPI startup(): Allgatherv: 2: 4097-24207 & 3-4 [0] MPI startup(): Allgatherv: 3: 24208-77270 & 3-4 [0] MPI startup(): Allgatherv: 1: 77271-210263 & 3-4 [0] MPI startup(): Allgatherv: 3: 
0-2147483647 & 3-4 [0] MPI startup(): Allgatherv: 2: 0-4 & 5-8 [0] MPI startup(): Allgatherv: 4: 5-15 & 5-8 [0] MPI startup(): Allgatherv: 2: 16-132 & 5-8 [0] MPI startup(): Allgatherv: 4: 133-511 & 5-8 [0] MPI startup(): Allgatherv: 2: 512-4531 & 5-8 [0] MPI startup(): Allgatherv: 1: 4532-12951 & 5-8 [0] MPI startup(): Allgatherv: 3: 12952-53515 & 5-8 [0] MPI startup(): Allgatherv: 1: 53516-187447 & 5-8 [0] MPI startup(): Allgatherv: 3: 0-2147483647 & 5-8 [0] MPI startup(): Allgatherv: 2: 1-512 & 9-16 [0] MPI startup(): Allgatherv: 1: 513-65536 & 9-16 [0] MPI startup(): Allgatherv: 2: 65537-131072 & 9-16 [0] MPI startup(): Allgatherv: 1: 131073-578839 & 9-16 [0] MPI startup(): Allgatherv: 3: 0-2147483647 & 9-16 [0] MPI startup(): Allgatherv: 2: 1-65536 & 17-2147483647 [0] MPI startup(): Allgatherv: 1: 65537-218034 & 17-2147483647 [0] MPI startup(): Allgatherv: 3: 0-2147483647 & 17-2147483647 [0] MPI startup(): Allreduce: 1: 0-3 & 0-2 [0] MPI startup(): Allreduce: 7: 4-8 & 0-2 [0] MPI startup(): Allreduce: 1: 9-16 & 0-2 [0] MPI startup(): Allreduce: 7: 17-56 & 0-2 [0] MPI startup(): Allreduce: 1: 57-128 & 0-2 [0] MPI startup(): Allreduce: 7: 129-1024 & 0-2 [0] MPI startup(): Allreduce: 1: 1025-6168 & 0-2 [0] MPI startup(): Allreduce: 8: 6169-9639 & 0-2 [0] MPI startup(): Allreduce: 1: 9640-46814 & 0-2 [0] MPI startup(): Allreduce: 8: 46815-103629 & 0-2 [0] MPI startup(): Allreduce: 1: 103630-869643 & 0-2 [0] MPI startup(): Allreduce: 8: 869644-1048576 & 0-2 [0] MPI startup(): Allreduce: 2: 1048577-2097152 & 0-2 [0] MPI startup(): Allreduce: 7: 0-2147483647 & 0-2 [0] MPI startup(): Allreduce: 7: 0-16 & 3-4 [0] MPI startup(): Allreduce: 1: 17-32 & 3-4 [0] MPI startup(): Allreduce: 7: 33-2048 & 3-4 [0] MPI startup(): Allreduce: 1: 2049-4563 & 3-4 [0] MPI startup(): Allreduce: 2: 4564-10663 & 3-4 [0] MPI startup(): Allreduce: 8: 10664-16384 & 3-4 [0] MPI startup(): Allreduce: 1: 16385-36829 & 3-4 [0] MPI startup(): Allreduce: 6: 36830-120483 & 3-4 [0] MPI startup(): Allreduce: 8: 120484-139911 & 3-4 [0] MPI startup(): Allreduce: 6: 139912-262144 & 3-4 [0] MPI startup(): Allreduce: 7: 262145-952979 & 3-4 [0] MPI startup(): Allreduce: 4: 952980-1454861 & 3-4 [0] MPI startup(): Allreduce: 6: 0-2147483647 & 3-4 [0] MPI startup(): Allreduce: 1: 0-192 & 5-8 [0] MPI startup(): Allreduce: 4: 193-2048 & 5-8 [0] MPI startup(): Allreduce: 1: 2049-4538 & 5-8 [0] MPI startup(): Allreduce: 4: 4539-201699 & 5-8 [0] MPI startup(): Allreduce: 8: 201700-295993 & 5-8 [0] MPI startup(): Allreduce: 4: 0-2147483647 & 5-8 [0] MPI startup(): Allreduce: 1: 0-512 & 9-16 [0] MPI startup(): Allreduce: 6: 513-1024 & 9-16 [0] MPI startup(): Allreduce: 1: 1025-3337 & 9-16 [0] MPI startup(): Allreduce: 2: 3338-4096 & 9-16 [0] MPI startup(): Allreduce: 4: 4097-8192 & 9-16 [0] MPI startup(): Allreduce: 2: 8193-384947 & 9-16 [0] MPI startup(): Allreduce: 8: 384948-572637 & 9-16 [0] MPI startup(): Allreduce: 2: 572638-2097152 & 9-16 [0] MPI startup(): Allreduce: 4: 0-2147483647 & 9-16 [0] MPI startup(): Allreduce: 6: 0-4 & 17-2147483647 [0] MPI startup(): Allreduce: 4: 5-8 & 17-2147483647 [0] MPI startup(): Allreduce: 1: 9-128 & 17-2147483647 [0] MPI startup(): Allreduce: 6: 129-512 & 17-2147483647 [0] MPI startup(): Allreduce: 4: 513-1632 & 17-2147483647 [0] MPI startup(): Allreduce: 1: 1633-3150 & 17-2147483647 [0] MPI startup(): Allreduce: 2: 3151-8192 & 17-2147483647 [0] MPI startup(): Allreduce: 6: 8193-16384 & 17-2147483647 [0] MPI startup(): Allreduce: 4: 16385-32768 & 17-2147483647 [0] MPI startup(): Allreduce: 6: 
32769-65536 & 17-2147483647 [0] MPI startup(): Allreduce: 2: 65537-598520 & 17-2147483647 [0] MPI startup(): Allreduce: 6: 598521-1048576 & 17-2147483647 [0] MPI startup(): Allreduce: 4: 0-2147483647 & 17-2147483647 [0] MPI startup(): Alltoall: 3: 0-2147483647 & 0-2 [0] MPI startup(): Alltoall: 2: 0-330469 & 3-4 [0] MPI startup(): Alltoall: 3: 0-2147483647 & 3-4 [0] MPI startup(): Alltoall: 2: 0-503100 & 5-8 [0] MPI startup(): Alltoall: 3: 0-2147483647 & 5-8 [0] MPI startup(): Alltoall: 2: 1-326935 & 9-16 [0] MPI startup(): Alltoall: 3: 0-2147483647 & 9-16 [0] MPI startup(): Alltoall: 1: 0-39 & 17-2147483647 [0] MPI startup(): Alltoall: 2: 40-449014 & 17-2147483647 [0] MPI startup(): Alltoall: 3: 0-2147483647 & 17-2147483647 [0] MPI startup(): Alltoallv: 1: 0-2147483647 & 0-2147483647 [0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647 [0] MPI startup(): Barrier: 6: 0-2147483647 & 0-4 [0] MPI startup(): Barrier: 1: 0-2147483647 & 5-2147483647 [0] MPI startup(): Bcast: 1: 1-2 & 0-2 [0] MPI startup(): Bcast: 7: 3-256 & 0-2 [0] MPI startup(): Bcast: 1: 257-1411 & 0-2 [0] MPI startup(): Bcast: 3: 1412-2048 & 0-2 [0] MPI startup(): Bcast: 7: 2049-4096 & 0-2 [0] MPI startup(): Bcast: 1: 4097-16384 & 0-2 [0] MPI startup(): Bcast: 7: 16385-46594 & 0-2 [0] MPI startup(): Bcast: 3: 46595-94606 & 0-2 [0] MPI startup(): Bcast: 1: 94607-131072 & 0-2 [0] MPI startup(): Bcast: 7: 131073-262144 & 0-2 [0] MPI startup(): Bcast: 1: 0-2147483647 & 0-2 [0] MPI startup(): Bcast: 1: 1-2 & 3-4 [0] MPI startup(): Bcast: 7: 3-4 & 3-4 [0] MPI startup(): Bcast: 1: 5-8 & 3-4 [0] MPI startup(): Bcast: 7: 9-16 & 3-4 [0] MPI startup(): Bcast: 1: 17-2817 & 3-4 [0] MPI startup(): Bcast: 3: 2818-10574 & 3-4 [0] MPI startup(): Bcast: 1: 10575-40846 & 3-4 [0] MPI startup(): Bcast: 2: 40847-65536 & 3-4 [0] MPI startup(): Bcast: 3: 65537-154493 & 3-4 [0] MPI startup(): Bcast: 4: 154494-524288 & 3-4 [0] MPI startup(): Bcast: 3: 524289-2097152 & 3-4 [0] MPI startup(): Bcast: 2: 0-2147483647 & 3-4 [0] MPI startup(): Bcast: 1: 0-1024 & 5-8 [0] MPI startup(): Bcast: 7: 1025-3705 & 5-8 [0] MPI startup(): Bcast: 3: 3706-16384 & 5-8 [0] MPI startup(): Bcast: 7: 16385-356433 & 5-8 [0] MPI startup(): Bcast: 2: 0-2147483647 & 5-8 [0] MPI startup(): Bcast: 1: 0-2 & 9-16 [0] MPI startup(): Bcast: 7: 3-346314 & 9-16 [0] MPI startup(): Bcast: 2: 0-2147483647 & 9-16 [0] MPI startup(): Bcast: 7: 0-662700 & 17-2147483647 [0] MPI startup(): Bcast: 2: 0-2147483647 & 17-2147483647 [0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647 [0] MPI startup(): Gather: 3: 0-832 & 0-2 [0] MPI startup(): Gather: 1: 833-1024 & 0-2 [0] MPI startup(): Gather: 3: 1025-16384 & 0-2 [0] MPI startup(): Gather: 1: 16385-40960 & 0-2 [0] MPI startup(): Gather: 3: 0-2147483647 & 0-2 [0] MPI startup(): Gather: 3: 0-2147483647 & 3-4 [0] MPI startup(): Gather: 3: 0-19103 & 5-8 [0] MPI startup(): Gather: 2: 19104-32768 & 5-8 [0] MPI startup(): Gather: 3: 0-2147483647 & 5-8 [0] MPI startup(): Gather: 3: 0-2147483647 & 9-16 [0] MPI startup(): Gather: 3: 0-135094 & 17-2147483647 [0] MPI startup(): Gather: 2: 135095-465811 & 17-2147483647 [0] MPI startup(): Gather: 3: 465812-2137677 & 17-2147483647 [0] MPI startup(): Gather: 2: 0-2147483647 & 17-2147483647 [0] MPI startup(): Gatherv: 1: 0-2147483647 & 0-2147483647 [0] MPI startup(): Reduce_scatter: 4: 0-5 & 0-2 [0] MPI startup(): Reduce_scatter: 2: 6-974288 & 0-2 [0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 0-2 [0] MPI startup(): Reduce_scatter: 4: 0-4 & 3-4 [0] MPI startup(): Reduce_scatter: 1: 5-8 & 3-4 
[0] MPI startup(): Reduce_scatter: 3: 9-88 & 3-4 [0] MPI startup(): Reduce_scatter: 1: 89-214 & 3-4 [0] MPI startup(): Reduce_scatter: 3: 215-8192 & 3-4 [0] MPI startup(): Reduce_scatter: 2: 8193-16384 & 3-4 [0] MPI startup(): Reduce_scatter: 1: 16385-48037 & 3-4 [0] MPI startup(): Reduce_scatter: 3: 48038-80497 & 3-4 [0] MPI startup(): Reduce_scatter: 2: 80498-481614 & 3-4 [0] MPI startup(): Reduce_scatter: 5: 481615-1048576 & 3-4 [0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 3-4 [0] MPI startup(): Reduce_scatter: 4: 0-4 & 5-8 [0] MPI startup(): Reduce_scatter: 5: 5-8 & 5-8 [0] MPI startup(): Reduce_scatter: 1: 9-2048 & 5-8 [0] MPI startup(): Reduce_scatter: 3: 2049-16384 & 5-8 [0] MPI startup(): Reduce_scatter: 1: 16385-32768 & 5-8 [0] MPI startup(): Reduce_scatter: 3: 32769-74620 & 5-8 [0] MPI startup(): Reduce_scatter: 1: 74621-131471 & 5-8 [0] MPI startup(): Reduce_scatter: 2: 131472-524288 & 5-8 [0] MPI startup(): Reduce_scatter: 5: 524289-1048576 & 5-8 [0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 5-8 [0] MPI startup(): Reduce_scatter: 4: 1-4 & 9-16 [0] MPI startup(): Reduce_scatter: 1: 5-328 & 9-16 [0] MPI startup(): Reduce_scatter: 3: 329-512 & 9-16 [0] MPI startup(): Reduce_scatter: 1: 513-14699 & 9-16 [0] MPI startup(): Reduce_scatter: 3: 14700-65536 & 9-16 [0] MPI startup(): Reduce_scatter: 1: 65537-131072 & 9-16 [0] MPI startup(): Reduce_scatter: 2: 131073-262144 & 9-16 [0] MPI startup(): Reduce_scatter: 5: 262145-1048576 & 9-16 [0] MPI startup(): Reduce_scatter: 2: 1048577-2097152 & 9-16 [0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 9-16 [0] MPI startup(): Reduce_scatter: 4: 1-4 & 17-2147483647 [0] MPI startup(): Reduce_scatter: 1: 5-16384 & 17-2147483647 [0] MPI startup(): Reduce_scatter: 3: 16385-262144 & 17-2147483647 [0] MPI startup(): Reduce_scatter: 2: 262145-2097152 & 17-2147483647 [0] MPI startup(): Reduce_scatter: 5: 0-2147483647 & 17-2147483647 [0] MPI startup(): Reduce: 1: 0-10151 & 0-2 [0] MPI startup(): Reduce: 2: 10152-48743 & 0-2 [0] MPI startup(): Reduce: 5: 48744-97542 & 0-2 [0] MPI startup(): Reduce: 2: 97543-160324 & 0-2 [0] MPI startup(): Reduce: 1: 160325-438895 & 0-2 [0] MPI startup(): Reduce: 2: 0-2147483647 & 0-2 [0] MPI startup(): Reduce: 1: 1-3198 & 3-4 [0] MPI startup(): Reduce: 2: 3199-6826 & 3-4 [0] MPI startup(): Reduce: 5: 6827-8884 & 3-4 [0] MPI startup(): Reduce: 2: 8885-37418 & 3-4 [0] MPI startup(): Reduce: 4: 37419-85330 & 3-4 [0] MPI startup(): Reduce: 1: 85331-2663119 & 3-4 [0] MPI startup(): Reduce: 5: 0-2147483647 & 3-4 [0] MPI startup(): Reduce: 1: 1-7108 & 5-8 [0] MPI startup(): Reduce: 4: 7109-9688 & 5-8 [0] MPI startup(): Reduce: 1: 9689-3402466 & 5-8 [0] MPI startup(): Reduce: 3: 0-2147483647 & 5-8 [0] MPI startup(): Reduce: 1: 0-2147483647 & 9-2147483647 [0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647 [0] MPI startup(): Scatter: 1: 1-16384 & 0-2 [0] MPI startup(): Scatter: 3: 0-2147483647 & 0-2 [0] MPI startup(): Scatter: 3: 1-32768 & 3-4 [0] MPI startup(): Scatter: 2: 32769-131072 & 3-4 [0] MPI startup(): Scatter: 3: 131073-524288 & 3-4 [0] MPI startup(): Scatter: 2: 524289-2097152 & 3-4 [0] MPI startup(): Scatter: 3: 0-2147483647 & 3-4 [0] MPI startup(): Scatter: 3: 0-2147483647 & 5-16 [0] MPI startup(): Scatter: 3: 1-1934166 & 17-2147483647 [0] MPI startup(): Scatter: 2: 1934167-2165819 & 17-2147483647 [0] MPI startup(): Scatter: 3: 0-2147483647 & 17-2147483647 [0] MPI startup(): Scatterv: 1: 0-2147483647 & 0-2147483647 [0] MPI startup(): Rank Pid Node name Pin cpu [0] MPI startup(): 0 199966 
node119 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23} [0] MPI startup(): 1 60548 node120 {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23} [0] MPI startup(): Recognition=2 Platform(code=16 ippn=0 dev=17) Fabric(intra=1 inter=7 flags=0x0) [1] MPI startup(): Recognition=2 Platform(code=16 ippn=0 dev=17) Fabric(intra=1 inter=7 flags=0x0) [0] MPI startup(): I_MPI_COLL_INTRANODE=pt2pt [0] MPI startup(): I_MPI_DEBUG=6 [0] MPI startup(): I_MPI_FABRICS=shm:ofi [0] MPI startup(): I_MPI_HYDRA_UUID=060d0300-c981-9c5a-a96f-05007077c0a8 [0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=qib0:0,qib1:1 [0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2 [0] MPI startup(): I_MPI_PIN_MAPPING=1:0 0 [0] MPI startup(): Intel(R) MPI Library, Version 2018 Build 20170713 (id: 17594) [0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation. All rights reserved. [0] MPI startup(): Multi-threaded optimized library [0] MPID_nem_ofi_dump_providers(): Dumping Providers(first=0x64db20 (psm)): [0] MPID_nem_ofi_dump_providers(): psm [0] MPID_nem_ofi_dump_providers(): verbs [0] MPID_nem_ofi_dump_providers(): verbs [0] MPID_nem_ofi_dump_providers(): verbs;ofi_rxm [0] MPID_nem_ofi_dump_providers(): verbs;ofi_rxm [0] MPID_nem_ofi_dump_providers(): verbs;ofi_rxm [0] MPID_nem_ofi_dump_providers(): verbs;ofi_rxm [0] MPID_nem_ofi_dump_providers(): sockets [0] MPID_nem_ofi_dump_providers(): sockets [0] MPID_nem_ofi_dump_providers(): sockets [0] MPID_nem_ofi_dump_providers(): sockets [0] MPID_nem_ofi_init(): used OFI provider: psm [0] MPID_nem_ofi_init(): max_buffered_send 64 [0] MPID_nem_ofi_init(): max_msg_size 64 [0] MPID_nem_ofi_init(): rcd switchover 32768 [0] MPID_nem_ofi_init(): cq entries count 8 [0] MPID_nem_ofi_init(): MPID_REQUEST_PREALLOC 128 node120: Unable to allocate shared memory for intra-node messaging. node120: Delete stale shared memory files in /dev/shm. node119: Unable to allocate shared memory for intra-node messaging. node119: Delete stale shared memory files in /dev/shm.

yuezhu1 avatar Jun 28 '18 01:06 yuezhu1

Please ensure you can log in to the nodes via password-less ssh (e.g., ssh node119). This check is included in scripts/run_intelcaffe.sh. You can run your case with the script like this:

./scripts/run_intelcaffe.sh --hostfile /path/to/caffe/hostfile --solver /path/to/caffe/examples/cifar10/cifar10_quick_solver.prototxt
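
If password-less ssh is not set up yet, something like the following usually works (node names and key type are only examples; adjust to your cluster):

ssh-keygen -t rsa            # generate a key pair if you do not have one
ssh-copy-id node119          # copy the public key to each node in the hostfile
ssh-copy-id node120
ssh node119 hostname         # should print the node name without prompting for a password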

fzou1 avatar Jun 28 '18 01:06 fzou1

Many thanks for all your kind help. I think the problem was caused by redundant hostfile parameters when running MPI on the cluster.

Right now, my problem has been solved. You can close this issue.

Thank you again!

yuezhu1 avatar Jun 28 '18 18:06 yuezhu1

I believe that if you remove "-f hostfile" from the command, you will get 4 processes running on one node. You need to check whether there is a caffe process on each node.
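
A quick way to verify (just an illustrative check, not specific to Caffe) is to launch a trivial command the same way and see which nodes answer:

mpirun -n 4 -ppn 1 -f /path/to/caffe/hostfile hostname    # should print 4 different node names
# or, while training is running, on each node:
pgrep -a caffe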

fzou1 avatar Jun 29 '18 02:06 fzou1

Many thanks for the reminder. You are right: I only got 4 processes on one node. We don't need a hostfile with OpenMPI when running on multiple nodes, but a hostfile seems to be needed for Intel MPI. I guess the problem comes from MLSL initialization. I downloaded the MLSL library and tried some of its built-in test programs, and hit the same error (cannot allocate memory for intra-node messaging) when I ran a command like "mpirun -n 4 -ppn 1 -f hostfile ./mlsl_test 1 1 1 0". I wonder if you have any clues about this error.

Thank you!

yuezhu1 avatar Jun 29 '18 15:06 yuezhu1

I suggest you go ahead and open a new issue on MLSL, since this looks more like a failure of MLSL initialization caused by an environment issue.

manofmountain avatar Jul 06 '18 03:07 manofmountain