Use InfiniBand
I ran into some problems with my current configuration.
I1205 17:02:12.401198 7160 layer_factory.hpp:77] Creating layer data
I1205 17:02:12.401211 7160 net.cpp:99] Creating Layer data
I1205 17:02:12.401216 7160 net.cpp:407] data -> data
I1205 17:02:12.401224 7160 net.cpp:407] data -> label
I1205 17:02:12.401321 7160 net.cpp:149] Setting up data
I1205 17:02:12.401330 7160 net.cpp:156] Top shape: 100 1 28 28 (78400)
I1205 17:02:12.401335 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401337 7160 net.cpp:164] Memory required for data: 314000
I1205 17:02:12.401341 7160 layer_factory.hpp:77] Creating layer label_data_1_split
I1205 17:02:12.401347 7160 net.cpp:99] Creating Layer label_data_1_split
I1205 17:02:12.401351 7160 net.cpp:433] label_data_1_split <- label
I1205 17:02:12.401356 7160 net.cpp:407] label_data_1_split -> label_data_1_split_0
I1205 17:02:12.401362 7160 net.cpp:407] label_data_1_split -> label_data_1_split_1
I1205 17:02:12.401396 7160 net.cpp:149] Setting up label_data_1_split
I1205 17:02:12.401402 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401407 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401409 7160 net.cpp:164] Memory required for data: 314800
I1205 17:02:12.401412 7160 layer_factory.hpp:77] Creating layer conv1
I1205 17:02:12.401422 7160 net.cpp:99] Creating Layer conv1
I1205 17:02:12.401425 7160 net.cpp:433] conv1 <- data
I1205 17:02:12.401430 7160 net.cpp:407] conv1 -> conv1
I1205 17:02:12.402066 7160 net.cpp:149] Setting up conv1
I1205 17:02:12.402081 7160 net.cpp:156] Top shape: 100 20 24 24 (1152000)
I1205 17:02:12.402084 7160 net.cpp:164] Memory required for data: 4922800
I1205 17:02:12.402097 7160 layer_factory.hpp:77] Creating layer pool1
I1205 17:02:12.402107 7160 net.cpp:99] Creating Layer pool1
I1205 17:02:12.402110 7160 net.cpp:433] pool1 <- conv1
I1205 17:02:12.402115 7160 net.cpp:407] pool1 -> pool1
I1205 17:02:12.402153 7160 net.cpp:149] Setting up pool1
I1205 17:02:12.402161 7160 net.cpp:156] Top shape: 100 20 12 12 (288000)
I1205 17:02:12.402164 7160 net.cpp:164] Memory required for data: 6074800
I1205 17:02:12.402168 7160 layer_factory.hpp:77] Creating layer conv2
I1205 17:02:12.402176 7160 net.cpp:99] Creating Layer conv2
I1205 17:02:12.402180 7160 net.cpp:433] conv2 <- pool1
I1205 17:02:12.402186 7160 net.cpp:407] conv2 -> conv2
I1205 17:02:12.403599 7160 net.cpp:149] Setting up conv2
I1205 17:02:12.403615 7160 net.cpp:156] Top shape: 100 50 8 8 (320000)
I1205 17:02:12.403620 7160 net.cpp:164] Memory required for data: 7354800
I1205 17:02:12.403630 7160 layer_factory.hpp:77] Creating layer pool2
I1205 17:02:12.403637 7160 net.cpp:99] Creating Layer pool2
I1205 17:02:12.403641 7160 net.cpp:433] pool2 <- conv2
I1205 17:02:12.403647 7160 net.cpp:407] pool2 -> pool2
I1205 17:02:12.403690 7160 net.cpp:149] Setting up pool2
I1205 17:02:12.403698 7160 net.cpp:156] Top shape: 100 50 4 4 (80000)
I1205 17:02:12.403702 7160 net.cpp:164] Memory required for data: 7674800
I1205 17:02:12.403705 7160 layer_factory.hpp:77] Creating layer ip1
I1205 17:02:12.403713 7160 net.cpp:99] Creating Layer ip1
I1205 17:02:12.403717 7160 net.cpp:433] ip1 <- pool2
I1205 17:02:12.403723 7160 net.cpp:407] ip1 -> ip1
I1205 17:02:12.406860 7160 net.cpp:149] Setting up ip1
I1205 17:02:12.406877 7160 net.cpp:156] Top shape: 100 500 (50000)
I1205 17:02:12.406879 7160 net.cpp:164] Memory required for data: 7874800
I1205 17:02:12.406890 7160 layer_factory.hpp:77] Creating layer relu1
I1205 17:02:12.406898 7160 net.cpp:99] Creating Layer relu1
I1205 17:02:12.406901 7160 net.cpp:433] relu1 <- ip1
I1205 17:02:12.406909 7160 net.cpp:394] relu1 -> ip1 (in-place)
I1205 17:02:12.407634 7160 net.cpp:149] Setting up relu1
I1205 17:02:12.407649 7160 net.cpp:156] Top shape: 100 500 (50000)
I1205 17:02:12.407654 7160 net.cpp:164] Memory required for data: 8074800
I1205 17:02:12.407657 7160 layer_factory.hpp:77] Creating layer ip2
I1205 17:02:12.407667 7160 net.cpp:99] Creating Layer ip2
I1205 17:02:12.407672 7160 net.cpp:433] ip2 <- ip1
I1205 17:02:12.407680 7160 net.cpp:407] ip2 -> ip2
I1205 17:02:12.407815 7160 net.cpp:149] Setting up ip2
I1205 17:02:12.407825 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407829 7160 net.cpp:164] Memory required for data: 8078800
I1205 17:02:12.407835 7160 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I1205 17:02:12.407840 7160 net.cpp:99] Creating Layer ip2_ip2_0_split
I1205 17:02:12.407843 7160 net.cpp:433] ip2_ip2_0_split <- ip2
I1205 17:02:12.407848 7160 net.cpp:407] ip2_ip2_0_split -> ip2_ip2_0_split_0
I1205 17:02:12.407856 7160 net.cpp:407] ip2_ip2_0_split -> ip2_ip2_0_split_1
I1205 17:02:12.407891 7160 net.cpp:149] Setting up ip2_ip2_0_split
I1205 17:02:12.407898 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407902 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407904 7160 net.cpp:164] Memory required for data: 8086800
I1205 17:02:12.407908 7160 layer_factory.hpp:77] Creating layer accuracy
I1205 17:02:12.407917 7160 net.cpp:99] Creating Layer accuracy
I1205 17:02:12.407920 7160 net.cpp:433] accuracy <- ip2_ip2_0_split_0
I1205 17:02:12.407924 7160 net.cpp:433] accuracy <- label_data_1_split_0
I1205 17:02:12.407930 7160 net.cpp:407] accuracy -> accuracy
I1205 17:02:12.407939 7160 net.cpp:149] Setting up accuracy
I1205 17:02:12.407944 7160 net.cpp:156] Top shape: (1)
I1205 17:02:12.407948 7160 net.cpp:164] Memory required for data: 8086804
I1205 17:02:12.407950 7160 layer_factory.hpp:77] Creating layer loss
I1205 17:02:12.407954 7160 net.cpp:99] Creating Layer loss
I1205 17:02:12.407958 7160 net.cpp:433] loss <- ip2_ip2_0_split_1
I1205 17:02:12.407963 7160 net.cpp:433] loss <- label_data_1_split_1
I1205 17:02:12.407966 7160 net.cpp:407] loss -> loss
I1205 17:02:12.407972 7160 layer_factory.hpp:77] Creating layer loss
I1205 17:02:12.408217 7160 net.cpp:149] Setting up loss
I1205 17:02:12.408229 7160 net.cpp:156] Top shape: (1)
I1205 17:02:12.408233 7160 net.cpp:159] with loss weight 1
I1205 17:02:12.408239 7160 net.cpp:164] Memory required for data: 8086808
I1205 17:02:12.408243 7160 net.cpp:225] loss needs backward computation.
I1205 17:02:12.408248 7160 net.cpp:227] accuracy does not need backward computation.
I1205 17:02:12.408252 7160 net.cpp:225] ip2_ip2_0_split needs backward computation.
I1205 17:02:12.408255 7160 net.cpp:225] ip2 needs backward computation.
I1205 17:02:12.408258 7160 net.cpp:225] relu1 needs backward computation.
I1205 17:02:12.408262 7160 net.cpp:225] ip1 needs backward computation.
I1205 17:02:12.408263 7160 net.cpp:225] pool2 needs backward computation.
I1205 17:02:12.408267 7160 net.cpp:225] conv2 needs backward computation.
I1205 17:02:12.408270 7160 net.cpp:225] pool1 needs backward computation.
I1205 17:02:12.408272 7160 net.cpp:225] conv1 needs backward computation.
I1205 17:02:12.408277 7160 net.cpp:227] label_data_1_split does not need backward computation.
I1205 17:02:12.408279 7160 net.cpp:227] data does not need backward computation.
I1205 17:02:12.408282 7160 net.cpp:269] This network produces output accuracy
I1205 17:02:12.408288 7160 net.cpp:269] This network produces output loss
I1205 17:02:12.408299 7160 net.cpp:282] Network initialization done.
I1205 17:02:12.408339 7160 solver.cpp:60] Solver scaffolding done.
I1205 17:02:12.411540 7160 CaffeNet.cpp:240] RDMA adapter: mlx5_0
I1205 17:02:12.414819 7160 CaffeNet.cpp:388] 0-th RDMA addr: 01000000360100000899f800
I1205 17:02:12.414834 7160 CaffeNet.cpp:388] 1-th RDMA addr:
I1205 17:02:12.414849 7160 JniCaffeNet.cpp:145] 0-th local addr: 01000000360100000899f800
I1205 17:02:12.414856 7160 JniCaffeNet.cpp:145] 1-th local addr:
17/12/05 17:02:12 INFO executor.Executor: Finished task 1.0 in stage 2.0 (TID 5). 931 bytes result sent to driver
17/12/05 17:02:12 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
17/12/05 17:02:12 INFO executor.Executor: Running task 1.0 in stage 3.0 (TID 7)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1565.0 B, free 18.9 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 14 ms
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.6 KB, free 21.4 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 105.0 B, free 21.5 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 11 ms
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 392.0 B, free 21.9 KB)
I1205 17:02:12.636529 7160 common.cpp:61] 1-th string is NULL
F1205 17:02:12.639581 7160 rdma.cpp:327] Check failed: self_ Failed to register memory region.
The InfiniBand information is as follows:
omnisky@slave1:~/zzh/mnist$ ibstat
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.21.1000
    Hardware version: 0
    Node GUID: 0xec0d9a0300397dc2
    System image GUID: 0xec0d9a0300397dc2
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 2
        LMC: 0
        SM lid: 2
        Capability mask: 0x2651e84a
        Port GUID: 0xec0d9a0300397dc2
        Link layer: InfiniBand
I want to know how to make Spark use InfiniBand. Do I need to modify some configuration files, or change the InfiniBand configuration? Please help me.
From your ibstat log:
Port 1: State: Down
Your port is down. Please get a local expert to help you with your InfiniBand adapters and verify your connection is correct before you try CaffeOnSpark. Since everybody's setup is different, we don't have the bandwidth to troubleshoot your hardware settings.
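For anyone hitting this later: the `State:` line is the field that matters. A minimal sketch (not CaffeOnSpark code; `ibstatPortState` is a hypothetical helper) that pulls the logical port state out of captured `ibstat` text, so a launcher can refuse to start jobs when the fabric is down:

```cpp
#include <sstream>
#include <string>

// Pull the logical port state out of captured `ibstat` output.
// Matches the first "State:" field (case-sensitive, so the lower-case
// "Physical state:" line is not matched) and returns its trimmed value,
// or "" if the field is absent.
std::string ibstatPortState(const std::string& ibstatOutput) {
    std::istringstream in(ibstatOutput);
    std::string line;
    while (std::getline(in, line)) {
        std::size_t pos = line.find("State:");
        if (pos == std::string::npos) continue;
        std::string value = line.substr(pos + 6);  // text after "State:"
        std::size_t start = value.find_first_not_of(" \t");
        return start == std::string::npos ? "" : value.substr(start);
    }
    return "";
}
```

Feed it the output of `ibstat` (e.g. captured via `popen`); anything other than `Active` means RDMA traffic will not flow on that port.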
I met the same problem.
RDMABuffer::RDMABuffer(RDMAChannel* channel, uint8_t* addr, size_t size)
    : channel_(channel), addr_(addr), size_(size) {
  //*******************************************************
  // Case 1: with CPU memory, ibv_reg_mr() succeeds, but some later code does not work:
  // addr_ = reinterpret_cast<uint8_t*>(malloc(size));
  //
  // http://server01:8042/node/containerlogs/container_1512543960414_0001_01_000003/root/stderr/?start=0
  // F1206 02:14:43.892500 18704 math_functions.cu:79] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
  // *** Check failure stack trace: ***
  //
  // Case 2: with GPU memory, ibv_reg_mr() fails. Please help me:
  // CUDA_CHECK(cudaMalloc(&addr_, size));
  //
  // http://server01:8042/node/containerlogs/container_1512543960414_0001_01_000003/root/stderr/?start=0
  // F1205 17:02:12.639581 7160 rdma.cpp:327] Check failed: self_ Failed to register memory region.
  //*******************************************************
  self_ = ibv_reg_mr(channel_->adapter_.pd_, addr_, size,
                     IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
  CHECK(self_) << "Failed to register memory region";

  id_ = channel_->buffers_.size();
  channel_->buffers_.push_back(this);

  channel_->SendMR(self_, id_);
  peer_ = channel_->memory_regions_queue_.pop();
}
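Regarding case 2: as far as I understand, `ibv_reg_mr()` can only register a `cudaMalloc`'d pointer when GPUDirect RDMA is available, i.e. when Mellanox OFED and the `nv_peer_mem` kernel module are installed and loaded. A small stdlib-only sketch of that sanity check (`moduleListed` and `nvPeerMemLoaded` are hypothetical helper names, not CaffeOnSpark code):

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// True if the named kernel module appears in /proc/modules-style input.
// Each line of /proc/modules starts with "<module_name> <size> ...".
bool moduleListed(std::istream& procModules, const std::string& name) {
    std::string line;
    while (std::getline(procModules, line)) {
        if (line.compare(0, name.size(), name) == 0 &&
            (line.size() == name.size() || line[name.size()] == ' ')) {
            return true;
        }
    }
    return false;
}

// Convenience wrapper for the live system; returns false if /proc is absent.
bool nvPeerMemLoaded() {
    std::ifstream f("/proc/modules");
    return f && moduleListed(f, "nv_peer_mem");
}
```

Calling `nvPeerMemLoaded()` (or simply `lsmod | grep nv_peer_mem`) before attempting GPU-memory registration would at least distinguish "module missing" from other `ibv_reg_mr()` failure causes.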
//*******************************************************
root@5ec610095991:~/CaffeOnSpark/caffe-public# more Makefile.config
# Refer to http://caffe.berkeleyvision.org/installation.html
# Parallelization over InfiniBand or RoCE
INFINIBAND := 1
//*******************************************************
root@server01:/rt/data/alexNet2# ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              ec0d9a0300397dd2
//*******************************************************
root@server01:/rt/data/alexNet2# ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         12.21.1000
        node_guid:                      ec0d:9a03:0039:7dd2
        sys_image_guid:                 ec0d:9a03:0039:7dd2
        vendor_id:                      0x02c9
        vendor_part_id:                 4115
        hw_ver:                         0x0
        board_id:                       MT_2180110032
        phys_port_cnt:                  1
        Device ports:
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         1
                        port_lid:       2
                        port_lmc:       0x00
                        link_layer:     InfiniBand
//*******************************************************
root@5ec610095991:~/CaffeOnSpark/caffe-public# nvidia-smi
Wed Dec 6 07:34:09 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.69 Driver Version: 384.69 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 20% 33C P8 16W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:06:00.0 Off | N/A |
| 20% 36C P8 17W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:07:00.0 Off | N/A |
| 20% 33C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 20% 34C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:0C:00.0 Off | N/A |
| 20% 28C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:0D:00.0 Off | N/A |
| 20% 27C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:0E:00.0 Off | N/A |
| 20% 31C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:0F:00.0 Off | N/A |
| 20% 31C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
//*******************************************************
[root@server00 01_basic-client-server]# docker images
REPOSITORY              TAG         IMAGE ID       CREATED       SIZE
docker.io/nvidia/cuda   8.0-devel   7e0c5ccdc1eb   2 weeks ago   1.681 GB
//*******************************************************
Installed Mellanox OFED for Ubuntu on the host:
MLNX_OFED_LINUX-4.2-1.0.0.0-ubuntu16.04-x86_64.tgz
//*******************************************************
[root@server01 ~]# systemctl status nv_peer_mem
● nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem module to start at boot time.
   Loaded: loaded (/etc/rc.d/init.d/nv_peer_mem; bad; vendor preset: disabled)
   Active: active (exited) since Wed 2017-12-06 05:16:08 EST; 1min 32s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 2055 ExecStart=/etc/rc.d/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)

Dec 06 05:16:08 server01 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem module to start at boot time....
Dec 06 05:16:08 server01 nv_peer_mem[2055]: starting... OK
Dec 06 05:16:08 server01 systemd[1]: Started LSB: Activates/Deactivates nv_peer_mem module to start at boot time.