SuperPoint
Multi-GPU Training
Hi
I am trying to train the network with multiple GPUs. Training works fine with two GPUs, but when I try to use more than two GPUs it fails with an error. The log is below.
Command: python experiment.py train configs/magic-point_shapes.yaml magic-point_synth
[09/27/2018 10:22:23 INFO] Running command TRAIN
[09/27/2018 10:22:24 INFO] Number of GPUs detected: 4
[09/27/2018 10:22:26 INFO] Extracting archive for primitive draw_lines.
[09/27/2018 10:22:29 INFO] Extracting archive for primitive draw_polygon.
[09/27/2018 10:22:35 INFO] Extracting archive for primitive draw_multiple_polygons.
[09/27/2018 10:22:43 INFO] Extracting archive for primitive draw_ellipses.
[09/27/2018 10:22:55 INFO] Extracting archive for primitive draw_star.
[09/27/2018 10:23:10 INFO] Extracting archive for primitive draw_checkerboard.
[09/27/2018 10:23:30 INFO] Extracting archive for primitive draw_stripes.
[09/27/2018 10:23:53 INFO] Extracting archive for primitive draw_cube.
[09/27/2018 10:24:19 INFO] Extracting archive for primitive gaussian_noise.
[09/27/2018 10:24:50 INFO] Caching data, fist access will take some time.
[09/27/2018 10:24:51 INFO] Caching data, fist access will take some time.
[09/27/2018 10:24:51 INFO] Caching data, fist access will take some time.
2018-09-27 10:24:51.580540: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-09-27 10:24:52.087714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:1b:00.0 totalMemory: 11.91GiB freeMemory: 11.68GiB
2018-09-27 10:24:52.395446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties: name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:1c:00.0 totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-09-27 10:24:52.683678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties: name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:1d:00.0 totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-09-27 10:24:52.990422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties: name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:1e:00.0 totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-09-27 10:24:52.997493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2018-09-27 10:24:54.186225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-27 10:24:54.186264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 2 3
2018-09-27 10:24:54.186270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y Y Y
2018-09-27 10:24:54.186273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N Y Y
2018-09-27 10:24:54.186276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2:   Y Y N Y
2018-09-27 10:24:54.186279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3:   Y Y Y N
2018-09-27 10:24:54.186907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11305 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:1b:00.0, compute capability: 6.1)
2018-09-27 10:24:54.356548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11363 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:1c:00.0, compute capability: 6.1)
2018-09-27 10:24:54.531258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11363 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:1d:00.0, compute capability: 6.1)
2018-09-27 10:24:54.704548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11363 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:1e:00.0, compute capability: 6.1)
[09/27/2018 10:24:55 INFO] Scale of 0 disables regularizer.
[09/27/2018 10:24:55 INFO] Scale of 0 disables regularizer.
[09/27/2018 10:24:55 INFO] Scale of 0 disables regularizer.
...
[09/27/2018 10:24:58 INFO] Scale of 0 disables regularizer.
2018-09-27 10:24:58.889130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2018-09-27 10:24:58.889303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-27 10:24:58.889312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 2 3
2018-09-27 10:24:58.889317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y Y Y
2018-09-27 10:24:58.889321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N Y Y
2018-09-27 10:24:58.889325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2:   Y Y N Y
2018-09-27 10:24:58.889328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3:   Y Y Y N
2018-09-27 10:24:58.889819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11305 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:1b:00.0, compute capability: 6.1)
2018-09-27 10:24:58.889962: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11363 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:1c:00.0, compute capability: 6.1)
2018-09-27 10:24:58.890073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11363 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:1d:00.0, compute capability: 6.1)
2018-09-27 10:24:58.890209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11363 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:1e:00.0, compute capability: 6.1)
[09/27/2018 10:25:02 INFO] Start training
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input shape axis 0 must equal 200, got shape [100,120,160]
  [[Node: magicpoint/eval_data_sharding/unstack_3 = UnpackT=DT_INT32, axis=0, num=200, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
  [[Node: magicpoint/eval_tower2/map/while/box_nms/non_max_suppression/NonMaxSuppressionV3/_1553 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:2", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1376_...pressionV3", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:2"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "experiment.py", line 151, in

Caused by op 'magicpoint/eval_data_sharding/unstack_3', defined at:
  File "experiment.py", line 151, in

InvalidArgumentError (see above for traceback): Input shape axis 0 must equal 200, got shape [100,120,160]
  [[Node: magicpoint/eval_data_sharding/unstack_3 = UnpackT=DT_INT32, axis=0, num=200, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
  [[Node: magicpoint/eval_tower2/map/while/box_nms/non_max_suppression/NonMaxSuppressionV3/_1553 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:2", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1376_...pressionV3", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:2"]]
Does the code support more than two GPUs? Thank you.
Theoretically it should support more than 2 GPUs, but I have never tried it in practice.
What batch size did you use? The problem could come from the fact that this number is not divisible by the number of GPUs you used.
Thank you for your reply. I used all the default parameters, including the batch size of 64 from magic-point_shapes.yaml. I tried 2 to 8 GPUs, but it only worked with 2 GPUs.
Ok, I think I have an explanation for your issue.
According to the error message, the issue lies in the evaluation, not in the training. If you indeed used the default parameters of configs/magic-point_shapes.yaml, then you have a validation set of size 500 and an evaluation batch size of 50. With 4 GPUs, for example, the first evaluation step uses 50*4=200 images, the second step uses another 200 images, and the last step is left with only the remaining 100 images. But the code was expecting a batch of 200 images as before, hence the error.
So I would suggest a quick fix: change the size of the evaluation set (parameter data->validation_size in the config file) to a number that is divisible by eval_batch_size x num_GPUs. For example, if you want to keep an evaluation batch size of 50 and use 4 GPUs, you can choose a validation set of size 600 (divisible by 50 x 4 = 200).
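For reference, a minimal sketch of what that change could look like in configs/magic-point_shapes.yaml (the exact nesting of eval_batch_size may differ in your copy of the config; validation_size sits under data as mentioned above):

```yaml
data:
    validation_size: 600   # was 500; 600 is divisible by eval_batch_size x num_GPUs = 50 x 4 = 200
model:
    eval_batch_size: 50    # unchanged, shown only for context (assumed to live under model)
```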
I hope it solves your issue.
Has it solved your problem, @muk3250 ?
Sorry for the late reply. I tried different values of validation_size such that it is divisible by the number of GPUs, but the problem persists.
Hmm, then I don't know what is causing this problem... The code has already been successfully used with more than 2 GPUs in the past, so I don't see why it shouldn't work for you.
I guess that in the meantime using 2 GPUs is already enough; training and prediction with SuperPoint are quite fast anyway.
I tried to train the MagicPoint model with different "preprocessing->resize" settings in "magic-point_shapes.yaml", but an OOM error happens every time. I use 2 TITAN X GPUs with 12 GB of VRAM each. Please help me change the settings so that my system can run the training. Thank you!
What image size did you use? And what was your batch size? With those two GPUs you should be able to run MagicPoint without reaching an OOM, given a sufficiently low batch size.
The image size was 120x160 and the batch size was the default of 64. My GPUs consumed nearly 12 GB of VRAM with any settings.
I see. Can you try a lower batch size? I personally trained it with a batch size of 32 on a single GPU with 11 GB of memory, so a batch size of 32 should at least work for you, I guess.
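If it helps, a minimal sketch of that change (assuming batch_size lives under model: in magic-point_shapes.yaml; adjust to wherever it appears in your file):

```yaml
model:
    batch_size: 32   # reduced from the default of 64 to lower GPU memory usage
```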
Thank you for your response. I just changed the batch size to 32 and kept the other parameters at their defaults. I use one GPU for the display and another for training, but the error still happened. It took up 11725 MB of VRAM.
"F tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum) Aborted (core dumped)"
But is this an OOM error? What is the full error message exactly?
The problem was solved by downgrading TF from 1.12 to 1.10 with CUDA 9.1. I am moving on to the next steps. Thank you anyway!
Ok, good to know!
> I personally trained it with a batch size of 32 on a single GPU with 11 GB of memory, so a batch size of 32 should at least work for you, I guess.
@rpautrat Was this training setup specifically for the MagicPoint training or for the entire SuperPoint training as well? I just tried to start training last night and quickly realized my hardware will not be enough. I am looking at different options and trying to figure out just how much VRAM I need.
That was for MagicPoint only. For SuperPoint I had to reduce the batch size to 3 (with the same GPU) to avoid the OOM error.
I trained SuperPoint on MS-COCO and got memory leaks at step 6 with the default settings. It gradually consumed ~25 GB of RAM and our system crashed. Changing the image resolution or the batch size did not help. Do you have any idea about this problem? Thank you!
It is weird that you have this increasing memory consumption, as if there was a memory leak... I have never observed that and I don't see where there could be such a leak.
When you said 25GB of RAM, you meant GPU memory, right?
One other possibility is to modify the structure of the network itself: instead of using patches of size 8x8, you can use bigger patches (like 16x16). Since most of the work is done on the downsampled images, it should consume less GPU memory that way. But you would need to adapt the rest of the code as well (for example the hyperparameter balancing the number of positive and negative matches in the descriptor loss), and the performance might be a bit worse than with the original network.
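To give a rough idea of why bigger patches reduce memory (a back-of-the-envelope sketch, assuming the descriptor loss sums over all pairs of cells between the two images as in the SuperPoint paper, and taking a 240x320 image as an example):

```latex
% H_c x W_c is the grid of cells for cell size c, i.e. H_c = H/c, W_c = W/c.
% The pairwise descriptor loss has on the order of (H_c * W_c)^2 terms:
%   c = 8  ->  (30 * 40)^2 = 1 440 000 cell pairs
%   c = 16 ->  (15 * 20)^2 =    90 000 cell pairs
\[
\frac{(H/8 \cdot W/8)^2}{(H/16 \cdot W/16)^2} = 2^4 = 16
\]
```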
No, it was RAM, not VRAM. The problem looks like this
Oh I see, then changing the batch size, the resolution or the size of the small patches indeed won't help.
Then you probably have the same issue as in https://github.com/rpautrat/SuperPoint/issues/22#issuecomment-455394267. I tried to suggest some methods to find a potential memory leak in this thread, but I don't know if @lidongjiangBJTU has solved the problem in the end.
@rpautrat I'm not at the SuperPoint training stage yet, but is there a chance these memory leak problems are related to the cache_in_memory parameter? I notice that in superpoint_coco.yaml the cache_in_memory parameter defaults to True, while it is set to False in the other config files. I looked into the code and it doesn't seem like it should make a difference, but I thought I'd bring it up because I noticed this discrepancy.
Good point! Caching shouldn't normally be a problem because the data is cached only once (and it should fit in memory), but there might be a bug with this somewhere.
@TienPhuocNguyen, can you try to train SuperPoint with the parameter cache_in_memory set to False to see if it helps?
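Concretely, that would be a one-line change in superpoint_coco.yaml, something like the sketch below (only the cache_in_memory flag matters here; the surrounding keys of your config stay as they are):

```yaml
data:
    cache_in_memory: false   # defaults to true in this config; the other configs already use false
```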
Thank you guys, this change got my system working. However, I had to reduce the training and evaluation batch sizes to avoid the OOM problem on the GPU.
Great! Yes, I guess you have no choice other than reducing the batch size for the GPU, but it shouldn't impact the quality of your training too much.
> Sorry for the late reply. I tried different values of validation_size such that it is divisible by the number of GPUs, but the problem persists.
Same here. May I ask whether you solved the problem?
> Thank you for your reply. I used all the default parameters, including the batch size of 64 from magic-point_shapes.yaml. I tried 2 to 8 GPUs, but it only worked with 2 GPUs.
The current code only supports training on 2 GPUs during the iterative steps between steps 2 and 3 of the README.