
Multi-GPU Training

Open muk3250 opened this issue 6 years ago • 26 comments

Hi

I am trying to train the network on multiple GPUs. The training works fine with two GPUs, but when I tried to train the network with more than two GPUs it gave an error. The log file is below.

Command: python experiment.py train configs/magic-point_shapes.yaml magic-point_synth

[09/27/2018 10:22:23 INFO] Running command TRAIN
[09/27/2018 10:22:24 INFO] Number of GPUs detected: 4
[09/27/2018 10:22:26 INFO] Extracting archive for primitive draw_lines.
[09/27/2018 10:22:29 INFO] Extracting archive for primitive draw_polygon.
[09/27/2018 10:22:35 INFO] Extracting archive for primitive draw_multiple_polygons.
[09/27/2018 10:22:43 INFO] Extracting archive for primitive draw_ellipses.
[09/27/2018 10:22:55 INFO] Extracting archive for primitive draw_star.
[09/27/2018 10:23:10 INFO] Extracting archive for primitive draw_checkerboard.
[09/27/2018 10:23:30 INFO] Extracting archive for primitive draw_stripes.
[09/27/2018 10:23:53 INFO] Extracting archive for primitive draw_cube.
[09/27/2018 10:24:19 INFO] Extracting archive for primitive gaussian_noise.
[09/27/2018 10:24:50 INFO] Caching data, fist access will take some time.
[09/27/2018 10:24:51 INFO] Caching data, fist access will take some time.
[09/27/2018 10:24:51 INFO] Caching data, fist access will take some time.
2018-09-27 10:24:51.580540: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-09-27 10:24:52.087714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:1b:00.0 totalMemory: 11.91GiB freeMemory: 11.68GiB
2018-09-27 10:24:52.395446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties: name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:1c:00.0 totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-09-27 10:24:52.683678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties: name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:1d:00.0 totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-09-27 10:24:52.990422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties: name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:1e:00.0 totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-09-27 10:24:52.997493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2018-09-27 10:24:54.186225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-27 10:24:54.186264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2018-09-27 10:24:54.186270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y Y Y
2018-09-27 10:24:54.186273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N Y Y
2018-09-27 10:24:54.186276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: Y Y N Y
2018-09-27 10:24:54.186279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: Y Y Y N
2018-09-27 10:24:54.186907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11305 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:1b:00.0, compute capability: 6.1)
2018-09-27 10:24:54.356548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11363 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:1c:00.0, compute capability: 6.1)
2018-09-27 10:24:54.531258: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11363 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:1d:00.0, compute capability: 6.1)
2018-09-27 10:24:54.704548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11363 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:1e:00.0, compute capability: 6.1)
[09/27/2018 10:24:55 INFO] Scale of 0 disables regularizer.
[09/27/2018 10:24:55 INFO] Scale of 0 disables regularizer.
[09/27/2018 10:24:55 INFO] Scale of 0 disables regularizer.
. . .
[09/27/2018 10:24:58 INFO] Scale of 0 disables regularizer.
2018-09-27 10:24:58.889130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2018-09-27 10:24:58.889303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-27 10:24:58.889312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2018-09-27 10:24:58.889317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y Y Y
2018-09-27 10:24:58.889321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N Y Y
2018-09-27 10:24:58.889325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: Y Y N Y
2018-09-27 10:24:58.889328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: Y Y Y N
2018-09-27 10:24:58.889819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11305 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:1b:00.0, compute capability: 6.1)
2018-09-27 10:24:58.889962: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11363 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:1c:00.0, compute capability: 6.1)
2018-09-27 10:24:58.890073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11363 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:1d:00.0, compute capability: 6.1)
2018-09-27 10:24:58.890209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11363 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:1e:00.0, compute capability: 6.1)
[09/27/2018 10:25:02 INFO] Start training
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input shape axis 0 must equal 200, got shape [100,120,160]
  [[Node: magicpoint/eval_data_sharding/unstack_3 = UnpackT=DT_INT32, axis=0, num=200, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
  [[Node: magicpoint/eval_tower2/map/while/box_nms/non_max_suppression/NonMaxSuppressionV3/_1553 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:2", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1376_...pressionV3", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:2"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "experiment.py", line 151, in <module>
    args.func(config, output_dir, args)
  File "experiment.py", line 89, in _cli_train
    train(config, config['train_iter'], output_dir)
  File "experiment.py", line 27, in train
    keep_checkpoints=config.get('keep_checkpoints', 1))
  File "/home/ubuntu/Downloads/SuperPoint_Test2/SuperPoint-master/superpoint/models/base_model.py", line 318, in train
    metrics = self.evaluate('validation', mute=True)
  File "/home/ubuntu/Downloads/SuperPoint_Test2/SuperPoint-master/superpoint/models/base_model.py", line 371, in evaluate
    feed_dict={self.handle: self.dataset_handles[dataset]}))
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
    run_metadata_ptr)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
    run_metadata)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input shape axis 0 must equal 200, got shape [100,120,160]
  [[Node: magicpoint/eval_data_sharding/unstack_3 = UnpackT=DT_INT32, axis=0, num=200, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
  [[Node: magicpoint/eval_tower2/map/while/box_nms/non_max_suppression/NonMaxSuppressionV3/_1553 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:2", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1376_...pressionV3", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:2"]]

Caused by op 'magicpoint/eval_data_sharding/unstack_3', defined at:
  File "experiment.py", line 151, in <module>
    args.func(config, output_dir, args)
  File "experiment.py", line 89, in _cli_train
    train(config, config['train_iter'], output_dir)
  File "experiment.py", line 21, in train
    with _init_graph(config) as net:
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/contextlib.py", line 82, in __enter__
    return next(self.gen)
  File "experiment.py", line 74, in _init_graph
    n_gpus=n_gpus, **config['model'])
  File "/home/ubuntu/Downloads/SuperPoint_Test2/SuperPoint-master/superpoint/models/base_model.py", line 122, in __init__
    self._build_graph()
  File "/home/ubuntu/Downloads/SuperPoint_Test2/SuperPoint-master/superpoint/models/base_model.py", line 265, in _build_graph
    self._eval_graph(data)
  File "/home/ubuntu/Downloads/SuperPoint_Test2/SuperPoint-master/superpoint/models/base_model.py", line 216, in _eval_graph
    tower_metrics = self._gpu_tower(data, Mode.EVAL, self.config['eval_batch_size'])
  File "/home/ubuntu/Downloads/SuperPoint_Test2/SuperPoint-master/superpoint/models/base_model.py", line 142, in _gpu_tower
    shards = self._unstack_nested_dict(data, batch_size*self.n_gpus)
  File "/home/ubuntu/Downloads/SuperPoint_Test2/SuperPoint-master/superpoint/models/base_model.py", line 126, in _unstack_nested_dict
    else tf.unstack(v, num=num, axis=0) for k, v in d.items()}
  File "/home/ubuntu/Downloads/SuperPoint_Test2/SuperPoint-master/superpoint/models/base_model.py", line 126, in <dictcomp>
    else tf.unstack(v, num=num, axis=0) for k, v in d.items()}
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1017, in unstack
    return gen_array_ops.unpack(value, num=num, axis=axis, name=name)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 9138, in unpack
    "Unpack", value=value, num=num, axis=axis, name=name)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/home/ubuntu/anaconda3/envs/SP_test/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Input shape axis 0 must equal 200, got shape [100,120,160]
  [[Node: magicpoint/eval_data_sharding/unstack_3 = UnpackT=DT_INT32, axis=0, num=200, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
  [[Node: magicpoint/eval_tower2/map/while/box_nms/non_max_suppression/NonMaxSuppressionV3/_1553 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:2", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1376_...pressionV3", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:2"]]

Does the code support more than two GPUs? Thank you.

muk3250 avatar Sep 27 '18 01:09 muk3250

Theoretically, it should support more than 2 GPUs, but in practice I have never tried it.

What batch size did you use? The problem could come from the fact that this number is not divisible by the number of GPUs you used.

rpautrat avatar Sep 27 '18 09:09 rpautrat

Thank you for your reply. All default parameters were used; I used the batch size of 64 given in magic-point_shapes.yaml. I tried between 2 and 8 GPUs, but it only worked with 2 GPUs.

muk3250 avatar Sep 27 '18 12:09 muk3250

Ok, I think I have an explanation for your issue.

According to the error message, the issue lies in the evaluation, not in the training. If you indeed used the default parameters of configs/magic-point_shapes.yaml, then you have a validation set of size 500 and a batch size of 50 for the evaluation. Using 4 GPUs, for example, the first evaluation step will use 50*4=200 images, the second step will again use 200 images, and the last step will be left with only 100 images. But the code was expecting a batch of 200 images as before, hence the error.

So I would suggest a quick fix: change the size of the evaluation set (parameter data->validation_size in the config file) to a number that is divisible by eval_batch_size x num_GPUs. For example, if you want to keep an evaluation batch size of 50 and use 4 GPUs, you can choose a validation set of size 600 (divisible by 50 x 4).
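
For illustration, here is a minimal sketch of the constraint (my own toy example with made-up function names, not the actual base_model.py code): the evaluation graph splits every fetched batch into eval_batch_size * n_gpus slices, so each evaluation batch must contain exactly that many images.

```python
# Toy check of the evaluation sharding constraint (illustrative only, not SuperPoint code).
def check_eval_sharding(validation_size, eval_batch_size, n_gpus):
    shard = eval_batch_size * n_gpus          # images consumed per evaluation step
    n_full, remainder = divmod(validation_size, shard)
    print(f"{n_full} full steps of {shard} images, remainder = {remainder}")
    if remainder:
        # Mirrors the InvalidArgumentError above: tf.unstack(..., num=shard)
        # receives a final batch of only `remainder` images and fails.
        print("Last batch has the wrong size -> evaluation crashes.")
    else:
        print("OK: validation_size is divisible by eval_batch_size * n_gpus.")

check_eval_sharding(500, 50, 4)   # 2 full steps of 200, remainder 100 -> crash, as in the log
check_eval_sharding(600, 50, 4)   # 3 full steps of 200, remainder 0   -> fine
```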

I hope it solves your issue.

rpautrat avatar Sep 29 '18 21:09 rpautrat

Has it solved your problem, @muk3250 ?

rpautrat avatar Oct 08 '18 13:10 rpautrat

Sorry for the late reply. I have tried different validation_size values such that they are divisible by the number of GPUs, but the problem persists.

muk3250 avatar Oct 15 '18 09:10 muk3250

Hmm, then I don't know what is causing this problem... The code has already been used successfully with more than 2 GPUs in the past, so I don't see why it shouldn't work for you.

I guess that in the meantime using 2 GPUs is already enough; training and prediction with SuperPoint are quite fast anyway.

rpautrat avatar Oct 15 '18 21:10 rpautrat

I tried to train the MagicPoint models with different "preprocessing->resize" settings in the file "magic-point_shapes.yaml", but an OOM error happens every time. I use 2 TITAN X GPUs with 12 GB of VRAM each. Please help me change the settings so that my system can run the training. Thank you!

TienPhuocNguyen avatar Feb 27 '19 08:02 TienPhuocNguyen

What image size did you use? And what was your batch size? With those two GPUs you should be able to run MagicPoint without reaching an OOM, given a sufficiently low batch size.

rpautrat avatar Feb 27 '19 09:02 rpautrat

The image size was 120x160 and the batch size was 64, as in the defaults. My GPUs consumed nearly 12 GB of VRAM with any settings.

TienPhuocNguyen avatar Feb 27 '19 09:02 TienPhuocNguyen

I see. Can you try with a lower batch size? I personally trained it with a batch size of 32, with a single GPU with 11 GB of memory, so it should at least work for you with a batch size of 32 I guess.

rpautrat avatar Feb 27 '19 10:02 rpautrat

Thank you for your response. I just changed the batch size to 32 and kept the other parameters at their defaults. I set one GPU for the display and another for training, but the error still happened. It took up 11725 MB of VRAM.

"F tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum) Aborted (core dumped)"

TienPhuocNguyen avatar Feb 27 '19 11:02 TienPhuocNguyen

But is this an OOM error? What is the full error message exactly?

rpautrat avatar Feb 27 '19 12:02 rpautrat

The problem was solved by downgrading my TF from 1.12 to 1.10 with CUDA 9.1. I am moving on to the next steps. Thank you anyway!

TienPhuocNguyen avatar Feb 28 '19 00:02 TienPhuocNguyen

Ok, good to know!

rpautrat avatar Feb 28 '19 08:02 rpautrat

I personally trained it with a batch size of 32, with a single GPU with 11 GB of memory, so it should at least work for you with a batch size of 32 I guess.

@rpautrat Was this training setup specifically for the MagicPoint training or for the entire SuperPoint training as well? I just tried to start training last night and quickly realized my hardware will not be enough. I am looking at different options and trying to figure out just how much VRAM I need.

mmmfarrell avatar Mar 06 '19 16:03 mmmfarrell

That was for MagicPoint only. For SuperPoint I had to reduce the batch size to 3 (with the same GPU) to avoid the OOM error.

rpautrat avatar Mar 06 '19 18:03 rpautrat

I trained SuperPoint on MS-COCO and got memory leaks at step 6 with the default settings. It gradually consumed ~25 GB of RAM and our system crashed. Changing the image resolution or the batch size did not help. Do you have any idea about this problem? Thank you!

TienPhuocNguyen avatar Mar 07 '19 10:03 TienPhuocNguyen

It is weird that you have this increasing memory consumption, as if there was a memory leak... I have never observed that and I don't see where there could be such a leak.

When you said 25GB of RAM, you meant GPU memory, right?

One other possibility is to modify the structure of the network itself: instead of using patches of size 8x8, you can use bigger patches (like 16x16). Since most of the work is done on the downsampled images, it should consume less GPU memory that way. But you would need to adapt the rest of the code as well (for example the hyperparameter balancing the number of positive and negative matches in the descriptor loss), and the performance might be a bit worse than with the original network.
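
As a rough back-of-the-envelope illustration of the saving (my own sketch; it only counts the detector and descriptor head activations and ignores the shared encoder):

```python
# Compare per-image head activation counts for two patch (cell) sizes.
# Assumes the SuperPoint-style layout: a (H/cell) x (W/cell) grid with
# cell*cell + 1 detector channels and 256 descriptor channels.
def head_sizes(h, w, cell, desc_dim=256):
    hc, wc = h // cell, w // cell
    detector = hc * wc * (cell * cell + 1)   # detector logits
    descriptor = hc * wc * desc_dim          # raw (coarse) descriptors
    return hc, wc, detector, descriptor

for cell in (8, 16):
    hc, wc, det, desc = head_sizes(240, 320, cell)  # 240x320 is just an example resolution
    print(f"cell={cell:2d}: grid {hc}x{wc}, detector activations {det}, descriptor activations {desc}")
# The detector head stays roughly constant (its channel count grows with cell^2),
# but the coarse descriptor map and everything computed on it shrink by ~4x when
# going from 8x8 to 16x16 patches.
```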

rpautrat avatar Mar 07 '19 12:03 rpautrat

No, it was RAM, not VRAM. The problem looks like this: [attached screenshot of the memory usage]

TienPhuocNguyen avatar Mar 07 '19 12:03 TienPhuocNguyen

Oh I see, then changing the batch size, the resolution, or the size of the small patches indeed won't help.

Then you probably have the same issue as in https://github.com/rpautrat/SuperPoint/issues/22#issuecomment-455394267. I suggested some methods to find a potential memory leak in that thread, but I don't know whether @lidongjiangBJTU solved the problem in the end.
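
One generic way to look for such a host-RAM leak (a minimal sketch, not necessarily the method suggested in that thread) is to log the resident memory of the training process every few steps and see whether it keeps growing:

```python
# Minimal host-RAM monitor using psutil (diagnostic sketch, not SuperPoint code).
import os
import psutil

_process = psutil.Process(os.getpid())

def log_rss(step):
    rss_gb = _process.memory_info().rss / 1024 ** 3
    print(f"step {step}: resident memory = {rss_gb:.2f} GB")

# Hypothetical usage inside a TF1-style training loop:
# for step in range(n_iter):
#     sess.run(train_op)
#     if step % 100 == 0:
#         log_rss(step)   # a steadily increasing value points to a leak
```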

rpautrat avatar Mar 07 '19 13:03 rpautrat

@rpautrat I'm not at the SuperPoint training yet, but is there a chance these memory-leak problems are related to the cache_in_memory parameter? I noticed that in superpoint_coco.yaml the cache_in_memory parameter defaults to True, while it is set to False in the other config files. I looked into the code and it doesn't seem like it should make a difference, but I thought I'd bring it up since I noticed the discrepancy.

mmmfarrell avatar Mar 12 '19 01:03 mmmfarrell

Good point! Normally, caching shouldn't be a problem because the data is cached only once (and it should fit in memory), but there might be a bug with this somewhere. @TienPhuocNguyen, can you try to train SuperPoint with the parameter cache_in_memory set to False to see if it helps?
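
A minimal sketch of how one could test that, assuming the flag lives under data->cache_in_memory in configs/superpoint_coco.yaml (the output file name and the pyyaml-based approach are my own; editing the YAML by hand works just as well):

```python
# Write a copy of the config with in-memory caching disabled (illustrative sketch).
import yaml

with open("configs/superpoint_coco.yaml") as f:
    config = yaml.safe_load(f)

config["data"]["cache_in_memory"] = False   # disable caching of the dataset in RAM

with open("configs/superpoint_coco_nocache.yaml", "w") as f:
    yaml.safe_dump(config, f)

# Then train with the modified config, e.g.:
# python experiment.py train configs/superpoint_coco_nocache.yaml superpoint_coco_nocache
```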

rpautrat avatar Mar 12 '19 08:03 rpautrat

Thank you guys, this change gets my system working. However, I have to reduce the training batch size and the evaluation batch size to avoid the OOM problem on the GPU.

TienPhuocNguyen avatar Mar 12 '19 08:03 TienPhuocNguyen

Great! Yes, I guess you have no other choice than to reduce the batch size for the GPU, but it shouldn't impact the quality of your training too much.

rpautrat avatar Mar 12 '19 08:03 rpautrat

Sorry for the late reply. I have tried different validation_size values such that they are divisible by the number of GPUs, but the problem persists.

Same here. May I ask whether you solved the problem?

Le0000000000n avatar May 27 '19 02:05 Le0000000000n

Thank you for your reply. All default parameters were used; I used the batch size of 64 given in magic-point_shapes.yaml. I tried between 2 and 8 GPUs, but it only worked with 2 GPUs.

The current code only supports training on 2 GPUs during the iterative steps between steps 2 and 3 in the README.

stoneyang avatar Dec 02 '19 03:12 stoneyang