train_search on multi-gpus

Open JaminFong opened this issue 7 years ago • 19 comments

Hello, quark! Thanks for your great work. When I tried to run your train_search job on multiple GPUs, the alphas_normal and alphas_reduce Variables caused errors. The errors are as follows:

    File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
      result = self.forward(*input, **kwargs)
    File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 111, in forward
      s0, s1 = s1, cell(s0, s1, weights)
    File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
      result = self.forward(*input, **kwargs)
    File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 54, in forward
      s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states))
    File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 54, in <genexpr>
      s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states))
    File "/mnt/data-3/data/jiemin.fang/anaconda3/envs/pytorch4/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
      result = self.forward(*input, **kwargs)
    File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 22, in forward
      return sum(w * op(x) for w, op in zip(weights, self._ops))
    File "/mnt/data-3/data/jiemin.fang/darts-maml/cnn/model_search.py", line 22, in <genexpr>
      return sum(w * op(x) for w, op in zip(weights, self._ops))
    RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:314

To debug the code, I tried removing 'w' (which comes from alphas_normal or alphas_reduce) from

    return sum(w * op(x) for w, op in zip(weights, self._ops))

I have tried both PyTorch 0.3 and 0.4, but the problem persists. Could you please tell me how to get multi-GPU training working? Have you ever run into a similar problem? Best, and waiting for your reply!

JaminFong avatar Aug 27 '18 13:08 JaminFong

That's because the arch_parameters are not being copied onto every GPU. DataParallel only copies parameters and buffers of a module to all GPUs. In the above code, the arch_parameters are Variables and as a result, they do not get copied, hence the error. You can try making them parameters, but then you will have to override the parameters() function so that only weight parameters are returned, and not the arch parameters.
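A minimal sketch of that suggestion, assuming a toy module (`TinySearchNet` and its `weight_parameters()` helper are hypothetical names, not the repository's `Network` class). Instead of overriding `parameters()`, which other PyTorch utilities may also rely on, this variant registers the alphas as `nn.Parameter` and filters them out only when building the weight optimizer. The hyperparameter values are illustrative.

```python
import torch
import torch.nn as nn


class TinySearchNet(nn.Module):
    """Toy stand-in for a search network (hypothetical, for illustration only)."""

    def __init__(self, num_edges=3, num_ops=4):
        super().__init__()
        self.fc = nn.Linear(8, 2)
        # Registered as nn.Parameter, so they belong to the module's parameters
        # and are handled like any other parameter when the module is replicated.
        self.alphas_normal = nn.Parameter(1e-3 * torch.randn(num_edges, num_ops))
        self.alphas_reduce = nn.Parameter(1e-3 * torch.randn(num_edges, num_ops))

    def arch_parameters(self):
        return [self.alphas_normal, self.alphas_reduce]

    def weight_parameters(self):
        # Everything in parameters() except the architecture parameters.
        arch_ids = {id(p) for p in self.arch_parameters()}
        return [p for p in self.parameters() if id(p) not in arch_ids]

    def forward(self, x):
        # Use one softmaxed alpha entry as a scalar gate, just to involve the alphas.
        gate = torch.softmax(self.alphas_normal[0], dim=-1)[0]
        return gate * self.fc(x)


model = TinySearchNet()

# Weight optimizer sees only the network weights; the alphas get their own optimizer.
w_optimizer = torch.optim.SGD(model.weight_parameters(), lr=0.025, momentum=0.9)
a_optimizer = torch.optim.Adam(model.arch_parameters(), lr=3e-4)
```

With this split, `w_optimizer.step()` cannot modify the alphas even though they now appear in `model.parameters()`.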

However, DataParallel will not give you any speedup in this case. It will in fact be very slow, because copying the modules over before every forward pass takes a lot of time. There are around 5000 nested modules in the search network, whereas a large network like ResNet-101 has fewer than 400. This overhead will wipe out any possible benefit of data parallelization.

arunmallya avatar Aug 30 '18 16:08 arunmallya

@arunmallya Thanks for your reply. I'll try what you suggested. As for the speed of DataParallel, I think it may not help when the network is tiny, but I want to apply darts to larger networks on larger datasets. In that case, a single GPU may not be able to handle the work.

JaminFong avatar Sep 03 '18 11:09 JaminFong

@arunmallya I agree with your points. Several people asked about this, but I haven't got the chance to try it myself.

@JaminFong An alternative approach is to further reduce the batch size/number of channels during search, though this might lead to some additional discrepancies between search & evaluation.

quark0 avatar Sep 03 '18 16:09 quark0

@quark0 Yes, I tried reducing the number of layers or the image size to fit my experiment on one GPU. But I think that if we want to extend darts to larger-scale tasks, data parallelism may be necessary. Best!

JaminFong avatar Sep 03 '18 16:09 JaminFong

@JaminFong Hi, have you tried to implement it with multi-gpu? I am also going to search on a large task but one gpu is limited. Thanks.

VectorYoung avatar Feb 11 '19 16:02 VectorYoung

@VectorYoung You could refer to https://github.com/JaminFong/darts-multi_gpu. I have implemented a multi-gpu one for the first order version.

JaminFong avatar Feb 12 '19 04:02 JaminFong

@JaminFong Hi, thanks for your work. When I run it on multiple GPUs, I get the error below. Have you seen it before?

      logits = self.model(input_valid)
    File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 357, in __call__
      result = self.forward(*input, **kwargs)
    File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 69, in forward
      inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
    File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 80, in scatter
      return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
    File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 38, in scatter_kwargs
      inputs = scatter(inputs, target_gpus, dim) if inputs else []
    File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 31, in scatter
      return scatter_map(inputs)
    File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 18, in scatter_map
      return list(zip(*map(scatter_map, obj)))
    File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/scatter_gather.py", line 16, in scatter_map
      assert not torch.is_tensor(obj), "Tensors not supported in scatter."
    AssertionError: Tensors not supported in scatter.

QiuPaul avatar Feb 26 '19 06:02 QiuPaul

@QiuPaul Hi, are you using PyTorch 0.3? If so, it is best to use PyTorch 1.0 instead, or at least PyTorch >= 0.4.

JaminFong avatar Feb 26 '19 06:02 JaminFong

> @QiuPaul Hi, are you using PyTorch 0.3? If so, it is best to use PyTorch 1.0 instead, or at least PyTorch >= 0.4.

@JaminFong Yes, thanks very much for your advice. It runs with PyTorch 1.0. More importantly, I noticed that you made a modification in your code:

    optimizer = torch.optim.SGD(
        weight_params,  # model.parameters(),
        args.learning_rate,
        momentum=args.momentum,
        weight_decay=args.weight_decay)

Have you also run into the problem described in the issue below? Thanks. https://github.com/quark0/darts/issues/75

In the paper: while not converged do

  1. Update weights w by descending GRADw(w; alpha)
  2. Update architecture alpha by descending GRADalpha(updated w; alpha)

This means that alpha is fixed while the weights are updated. However, in the original code below, when momentum SGD updates the weights, it seems that all parameters in model.parameters(), including the arch_parameters, are updated. This still needs to be confirmed. Thanks.

    optimizer = torch.optim.SGD(
        model.parameters(),
        args.learning_rate,
        momentum=args.momentum,
        weight_decay=args.weight_decay)
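The two alternating steps quoted above can be sketched on a toy scalar problem. Everything here is an assumption made for illustration: the losses `(alpha * w - 1)^2` and `(alpha * w - 2)^2`, the function names, and the learning rates are hypothetical and are not taken from the darts code. The point is only that each step changes one variable while the other is held fixed.

```python
def train_grad_w(w, alpha):
    # d/dw of the toy training loss (alpha * w - 1)^2
    return 2.0 * (alpha * w - 1.0) * alpha


def val_grad_alpha(w, alpha):
    # d/dalpha of the toy validation loss (alpha * w - 2)^2
    return 2.0 * (alpha * w - 2.0) * w


def search(steps, w=0.5, alpha=1.0, lr_w=0.1, lr_alpha=0.05):
    for _ in range(steps):
        # Step 1: update the weight w; alpha is only read, never written.
        w = w - lr_w * train_grad_w(w, alpha)
        # Step 2: update alpha using the already-updated w (first-order variant).
        alpha = alpha - lr_alpha * val_grad_alpha(w, alpha)
    return w, alpha


w, alpha = search(steps=1)
print(w, alpha)
```

In a PyTorch setting the same separation is usually obtained by giving each optimizer only its own group of tensors.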

QiuPaul avatar Mar 01 '19 08:03 QiuPaul

@QiuPaul In the original code, the architecture parameters (alphas_normal and alphas_reduce) are not in model.parameters(). https://github.com/quark0/darts/blob/f276dd346a09ae3160f8e3aca5c7b193fda1da37/cnn/model_search.py#L123 Therefore, there is no need to filter the parameters in the original code. Please refer to how PyTorch registers module parameters.
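A small sketch of the registration behaviour referred to here, assuming the alphas are stored as plain tensors with `requires_grad=True` rather than as `nn.Parameter` (the class `Net` below is a hypothetical toy, not the darts model). Such attributes are not picked up by `parameters()` or `state_dict()`.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        # Plain tensor attribute: it tracks gradients, but the module does not
        # register it as a parameter.
        self.alphas = torch.zeros(3, 4, requires_grad=True)


net = Net()
param_ids = {id(p) for p in net.parameters()}

print(id(net.alphas) in param_ids)   # False: not returned by parameters()
print("alphas" in net.state_dict())  # False: not saved in the state dict either
```

An optimizer built from `net.parameters()` therefore never touches `net.alphas`.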

JaminFong avatar Mar 01 '19 09:03 JaminFong

> @VectorYoung You could refer to https://github.com/JaminFong/darts-multi_gpu. I have implemented a multi-gpu one for the first order version.

Thanks so much for your work. I have run your code, but judging from the results I get (figure below), there seems to be a problem.

I ran the code on Titan GPUs with a batch size of 64.

  • Fw means the forward time for each step
  • Bw means the backward time for each step
  • Up means the update time for each step, i.e. optimizer.step()
  • Arch means the time spent training the architecture for each step

image

The problem is that running on multiple GPUs is even slower than running on a single GPU.

The GPU usage info is as follows: image image

marsggbo avatar Apr 17 '19 07:04 marsggbo

@marsggbo When running on multiple GPUs, DataParallel in PyTorch takes much more time to copy data to all the target devices before each forward pass, especially because the number of modules in the darts network is much larger than in typical networks. So when your batch size is small, running the model on multiple GPUs may not give any speedup.

JaminFong avatar Apr 27 '19 08:04 JaminFong

@JaminFong I've looked at your implementation and only found instructions for running the 2nd order version of the algorithm. Could you specify the instructions on running the algorithm on just the first order?

killawhale2 avatar Jun 21 '19 04:06 killawhale2

> @JaminFong I've looked at your implementation and only found instructions for running the 2nd order version of the algorithm. Could you specify the instructions on running the algorithm on just the first order?

--unrolled False ?

Margrate avatar Jun 29 '19 10:06 Margrate

@JaminFong I ran the multi-GPU code on Titan GPUs with a batch size of 64. I get the error below. Have you seen it before?

    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict
      self.__class__.__name__, "\n\t".join(error_msgs)))
    RuntimeError: Error(s) in loading state_dict for Network:
      Missing key(s) in state_dict: "alphas_reduce", "alphas_normal", "stem.0.weight", "stem.1.running_var", "stem.1.bias", "stem.1.weight", "stem.1.running_mean", "cells.0.preprocess0.op.1.weight", "cells.0.preprocess0.op.2.running_var", "cells.0.preprocess0.op.2.running_mean", "cells.0.preprocess1.op.1.weight", "cells.0.preprocess1.op.2.running_var", "cells.0.preprocess1.op.2.running_mean", "cells.0._ops.0._ops.1.1.running_var", "cells.0._ops.0._ops.1.1.running_mean", "cells.0._ops.0._ops.2.1.running_var", "cells.0._ops.0._ops.2.1.running_mean", "cells.0._ops.0._ops.4.op.1.weight", "cells.0._ops.0._ops.4.op.2.weight"......

xjtuzll avatar Jul 11 '19 14:07 xjtuzll

> @JaminFong I ran the multi-GPU code on Titan GPUs with a batch size of 64. I get the error below. Have you seen it before?
>
>     File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 769, in load_state_dict
>       self.__class__.__name__, "\n\t".join(error_msgs)))
>     RuntimeError: Error(s) in loading state_dict for Network:
>       Missing key(s) in state_dict: "alphas_reduce", "alphas_normal", "stem.0.weight", "stem.1.running_var", "stem.1.bias", "stem.1.weight", "stem.1.running_mean", "cells.0.preprocess0.op.1.weight", "cells.0.preprocess0.op.2.running_var", "cells.0.preprocess0.op.2.running_mean", "cells.0.preprocess1.op.1.weight", "cells.0.preprocess1.op.2.running_var", "cells.0.preprocess1.op.2.running_mean", "cells.0._ops.0._ops.1.1.running_var", "cells.0._ops.0._ops.1.1.running_mean", "cells.0._ops.0._ops.2.1.running_var", "cells.0._ops.0._ops.2.1.running_mean", "cells.0._ops.0._ops.4.op.1.weight", "cells.0._ops.0._ops.4.op.2.weight"......

When you load a model that was saved from a multi-GPU (DataParallel-wrapped) model, the parameter keys may be prefixed, e.g. module.***. You need to check the key names in the parameter dict.
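One common way to handle such prefixed keys, shown as a sketch: strip the leading `module.` before loading. The helper name `strip_module_prefix` and the example keys are hypothetical, and the code works on any plain mapping.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in for a checkpoint saved from a wrapped model (values are placeholders).
saved = OrderedDict([("module.alphas_normal", 1), ("module.stem.0.weight", 2)])
print(list(strip_module_prefix(saved)))  # ['alphas_normal', 'stem.0.weight']
```

The cleaned dict could then be passed to `load_state_dict` on the unwrapped model. Printing the checkpoint's keys first confirms whether the prefix is actually the cause of the missing keys.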

JaminFong avatar Jul 12 '19 03:07 JaminFong

Can train.py run on multiple GPUs? Is drop_path not supported?

Margrate avatar Jul 16 '19 03:07 Margrate

The first-order approximation leads to worse performance than the second-order approximation.

giangtranml avatar Nov 04 '19 09:11 giangtranml

I have implemented a distributed PC-Darts for the first-order version: https://github.com/bitluozhuang/Distributed-PC-Darts. You are welcome to try it.

bitluozhuang avatar Dec 26 '19 10:12 bitluozhuang