
Modify code to make train_gnmt support multi-GPU training

Open pengxin99 opened this issue 6 years ago • 17 comments

Description

Modify train_gnmt.py to support a multi-GPU environment. The change follows train_transformer.py, which already trains on multiple GPUs.
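In outline, the data-parallel pattern borrowed from train_transformer.py looks like the following. This is a minimal sketch, assuming `model`, `loss_function`, and `trainer` are set up as in the existing script (with the trainer created with update_on_kvstore=False) and that `loss_denom` is the global target-token count; the shard names are illustrative:

import mxnet as mx
from mxnet import autograd, gluon

# Shard one global batch across all contexts, run forward/backward per
# shard, then aggregate gradients and take a single optimizer step.
shards = [gluon.utils.split_and_load(x, context, even_split=False)
          for x in (src_seq, tgt_seq, src_valid_length, tgt_valid_length)]
losses = []
with autograd.record():
    for s, t, s_len, t_len in zip(*shards):
        out, _ = model(s, t[:, :-1], s_len, t_len - 1)
        losses.append(loss_function(out, t[:, 1:], t_len - 1).sum() / loss_denom)
for l in losses:
    l.backward()
trainer.allreduce_grads()  # sum gradients across contexts
trainer.update(1)          # losses were already rescaled by the global token count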

Checklist

Essentials

  • [x] Changes are complete (i.e. I finished coding on this PR)
  • [x] All changes have test coverage
  • [x] Code is well-documented

Changes

  • [ ] Feature1, tests, (and when applicable, API doc)
  • [ ] Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

pengxin99 avatar Nov 27 '18 02:11 pengxin99

@pengxin99 this is awesome. Thanks for the contribution! Looks like git wasn't able to do an auto-merge when checking out. Could you try the following for a rebase?

git remote add dmlc https://github.com/dmlc/gluon-nlp
git pull dmlc master --rebase
git push --force

szha avatar Nov 27 '18 02:11 szha

@szha Thanks, I will make the appropriate modifications.

pengxin99 avatar Nov 27 '18 02:11 pengxin99

@szhengac Thanks for the review,

The loss needs to be carefully rescaled, so that the final gradient is the average of the gradients w.r.t. tokens.

I want to confirm: does this mean the loss on each context should be averaged over the global batch?

pengxin99 avatar Nov 29 '18 08:11 pengxin99

@pengxin99 Yes, the gradient needs to be averaged by the total number of tokens across all the GPUs.
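Concretely, a minimal sketch of that rescaling (assuming `loss_denom` is the total number of target tokens across all GPUs, and `out`, `tgt_shard`, `tgt_len` are the per-GPU outputs and targets; the names are illustrative):

# On each GPU shard: sum the per-token losses and divide by the GLOBAL
# token count, so that summing the resulting gradients across GPUs
# yields the average gradient w.r.t. all tokens in the batch.
ls = loss_function(out, tgt_shard[:, 1:], tgt_len - 1).sum() / float(loss_denom)
ls.backward()
# after all shards: trainer.allreduce_grads(); trainer.update(1)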

szhengac avatar Nov 29 '18 12:11 szhengac

Codecov Report

Merging #435 into master will decrease coverage by 9.99%. The diff coverage is 36.39%.

@@            Coverage Diff            @@
##           master     #435     +/-   ##
=========================================
- Coverage   72.83%   62.84%    -10%     
=========================================
  Files         113      151     +38     
  Lines        9609    14003   +4394     
=========================================
+ Hits         6999     8800   +1801     
- Misses       2610     5203   +2593
Flag         Coverage Δ
#PR431       ?
#PR435       63.87% <37.01%> (?)
#PR466       64.63% <37.01%> (?)
#PR588       89.46% <85.05%> (?)
#PR612       63.05% <36.86%> (?)
#PR639       63.65% <36.98%> (?)
#PR648       63.82% <37.01%> (?)
#master      63.78% <37.01%> (-12.36%) ↓
#notserial   39.27% <24.95%> (-10.94%) ↓
#py2         63.64% <36.21%> (-8.92%) ↓
#py3         62.72% <36.36%> (-9.97%) ↓
#serial      49.4% <24.37%> (-8.14%) ↓

codecov[bot] avatar Dec 06 '18 14:12 codecov[bot]

Job PR-435/9 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-435/9/index.html

mli avatar Dec 06 '18 15:12 mli

@szhengac @sxjscience could you please review the latest commit and check whether these changes work well? Thanks~

pengxin99 avatar Dec 11 '18 03:12 pengxin99

Looks good. I'll approve after I've confirmed that the results are the same.

sxjscience avatar Dec 13 '18 06:12 sxjscience

Job PR-435/13 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-435/13/index.html

mli avatar Dec 31 '18 10:12 mli

@szha could you please help me find the problem, or where I went wrong? Thanks~ I used the new gradient clipping (PR #470) for this multi-GPU training, but I got this error:

Parameter 'gnmt_enc_rnn0_l_i2h_weight' was not initialized on context cpu(0). It was only initialized on [gpu(0), gpu(1), gpu(2), gpu(3)]

I think it may be the same as the MXNet issue "Cannot call data() on parameters initialized on multiple GPUs".

Main code

  • parameter init code
# Use all requested GPUs, or fall back to CPU.
context = [mx.cpu()] if args.gpus is None or args.gpus == '' else \
    [mx.gpu(int(x)) for x in args.gpus.split(',')]
ctx = context[0]

# load GNMT model
encoder, decoder = get_gnmt_encoder_decoder(hidden_size=args.num_hidden,
                                            dropout=args.dropout,
                                            num_layers=args.num_layers,
                                            num_bi_layers=args.num_bi_layers,
                                            attention_cell=args.attention_type)

model = NMTModel(src_vocab=src_vocab, tgt_vocab=tgt_vocab, encoder=encoder, decoder=decoder,
                 embed_size=args.num_embedding, prefix='gnmt_')


model.initialize(init=mx.init.Uniform(0.1), ctx=context)
static_alloc = True
model.hybridize(static_alloc=static_alloc)
  • call clip_grad_global_norm code
            LS = []  # per-shard rescaled losses
            with mx.autograd.record():
                for src_seq, tgt_seq, src_valid_length, tgt_valid_length in seqs:
                    # `seqs` holds the per-GPU shards of the global batch;
                    # see the sketch after this snippet.
                    out, _ = model(src_seq, tgt_seq[:, :-1], src_valid_length, tgt_valid_length - 1)
                    ls = loss_function(out, tgt_seq[:, 1:], tgt_valid_length - 1).sum()
                    # Rescale so the summed gradients average over all tokens.
                    LS.append(ls * (tgt_seq.shape[1] - 1) / float(loss_denom))
            for L in LS:
                L.backward()

            # grads = [p.grad(c) for p in model.collect_params().values() for c in context]
            # gnorm = clip_global_norm(grads, args.clip, context_len=len(context))
            trainer.allreduce_grads()
            gnorm = clip_grad_global_norm(model.collect_params().values(), args.clip)
            trainer.update(args.batch_size)

            log_avg_gnorm += gnorm
pengxin99 avatar Jan 02 '19 14:01 pengxin99

@pengxin99 the problem is that the parameter's get method relies on the current context when grad() is called without an argument. I will post a fix in GluonNLP first and see how to resolve it upstream.
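Until that fix lands, one possible workaround (a sketch, not the actual patch) is to fetch the gradients with explicit contexts instead of relying on the current context:

# Collect gradient arrays for every parameter on every GPU explicitly,
# avoiding the implicit cpu(0) lookup that raises the error above.
grads = [p.grad(c)
         for p in model.collect_params().values() if p.grad_req != 'null'
         for c in context]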

szha avatar Jan 06 '19 20:01 szha

#527

szha avatar Jan 06 '19 20:01 szha

@szha Thanks. I will test GNMT performance with multiple GPUs, and then update this PR as soon as possible.

pengxin99 avatar Jan 07 '19 08:01 pengxin99

Job PR-435/14 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-435/14/index.html

mli avatar Jan 12 '19 02:01 mli

Job PR-435/15 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-435/15/index.html

mli avatar Jan 13 '19 01:01 mli

Job PR-435/16 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-435/16/index.html

mli avatar Jan 30 '19 13:01 mli

@eric-haibin-lin will help drive this PR forward. cc @pengxin99

szha avatar Feb 18 '19 05:02 szha