gluon-nlp
Modify code to make train_gnmt support multi-GPU training
Description
Modify train_gnmt.py to support a multi-GPU environment. The change follows train_transformer.py, which already supports multi-GPU training.
Checklist
Essentials
- [x] Changes are complete (i.e. I finished coding on this PR)
- [x] All changes have test coverage
- [x] Code is well-documented
Changes
- [ ] Feature1, tests, (and when applicable, API doc)
- [ ] Feature2, tests, (and when applicable, API doc)
Comments
- If this change is a backward incompatible change, why must this change be made.
- Interesting edge cases to note here
@pengxin99 this is awesome. Thanks for the contribution! Looks like git wasn't able to do an auto-merge when checking out. Could you try the following for a rebase?
```
git remote add dmlc https://github.com/dmlc/gluon-nlp
git pull dmlc master --rebase
git push --force
```
@szha Thanks, I will make the appropriate modifications.
@szhengac Thanks for the review.
> The loss needs to be carefully rescaled, so that the final gradient is the average of the gradients w.r.t. tokens.

Does this mean the loss on each context should be averaged over the global batch?
@pengxin99 Yes, the gradient needs to be averaged by the total number of tokens across all the GPUs.
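(For illustration, a minimal sketch of that rescaling. The names `model`, `loss_function`, `shards`, and `loss_denom` stand in for the script's objects; the sketch assumes `loss_function` returns per-token losses, `shards` holds one `(src, tgt, src_len, tgt_len)` tuple per GPU, and `loss_denom` is the total number of target tokens in the global batch.)

```python
import mxnet as mx

# Sketch only: scale each GPU's summed token loss by the *global* token count,
# so that after the gradients are reduced across GPUs they equal the average
# gradient over all tokens in the global batch.
losses = []
with mx.autograd.record():
    for src_seq, tgt_seq, src_valid_length, tgt_valid_length in shards:
        out, _ = model(src_seq, tgt_seq[:, :-1], src_valid_length, tgt_valid_length - 1)
        ls = loss_function(out, tgt_seq[:, 1:], tgt_valid_length - 1).sum()
        losses.append(ls / float(loss_denom))
for ls in losses:
    ls.backward()
```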
Codecov Report
Merging #435 into master will decrease coverage by 9.99%. The diff coverage is 36.39%.
```diff
@@            Coverage Diff            @@
##           master     #435      +/-  ##
=========================================
- Coverage   72.83%   62.84%   -9.99%
=========================================
  Files         113      151      +38
  Lines        9609    14003    +4394
=========================================
+ Hits         6999     8800    +1801
- Misses       2610     5203    +2593
```
| Flag | Coverage Δ | |
|---|---|---|
| #PR431 | ? | |
| #PR435 | 63.87% <37.01%> (?) | |
| #PR466 | 64.63% <37.01%> (?) | |
| #PR588 | 89.46% <85.05%> (?) | |
| #PR612 | 63.05% <36.86%> (?) | |
| #PR639 | 63.65% <36.98%> (?) | |
| #PR648 | 63.82% <37.01%> (?) | |
| #master | 63.78% <37.01%> (-12.36%) | :arrow_down: |
| #notserial | 39.27% <24.95%> (-10.94%) | :arrow_down: |
| #py2 | 63.64% <36.21%> (-8.92%) | :arrow_down: |
| #py3 | 62.72% <36.36%> (-9.97%) | :arrow_down: |
| #serial | 49.4% <24.37%> (-8.14%) | :arrow_down: |
Job PR-435/9 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-435/9/index.html
@szhengac @sxjscience could you please review the latest commit and check whether these changes work well? Thanks~
Looks good. I'll approve after I confirm that the results are the same.
Job PR-435/13 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-435/13/index.html
@szha could you please help me figure out what the problem is, or where I went wrong? Thanks~ I used the new gradient clipping (PR #470) for this multi-GPU training, but I got this error:
Parameter 'gnmt_enc_rnn0_l_i2h_weight' was not initialized on context cpu(0). It was only initialized on [gpu(0), gpu(1), gpu(2), gpu(3)]
I think it may be the same as the MXNet issue "Cannot call data() on parameters initialized on multiple GPUs".
Main code:
- parameter init code:

```python
context = [mx.cpu()] if args.gpus is None or args.gpus == '' else \
    [mx.gpu(int(x)) for x in args.gpus.split(',')]
ctx = context[0]

# load GNMT model
encoder, decoder = get_gnmt_encoder_decoder(hidden_size=args.num_hidden,
                                            dropout=args.dropout,
                                            num_layers=args.num_layers,
                                            num_bi_layers=args.num_bi_layers,
                                            attention_cell=args.attention_type)
model = NMTModel(src_vocab=src_vocab, tgt_vocab=tgt_vocab, encoder=encoder, decoder=decoder,
                 embed_size=args.num_embedding, prefix='gnmt_')
# initialize parameters on all GPUs in `context` instead of a single device
model.initialize(init=mx.init.Uniform(0.1), ctx=context)
static_alloc = True
model.hybridize(static_alloc=static_alloc)
```
- call clip_grad_global_norm code:

```python
with mx.autograd.record():
    for src_seq, tgt_seq, src_valid_length, tgt_valid_length in seqs:
        out, _ = model(src_seq, tgt_seq[:, :-1], src_valid_length, tgt_valid_length - 1)
        ls = loss_function(out, tgt_seq[:, 1:], tgt_valid_length - 1).sum()
        # rescale the per-GPU loss (see the discussion above on averaging over tokens)
        LS.append(ls * (tgt_seq.shape[1] - 1) / float(loss_denom))
for L in LS:
    L.backward()
# grads = [p.grad(c) for p in model.collect_params().values() for c in context]
# gnorm = clip_global_norm(grads, args.clip, context_len=len(context))
trainer.allreduce_grads()
gnorm = clip_grad_global_norm(model.collect_params().values(), args.clip)
trainer.update(args.batch_size)
log_avg_gnorm += gnorm
```
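(For context, `seqs` in the snippet above holds one shard per GPU. A minimal sketch of how such shards could be produced with `gluon.utils.split_and_load`; `shard_batch` is a hypothetical helper, not a function from the script.)

```python
from mxnet import gluon

def shard_batch(src_seq, tgt_seq, src_valid_length, tgt_valid_length, context):
    """Split one global batch along the batch axis, one shard per device in `context`."""
    splits = [gluon.utils.split_and_load(x, context, batch_axis=0, even_split=False)
              for x in (src_seq, tgt_seq, src_valid_length, tgt_valid_length)]
    # Regroup so each element is a (src, tgt, src_len, tgt_len) tuple on one GPU.
    return list(zip(*splits))
```

This sharding pattern pairs with the `trainer.allreduce_grads()` call above, which reduces the per-GPU gradients before the explicit norm clipping and `trainer.update()`; note that `allreduce_grads()` is only supported when the Trainer is created with `update_on_kvstore=False`.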
@pengxin99 the problem was that the parameter's get method relies on the current context when grad() is called without an argument. I will post a fix in GluonNLP first and see how to resolve it upstream.
#527
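(A minimal sketch of the workaround described above, assuming `model` and `context` from the earlier snippets: fetch each parameter's gradient on an explicit device instead of calling `grad()` with no argument, which falls back to the current context.)

```python
# Sketch only: pass an explicit context so nothing falls back to the default
# current context (cpu(0) in this case).
grads = [p.grad(ctx) for ctx in context
         for p in model.collect_params().values()
         if p.grad_req != 'null']

# Equivalently, Parameter.list_grad() returns the gradient arrays on every
# context the parameter was initialized on.
```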
@szha Thanks, I will test GNMT performance with multiple GPUs and then update this PR as soon as possible.
Job PR-435/14 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-435/14/index.html
Job PR-435/15 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-435/15/index.html
Job PR-435/16 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-435/16/index.html
@eric-haibin-lin will help drive this PR forward. cc @pengxin99