
About the arXiv benchmark and the paper

Open vince62s opened this issue 1 year ago • 12 comments

Hi guys,

I would like to report a "bias" in the benchmark from your latest paper. I tested these scripts: https://github.com/awslabs/sockeye/tree/arxiv_sockeye3/arxiv. The issue is that the three toolkits save checkpoints differently (for example, Sockeye does not save the optimizer state, while we at ONMT do). The checkpoint file is 1.1 GB for Sockeye, 3.0 GB for ONMT, and in between for FS (Fairseq).
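
To make the size difference concrete, here is a minimal, toolkit-agnostic sketch (not any toolkit's actual saving code) of how including the Adam optimizer state roughly triples a PyTorch checkpoint, since Adam keeps two extra tensors per parameter:

```python
# Minimal sketch: model-only checkpoint vs. model + optimizer checkpoint.
# The single linear layer is just a stand-in for a real translation model.
import os
import torch

model = torch.nn.Linear(4096, 4096)
optimizer = torch.optim.Adam(model.parameters())

# One dummy update so the Adam state tensors (exp_avg, exp_avg_sq) actually exist.
model(torch.randn(2, 4096)).sum().backward()
optimizer.step()

torch.save(model.state_dict(), "model_only.pt")
torch.save({"model": model.state_dict(), "optim": optimizer.state_dict()},
           "model_and_optim.pt")

for path in ("model_only.pt", "model_and_optim.pt"):
    print(path, round(os.path.getsize(path) / 2**20), "MiB")
```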

If you change the validation/checkpoint interval from 500 updates to 5000, then on a run of 10,000 updates on a single GPU (3070 Ti) you get: FS 11m30, Sockeye 12m11, ONMT 13m30.

Indeed, ONMT is about 10% slower, but not 60%; by the way, FS is a bit faster than Sockeye. The paper is misleading on these numbers.

This is not the explanation, but the vocab size is also wrong because of a mismatch in min_frequency, which needs to be set to 3 (not 2, because the counter adds up both src and tgt counts).

For inference, I strongly suggest you look at https://github.com/OpenNMT/CTranslate2, which is our focus, rather than trying to squeeze more performance out of PyTorch.

Last but not least, we will try to figure out why there is a drift in BLEU for RU-EN.

cc: @guillaumekln @francoishernandez @mjdenkowski

Cheers,

Vincent

EDIT: ONMT takes 12m25 after the fix of the valid batch size, so technically all three are equivalent.

vince62s avatar Jul 19 '22 20:07 vince62s

Hi Vincent,

Sockeye saves the training state separately from the model parameters. The training_state directory contains state files for the optimizer, data iterator, etc. At each checkpoint, Sockeye saves the current training state and cleans up (deletes) the previous training state. You can check the total size of Sockeye's last checkpoint with: du -ch $(readlink -f training_state/params) training_state.

For the training benchmark, we measure end-to-end time for a recipe that saves checkpoints often enough to have a good pool of parameters files to average. Using different settings to measure update speed and checkpoint saving speed independently is an interesting idea.

One goal of the benchmark is to use settings that are as close as possible across toolkits. If you can share better settings for OpenNMT's vocabulary, that would be great!

It looks like you're getting some impressive results with CTranslate2! Our inference benchmark focuses on implementations that can be extended at the Python level for research experiments while still maintaining the benefits of an optimized codebase. It's a good point that compatible models trained with any of the PyTorch toolkits could be converted and run with optimized CPU/GPU inference implementations.

Best, Michael

mjdenkowski avatar Jul 19 '22 22:07 mjdenkowski

Like Marian, we have a mechanism to average models on the fly (the average_decay option), which is why we don't save checkpoints so often. But I am curious to understand why the saving time is SO different (maybe pickle.dump is faster than torch.save). Another difference is that we do not preprocess data; we "transform" on the fly, which matters when you have a huge dataset. Anyway, all of this raises some points for improvement. Cheers.
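
For context, the on-the-fly averaging mentioned above is essentially an exponential moving average of the parameters. A minimal sketch of the idea (not OpenNMT-py's actual average_decay implementation):

```python
# Rough sketch of on-the-fly parameter averaging (exponential moving average).
import torch

def init_average(model: torch.nn.Module) -> dict:
    """Start the running average from a detached copy of the current weights."""
    return {name: p.detach().clone() for name, p in model.named_parameters()}

def update_average(avg: dict, model: torch.nn.Module, decay: float = 0.9999) -> None:
    """After each optimizer step, move the running average toward the new weights."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            avg[name].mul_(decay).add_(p.detach(), alpha=1.0 - decay)

# The averaged weights (not the raw ones) are then used for validation and final
# checkpoints, which removes the need to save many checkpoints just for averaging.
```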

NB: did you use 3584 x 14 instead of 5000 x 10 for the ONMT batch size because it did not fit in memory?

vince62s avatar Jul 20 '22 07:07 vince62s

Okay, in fact there is an issue with the valid batch size: it is now 8 tokens when batch_type=tokens, whereas it used to be 8 sentences (we need to update the doc). So validation (not saving) takes way too long because it processes sentences one by one. Please update your big.yaml file as follows:

  • src_words_min_frequency: 3 and tgt_words_min_frequency: 3, which sets the vocabulary to the same size as FS/Sockeye
  • valid_batch_size: 2048
  • as per my previous comment, batch_size: 3584 with accum_count: [14] should become batch_size: 5000 with accum_count: [10], if that fits in memory, to be comparable with the other toolkits (see the sketch below)
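
If it helps, here is a small hypothetical helper (not part of either repo's scripts) that applies these settings to big.yaml; it assumes PyYAML is installed and that the file path matches the benchmark recipe:

```python
# Hypothetical helper: patch big.yaml with the settings recommended above.
import yaml

CONFIG = "big.yaml"  # adjust the path to wherever the benchmark config lives

with open(CONFIG) as f:
    cfg = yaml.safe_load(f)

cfg.update({
    "src_words_min_frequency": 3,   # match the FS/Sockeye vocabulary size
    "tgt_words_min_frequency": 3,
    "valid_batch_size": 2048,       # tokens, since batch_type=tokens
    "batch_size": 5000,             # only if this fits in GPU memory
    "accum_count": [10],
})

with open(CONFIG, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```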

vince62s avatar Jul 20 '22 09:07 vince62s

Thanks for sharing these settings!

With the updated frequencies, OpenNMT's vocabulary size matches Sockeye's.

We set the batch size for each toolkit to leave enough free GPU memory to avoid overflow from variable memory usage during training. With a batch size of 5000, Sockeye stabilizes at around 75% memory usage, Fairseq reaches up to 93%, and OpenNMT nearly 100%. If that's not a risk for OpenNMT, a batch size of 5000 may speed up training.

We'll run a WMT benchmark with your settings.

mjdenkowski avatar Jul 20 '22 16:07 mjdenkowski

Michael, when looking at the config again, there is still a discrepancy that can explain the BLEU difference. When you set rsqrt in ONMT, there is no linear increase from step 0 up to the warmup steps: the learning rate is flat and obviously too high for fusedadam, triggering a lot of gradient overflows. I suggest you set "noam" and lr=2, which will be comparable to your Sockeye invsqrt scheduler.
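
To illustrate the difference, here is a rough sketch of the two schedule shapes being discussed; the constants (d_model, warmup, base rates) are illustrative and the toolkits' actual implementations may differ:

```python
# "noam" (Vaswani et al., 2017): linear warmup to a peak, then inverse-sqrt decay.
def noam_lr(step: int, d_model: int = 1024, warmup: int = 4000, factor: float = 2.0) -> float:
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The "rsqrt" behavior described above: no ramp-up, the rate sits at its peak
# from the first step until warmup, then decays as 1/sqrt(step).
def flat_rsqrt_lr(step: int, base_lr: float = 2.0, warmup: int = 4000) -> float:
    return base_lr * min(1.0, (warmup / max(step, 1)) ** 0.5)

if __name__ == "__main__":
    for s in (1, 1000, 4000, 16000, 64000):
        print(f"step {s:>6}: noam={noam_lr(s):.6f}  flat_rsqrt={flat_rsqrt_lr(s):.6f}")
```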

Hope it's clear.

vince62s avatar Jul 21 '22 08:07 vince62s

It looks like using "noam" decay with learning rate "2" gives us the right learning schedule.

We'll run a benchmark with the updated settings.

mjdenkowski avatar Jul 22 '22 17:07 mjdenkowski

Hi Vincent,

Running with your recommended settings results in faster training and higher BLEU scores:

  • WMT17 En-De: 13.7 hours, 35.2 BLEU
  • WMT17 Ru-En: 39.4 hours, 32.2 BLEU

We'll update the paper with these values and add you to the acknowledgements section. Thank you for helping us run an accurate benchmark!

Best, Michael

mjdenkowski avatar Jul 25 '22 14:07 mjdenkowski

The wall time still looks high. Were you able to run with a batch size of 5000 and an update interval of 10? Did you keep the log?

vince62s avatar Jul 25 '22 15:07 vince62s

Yes, we've updated the config file to include all of your recommendations including batch size 5000 and update interval 10.

This is the log for the WMT17 En-De model: onmt_train_wmt17_en_de.log.

mjdenkowski avatar Jul 25 '22 15:07 mjdenkowski

We've updated the paper with the new results: Sockeye 3: Fast Neural Machine Translation with PyTorch.

mjdenkowski avatar Jul 27 '22 01:07 mjdenkowski

Great, thanks. It seems that TorchScript brings the 5-10% improvement on 1 GPU, but I am unsure about the big gap on 8 GPUs. We use torch.distributed as well, so maybe we need to try TorchScript too. FYI, I've been unable to get a p3.8xlarge instance on us-east-1 for the past 3 days on a brand-new account, even with a limit of 32.

"We currently do not have sufficient p3.8xlarge capacity in zones with support for 'gp3' volumes. Our system will be working on provisioning additional capacity."

So it's not easy to benchmark... you guys may have better privileges. :)

vince62s avatar Jul 27 '22 07:07 vince62s

Small note: there seems to be a misconfiguration in your hyperref setup that is causing links in the references section not to wrap.

[screenshot of the references section with links overflowing the margin]

mjpost avatar Jul 27 '22 12:07 mjpost

Before closing this issue, two more comments:

  1. The test sets are WMT16, not WMT17 (at least for En-De); let's not confuse people into thinking you got 35 BLEU on WMT17.
  2. For ONMT, in order to be comparable, length_penalty needs to be set to "avg" to replicate the FS behavior (see the sketch below). I just opened a PR on OpenNMT-py to set this by default. Cheers.
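
For readers wondering what "avg" changes, here is a rough sketch of the two beam scoring conventions (the actual toolkit code differs):

```python
# Unnormalized: sum of token log-probs, which favors shorter hypotheses.
def score_sum(logprobs: list[float]) -> float:
    return sum(logprobs)

# "avg"-style: average log-prob per token, matching Fairseq-style length
# normalization, so hypotheses of different lengths compete fairly.
def score_avg(logprobs: list[float]) -> float:
    return sum(logprobs) / max(len(logprobs), 1)
```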

vince62s avatar Sep 05 '22 15:09 vince62s