Michael Denkowski


Hi Samuel,

It looks like the error occurs shortly after Sockeye reloads the best checkpoint. There's a [reported issue](https://github.com/pytorch/pytorch/issues/80809) for PyTorch 1.12.0 where loading checkpoints causes this type of error....

> re: Naming -- Maybe let's do "store"? This establishes some connection with the original paper's notion of "create a datastore" (but IMO "datastore" is too vague of a concept...

The benchmarks in the paper run a WMT17 En-De big transformer with batch size 1 on a c5.2xlarge EC2 instance. Differences in any of these dimensions can lead to different...

Thanks Tobi and Felix! I've made some improvements based on your feedback.

Hi Vincent,

Sockeye saves the training state separately from the model parameters. The `training_state` directory contains state files for the optimizer, data iterator, etc. At each checkpoint, Sockeye saves the...

Thanks for sharing these settings! With the updated frequencies, OpenNMT's vocabulary size matches Sockeye's. We set the batch size for each toolkit to leave enough free GPU memory to avoid...

It looks like using "noam" decay with learning rate "2" gives us the right learning schedule. We'll run a benchmark with the updated settings.
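For reference, the standard "noam" schedule (inverse-square-root decay after linear warmup) can be sketched as below. The base learning rate of 2 acts as a multiplicative factor; `d_model=1024` and `warmup_steps=8000` are assumptions for illustration, not values taken from the benchmark configs.

```python
def noam_lr(step, factor=2.0, d_model=1024, warmup_steps=8000):
    """Learning rate at a given training step under Noam decay:
    factor * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
    Rises linearly during warmup, then decays as 1/sqrt(step)."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup_steps ** -1.5)
```

With these inputs the rate peaks at the end of warmup and decays afterward, which is why a nominal "learning rate 2" yields effective rates well below 0.001.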

Hi Vincent,

Running with your recommended settings results in faster training and higher BLEU scores:

- WMT17 En-De: 13.7 hours, 35.2 BLEU
- WMT17 Ru-En: 39.4 hours, 32.2 BLEU

We'll...

Yes, we've updated the config file to include all of your recommendations, including batch size 5000 and update interval 10. This is the log for the WMT17 En-De model: [onmt_train_wmt17_en_de.log](https://github.com/awslabs/sockeye/files/9182735/onmt_train_wmt17_en_de.log).
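For readers following along, a minimal OpenNMT-py-style training config with these settings might look like the fragment below. This is illustrative only: key names follow OpenNMT-py conventions, and the `warmup_steps` value is an assumption; the actual benchmark configuration is in the linked training log.

```yaml
# Illustrative fragment, not the full benchmark config.
batch_type: tokens
batch_size: 5000        # tokens per batch
accum_count: [10]       # update interval: accumulate 10 batches per step
optim: adam
learning_rate: 2        # factor for noam decay
decay_method: noam
warmup_steps: 8000      # assumed value for illustration
```

With token batching, batch size 5000 and update interval 10 give an effective batch of roughly 50,000 tokens per optimizer step.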

We've updated the paper with the new results: [Sockeye 3: Fast Neural Machine Translation with PyTorch](https://arxiv.org/abs/2207.05851).