Gluon NLP Horovod Issue
I am trying to run the BERT model from the scripts linked below using the SageMaker MXNet estimator with Horovod, and it fails with errors.
Script: https://github.com/dmlc/gluon-nlp/tree/v0.10.x/scripts/bert . I launched finetune_squad.py with the following code:
from sagemaker.mxnet import MXNet
import sagemaker

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()

hyperparameters = {
    'comm_backend': 'horovod',
}

num_instances = 2                  # How many nodes you want to use.
instance_family = 'ml.p3.2xlarge'  # Which instance type you want to use.
source_name = 'finetune_squad.py'

distributions = {
    'mpi': {
        'enabled': True,
        'processes_per_host': 2,   # Each instance has 8 gpus
        'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO'
    }
}

estimator = MXNet(
    entry_point=source_name,              # Script entry point.
    source_dir='.',                       # Script location.
    role=role,
    train_instance_type=instance_family,
    train_instance_count=num_instances,
    framework_version='1.7.0',            # MXNet version.
    train_volume_size=10,                 # Size for the dataset.
    py_version='py3',                     # Python version.
    hyperparameters=hyperparameters,
    distributions=distributions           # For use with Horovod.
)
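One thing that may be worth checking in the configuration above: with Horovod, each MPI process normally drives one GPU, so `processes_per_host` is usually set to at most the number of GPUs on the instance. The comment mentions 8 GPUs, but `ml.p3.2xlarge` has a single GPU. This may or may not be related to the error. Below is a minimal sketch of such a sanity check. The `GPUS_PER_INSTANCE` table and the `check_mpi_config` helper are hypothetical names written for illustration and are not part of the SageMaker SDK.

```python
# Hypothetical lookup table, filled in by hand for the P3 family.
# Extend it for any other instance types you use.
GPUS_PER_INSTANCE = {
    'ml.p3.2xlarge': 1,
    'ml.p3.8xlarge': 4,
    'ml.p3.16xlarge': 8,
}


def check_mpi_config(instance_type, distributions):
    """Return a list of warning strings for an MPI distribution config.

    Illustrative helper only: it compares 'processes_per_host' with the
    GPU count recorded in GPUS_PER_INSTANCE for the given instance type.
    """
    warnings = []
    mpi = distributions.get('mpi', {})
    if not mpi.get('enabled', False):
        # MPI is disabled, so there is nothing to check.
        return warnings

    gpus = GPUS_PER_INSTANCE.get(instance_type)
    procs = mpi.get('processes_per_host', 1)
    if gpus is None:
        warnings.append(
            'unknown instance type %r; cannot check GPU count' % instance_type
        )
    elif procs > gpus:
        warnings.append(
            'processes_per_host=%d exceeds the %d GPU(s) on %s'
            % (procs, gpus, instance_type)
        )
    return warnings


# Usage with the values from the configuration above.
distributions = {
    'mpi': {
        'enabled': True,
        'processes_per_host': 2,
        'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO',
    }
}
for message in check_mpi_config('ml.p3.2xlarge', distributions):
    print(message)
```

If the check reports a mismatch, either lowering `processes_per_host` to the GPU count or choosing a larger instance type would make the two agree.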
Description
(A clear and concise description of what the bug is.)
Error Message
(Paste the complete error message, including stack trace.)
To Reproduce
(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)
Steps to reproduce
(Paste the commands you ran that produced the error.)
What have you tried to solve it?
Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:
curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python
# paste outputs here
Would you attach more details on how we may reproduce the issue?
And here is the error: