
Gluon NLP Horovod Issue

Open gauravrele87 opened this issue 4 years ago • 2 comments

I was trying to use the BERT model given here with the SageMaker MXNet estimator and Horovod, and it's giving me errors.

Script: https://github.com/dmlc/gluon-nlp/tree/v0.10.x/scripts/bert . I was running finetune_squad.py, launched with the following code:

from sagemaker.mxnet import MXNet
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()

role = sagemaker.get_execution_role()

hyperparameters = {
    'comm_backend':'horovod',
    }

num_instances = 2 # How many nodes you want to use.
instance_family = 'ml.p3.2xlarge' # Which instance type you want to use.
source_name = 'finetune_squad.py'

distributions = {
    'mpi': {
        'enabled': True,
        'processes_per_host': 2,  # Each instance has 8 gpus
        'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO',
    }
}

estimator = MXNet(
                entry_point=source_name,         #Script entry point.
                source_dir='.',                #Script Location
                role=role, 
                train_instance_type=instance_family,
                train_instance_count=num_instances,
                framework_version='1.7.0',            #MXNet version.
                train_volume_size=10,                #Size for the dataset.
                py_version='py3',                     #Python version.
                hyperparameters=hyperparameters,
                distributions=distributions           #For use with Horovod.
)
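
A side note on the configuration above: the comment says each instance has 8 GPUs, but `ml.p3.2xlarge` has a single V100 GPU (the 8-GPU type in that family is `ml.p3.16xlarge`), so `processes_per_host: 2` would place two Horovod workers on one GPU. Whether this is related to the error is not established in this thread. The following is only a minimal sketch of a pre-launch sanity check; the lookup table and the `check_processes_per_host` helper are hypothetical names for illustration and are not part of the SageMaker SDK.

```python
# Hypothetical helper, not part of the SageMaker SDK.
# GPU counts for the P3 instance family (1, 4 and 8 V100 GPUs respectively).
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def check_processes_per_host(instance_type, processes_per_host):
    """Return a warning string if more MPI processes than GPUs are requested.

    Returns None when the setting looks consistent or when the instance type
    is not in the lookup table (this sketch cannot check it in that case).
    """
    gpus = GPUS_PER_INSTANCE.get(instance_type)
    if gpus is None:
        return None
    if processes_per_host > gpus:
        return (
            f"{instance_type} has {gpus} GPU(s) but processes_per_host="
            f"{processes_per_host}; several workers would share one GPU."
        )
    return None


# The values used in the issue: two processes on a single-GPU instance.
warning = check_processes_per_host("ml.p3.2xlarge", 2)
if warning:
    print(warning)
```

With the values from the issue, the check prints a warning; one process per host on `ml.p3.2xlarge`, or eight on `ml.p3.16xlarge`, would pass it.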

Description

(A clear and concise description of what the bug is.)

Error Message

(Paste the complete error message, including stack trace.)

To Reproduce

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

Steps to reproduce

(Paste the commands you ran that produced the error.)

What have you tried to solve it?

Environment

We recommend using our script to collect the diagnostic information. Run the following command and paste the output below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here

gauravrele87, Oct 29 '20 20:10

Would you attach more details on how we may reproduce the issue?

sxjscience, Oct 29 '20 20:10

And here is the error: image(4)

gauravrele87, Oct 29 '20 20:10