sagemaker-tensorflow-training-toolkit

pytest test/integration error

Open ChaiBapchya opened this issue 5 years ago • 4 comments

Test integration

pytest test/integration/sagemaker/test_horovod.py --docker-base-name sm-tf-horovod-integration --tag latest --framework-version 1.15.0 --processor gpu

Error stacktrace:

sagemaker.exceptions.UnexpectedStatusException: Error for Training job test-tf-horovod-1591768266-74da: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
E           Command "mpirun --host algo-1 -np 1 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/usr/local/lib/python3.6/dist-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -verbose -x orte_base_help_aggregate=0 -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_HP_MODEL_DIR -x PYTHONPATH /usr/bin/python

Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.

2020-06-10 05:55:55 Uploading - Uploading generated training model
2020-06-10 05:55:55 Failed - Training job failed
======================================================= short test summary info =======================================================
FAILED test/integration/sagemaker/test_horovod.py::test_distributed_training_horovod[gpu-3] - sagemaker.exceptions.UnexpectedStatusE...

ChaiBapchya, Jun 10 '20 08:06

--docker-base-name sm-tf-horovod-integration --tag latest

What image did you use for your test run?

laurenyu, Jun 16 '20 21:06

Likely an image I built locally and pushed to ECR using the steps in the README.

ChaiBapchya, Jun 17 '20 04:06

Running

pytest test/integration/sagemaker/test_horovod.py --account-id 763104351884 --docker-base-name tensorflow-training --tag 1.15.0-gpu-py3 --processor gpu --dockerfile-type dlc.gpu

produced

[ip-10-0-79-182.us-west-2.compute.internal:00039] 1 more process has sent help message help-orte-odls-default.txt / memory not bound
[ip-10-0-79-182.us-west-2.compute.internal:00039] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,0]<stderr>:    "__main__", mod_spec)
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>:    exec(code, run_globals)
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/mpi4py/__main__.py", line 7, in <module>
[1,0]<stderr>:    main()
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/mpi4py/run.py", line 196, in main
[1,0]<stderr>:    run_command_line(args)
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/mpi4py/run.py", line 47, in run_command_line
[1,0]<stderr>:    run_path(sys.argv[0], run_name='__main__')
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 263, in run_path
[1,0]<stderr>:    pkg_name=pkg_name, script_name=fname)
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,0]<stderr>:    mod_name, mod_spec, pkg_name, script_name)
[1,0]<stderr>:  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>:    exec(code, run_globals)
[1,0]<stderr>:  File "horovod_mnist.py", line 46, in <module>
[1,0]<stderr>:    loss = tf.losses.SparseCategoricalCrossentropy()
[1,0]<stderr>:  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/module_wrapper.py", line 193, in __getattr__
[1,0]<stderr>:    attr = getattr(self._tfmw_wrapped_module, name)
[1,0]<stderr>:AttributeError: module 'tensorflow._api.v1.losses' has no attribute 'SparseCategoricalCrossentropy'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[1061,1],1]
  Exit code:    1
--------------------------------------------------------------------------
2020-06-17 16:26:30,741 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "mpirun --host algo-1,algo-2 -np 2 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/usr/local/lib/python3.6/dist-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -verbose -x orte_base_help_aggregate=0 -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_HP_MODEL_DIR -x PYTHONPATH /usr/bin/python3 -m mpi4py horovod_mnist.py --model_dir s3://sagemaker-us-west-2-583851319346/test-tf-horovod-1592410946-69f0/model"
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
    subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.

2020-06-17 16:26:39 Failed - Training job failed

which seems to match the partial stacktrace you provided. The actual error message looks to be:

AttributeError: module 'tensorflow._api.v1.losses' has no attribute 'SparseCategoricalCrossentropy'

which seems to have been a known bug in older versions of TF: https://github.com/tensorflow/tensorflow/issues/26007, https://github.com/tensorflow/tensorflow/issues/26012.

Running with TF 1.15.2 also failed, but running with TF 2.2 passed.

This makes me believe that the issue is with the TF installation rather than the code in this repository. I'll pass this along to the owners of https://github.com/aws/deep-learning-containers.
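
For anyone triaging a similar log: the useful line is the one tagged by `mpirun --tag-output` (the `[1,0]<stderr>:` prefix), not the trailing `CalledProcessError`. Below is a rough sketch of pulling those lines out of a saved log. The function name and the regular expressions are illustrative assumptions, not part of this repository or of SageMaker, and the exception pattern is only a heuristic.

```python
import re

# Prefix added by mpirun's --tag-output, e.g. "[1,0]<stderr>:" (job id, rank, stream).
TAG_PREFIX = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:")

# Heuristic: the last line of a Python traceback looks like "SomeError: message".
EXCEPTION_LINE = re.compile(r"^([A-Za-z_][\w.]*(?:Error|Exception)): (.*)$")


def find_rank_exceptions(log_text):
    """Return (rank, exception type, message) for each tagged exception line."""
    found = []
    for line in log_text.splitlines():
        tag = TAG_PREFIX.match(line)
        if tag is None:
            continue  # untagged lines come from mpirun or the container wrapper
        body = line[tag.end():]
        exc = EXCEPTION_LINE.match(body)
        if exc:
            found.append((int(tag.group(2)), exc.group(1), exc.group(2)))
    return found


# A few lines copied from the log above.
sample = """\
[1,0]<stderr>:  File "horovod_mnist.py", line 46, in <module>
[1,0]<stderr>:    loss = tf.losses.SparseCategoricalCrossentropy()
[1,0]<stderr>:AttributeError: module 'tensorflow._api.v1.losses' has no attribute 'SparseCategoricalCrossentropy'
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.
"""

for rank, exc_type, message in find_rank_exceptions(sample):
    print(f"rank {rank}: {exc_type}: {message}")
```

The untagged `CalledProcessError` line is skipped on purpose, since it only reports that the `train` command failed and says nothing about why.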

laurenyu, Jun 17 '20 17:06

Awesome. Thanks for redirecting to the concerned folks.

ChaiBapchya, Jun 18 '20 06:06
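
A possible script-side workaround, sketched under assumptions: the traceback shows `tf.losses.SparseCategoricalCrossentropy` missing from the v1 API, so a test script could look the class up under more than one path and use the first that resolves. The helper `resolve_first` is hypothetical. The fallback path `keras.losses.SparseCategoricalCrossentropy` is an assumption about the TF 1.15 image that this thread does not confirm. The demo uses a stand-in object so it runs without TensorFlow installed.

```python
from types import SimpleNamespace


def resolve_first(root, dotted_paths):
    """Return the first attribute reachable from root via one of the dotted paths."""
    for path in dotted_paths:
        obj = root
        try:
            for part in path.split("."):
                obj = getattr(obj, part)
        except AttributeError:
            continue  # path missing in this version; try the next candidate
        return obj
    raise AttributeError(f"none of {list(dotted_paths)} found on {root!r}")


# Stand-in for a module layout where tf.losses lacks the class (hypothetical).
fake_loss = object()
fake_tf = SimpleNamespace(
    losses=SimpleNamespace(),
    keras=SimpleNamespace(
        losses=SimpleNamespace(SparseCategoricalCrossentropy=fake_loss)
    ),
)

loss_cls = resolve_first(
    fake_tf,
    [
        "losses.SparseCategoricalCrossentropy",
        "keras.losses.SparseCategoricalCrossentropy",
    ],
)
```

In a real script, `fake_tf` would be replaced by the imported `tensorflow` module. Whether the fallback path actually exists in a given image should be checked in that image before relying on it.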