sagemaker-tensorflow-training-toolkit
pytest test/integration error
Test integration
pytest test/integration/sagemaker/test_horovod.py --docker-base-name sm-tf-horovod-integration --tag latest --framework-version 1.15.0 --processor gpu
Error stacktrace:
sagemaker.exceptions.UnexpectedStatusException: Error for Training job test-tf-horovod-1591768266-74da: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
E Command "mpirun --host algo-1 -np 1 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/usr/local/lib/python3.6/dist-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -verbose -x orte_base_help_aggregate=0 -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_HP_MODEL_DIR -x PYTHONPATH /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.
2020-06-10 05:55:55 Uploading - Uploading generated training model
2020-06-10 05:55:55 Failed - Training job failed
======================================================= short test summary info =======================================================
FAILED test/integration/sagemaker/test_horovod.py::test_distributed_training_horovod[gpu-3] - sagemaker.exceptions.UnexpectedStatusE...
--docker-base-name sm-tf-horovod-integration --tag latest
What image did you use for your test run?
Most likely an image I built locally and pushed to ECR using the steps in the README.
Running
pytest test/integration/sagemaker/test_horovod.py --account-id 763104351884 --docker-base-name tensorflow-training --tag 1.15.0-gpu-py3 --processor gpu --dockerfile-type dlc.gpu
produced
[ip-10-0-79-182.us-west-2.compute.internal:00039] 1 more process has sent help message help-orte-odls-default.txt / memory not bound
[ip-10-0-79-182.us-west-2.compute.internal:00039] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,0]<stderr>: "__main__", mod_spec)
[1,0]<stderr>: File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>: exec(code, run_globals)
[1,0]<stderr>: File "/usr/local/lib/python3.6/dist-packages/mpi4py/__main__.py", line 7, in <module>
[1,0]<stderr>: main()
[1,0]<stderr>: File "/usr/local/lib/python3.6/dist-packages/mpi4py/run.py", line 196, in main
[1,0]<stderr>: run_command_line(args)
[1,0]<stderr>: File "/usr/local/lib/python3.6/dist-packages/mpi4py/run.py", line 47, in run_command_line
[1,0]<stderr>: run_path(sys.argv[0], run_name='__main__')
[1,0]<stderr>: File "/usr/lib/python3.6/runpy.py", line 263, in run_path
[1,0]<stderr>: pkg_name=pkg_name, script_name=fname)
[1,0]<stderr>: File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,0]<stderr>: mod_name, mod_spec, pkg_name, script_name)
[1,0]<stderr>: File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
[1,0]<stderr>: exec(code, run_globals)
[1,0]<stderr>: File "horovod_mnist.py", line 46, in <module>
[1,0]<stderr>: loss = tf.losses.SparseCategoricalCrossentropy()
[1,0]<stderr>: File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/module_wrapper.py", line 193, in __getattr__
[1,0]<stderr>: attr = getattr(self._tfmw_wrapped_module, name)
[1,0]<stderr>:AttributeError: module 'tensorflow._api.v1.losses' has no attribute 'SparseCategoricalCrossentropy'
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[1061,1],1]
Exit code: 1
--------------------------------------------------------------------------
2020-06-17 16:26:30,741 sagemaker-containers ERROR ExecuteUserScriptError:
Command "mpirun --host algo-1,algo-2 -np 2 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/usr/local/lib/python3.6/dist-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -verbose -x orte_base_help_aggregate=0 -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_HP_MODEL_DIR -x PYTHONPATH /usr/bin/python3 -m mpi4py horovod_mnist.py --model_dir s3://sagemaker-us-west-2-583851319346/test-tf-horovod-1592410946-69f0/model"
Traceback (most recent call last):
File "/usr/local/bin/dockerd-entrypoint.py", line 23, in <module>
subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.
2020-06-17 16:26:39 Failed - Training job failed
which seems to match the partial stacktrace you provided. The actual error message looks to be:
AttributeError: module 'tensorflow._api.v1.losses' has no attribute 'SparseCategoricalCrossentropy'
which seems to have been a known bug in older versions of TF: https://github.com/tensorflow/tensorflow/issues/26007, https://github.com/tensorflow/tensorflow/issues/26012.
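Since the real exception is buried between the mpirun wrapper output and the outer `CalledProcessError`, here is a minimal sketch of how it could be pulled out of a saved log. It assumes the log was produced with `--tag-output` (as in the command above), so each worker line carries a `[jobid,rank]<stream>:` prefix. The function name `find_rank_errors` and the regexes are illustrative only and are not part of this repository or of sagemaker-containers.

```python
import re

# With --tag-output, mpirun prefixes worker lines with "[jobid,rank]<stream>:".
TAG_PREFIX = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:")
# Heuristic for a Python exception summary line, e.g. "AttributeError: ...".
EXCEPTION_LINE = re.compile(r"^[A-Za-z_][\w.]*(Error|Exception)\b.*")


def find_rank_errors(log_text):
    """Return {rank: last exception summary line seen on that rank's stderr}."""
    errors = {}
    for line in log_text.splitlines():
        match = TAG_PREFIX.match(line)
        if not match or match.group(3) != "stderr":
            continue  # skip untagged lines and tagged stdout
        rank = int(match.group(2))
        message = line[match.end():].strip()
        if EXCEPTION_LINE.match(message):
            errors[rank] = message
    return errors


# A few lines copied from the log above.
SAMPLE_LOG = """\
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "horovod_mnist.py", line 46, in <module>
[1,0]<stderr>: loss = tf.losses.SparseCategoricalCrossentropy()
[1,0]<stderr>:AttributeError: module 'tensorflow._api.v1.losses' has no attribute 'SparseCategoricalCrossentropy'
subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 1.
"""

for rank, message in sorted(find_rank_errors(SAMPLE_LOG).items()):
    print("rank %d: %s" % (rank, message))
```

The untagged `CalledProcessError` line is deliberately ignored, because it only reports that the `train` entry point failed and not why.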
Running with TF 1.15.2 also failed, but running with TF 2.2 passed.
This makes me believe that the issue is with the TF installation rather than the code in this repository. I'll pass this along to the owners of https://github.com/aws/deep-learning-containers.
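If someone needs the test script to run on both TF 1.15 and TF 2.x in the meantime, one possible workaround (an assumption on my part, not something verified in this thread) is to look the loss class up under `tf.keras.losses` first and fall back to `tf.losses`. The sketch below shows that lookup pattern with stand-in objects so it runs without TensorFlow installed. `resolve_attr`, `LOSS_PATHS`, `FakeLoss`, and `fake_tf1` are hypothetical names for illustration.

```python
from types import SimpleNamespace


def resolve_attr(root, dotted_paths):
    """Return the first attribute that exists under root among dotted_paths."""
    for path in dotted_paths:
        obj = root
        try:
            for part in path.split("."):
                obj = getattr(obj, part)
        except AttributeError:
            continue  # this path is missing in this version, try the next one
        return obj
    raise AttributeError("none of %r found" % (dotted_paths,))


# Candidate locations, in order of preference (assumed, check against your TF version).
LOSS_PATHS = [
    "keras.losses.SparseCategoricalCrossentropy",
    "losses.SparseCategoricalCrossentropy",
]


class FakeLoss:
    """Stand-in for the real loss class."""


# Stand-in shaped like the failing case: `losses` lacks the class, `keras.losses` has it.
fake_tf1 = SimpleNamespace(
    losses=SimpleNamespace(),
    keras=SimpleNamespace(losses=SimpleNamespace(SparseCategoricalCrossentropy=FakeLoss)),
)

loss_cls = resolve_attr(fake_tf1, LOSS_PATHS)
print(loss_cls.__name__)
```

With a real installation the call would be `resolve_attr(tf, LOSS_PATHS)` after `import tensorflow as tf`. This only sidesteps the missing alias in the script and does not address whatever is wrong in the image itself.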
Awesome. Thanks for redirecting to the concerned folks.