deep-learning-containers
[bug] Using a Pytorch image seems to be causing an ArgParser bug -- "bash: cannot set terminal process group (-1): Inappropriate ioctl for device"
Concise Description: When using a Pytorch container (see below), I see a strange behavior, which seems to be causing ArgParser issues later on.
DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
Current behavior:
When using one of the PyTorch containers listed above, I see strange behavior, which seems to be causing ArgParser issues later on. At the very beginning of the job, the message below is printed (linked to this issue).
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
This seems to cause a failure when I use the ArgParser to read the hyperparameters for my model. The ArgParser works with other images, but with the PyTorch images the script fails with the following error:
Traceback (most recent call last):
  File "experiment.py", line 358, in <module>
    args.hyper_params = json.loads(args.hyper_params)
  File "/opt/conda/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
2022-01-03 21:58:00,347 sagemaker-training-toolkit ERROR Reporting training FAILURE
2022-01-03 21:58:00,347 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/opt/conda/bin/python3.8 experiment.py --data-bucket sagemaker-us-east-1-XXX --data-prefix sample_dataset --estimator CustomEstimator --hyper-params {"prediction_length": 168, "context_length": 672, "trainer_kwargs": {"max_epochs": 200}} --job-config {}"
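The JSONDecodeError above reports a problem at character 1, which is what `json.loads` raises when the first key is not wrapped in double quotes. The thread does not confirm what string actually reaches the script, so the following is only a defensive sketch: it assumes the value might arrive as a Python-style dict literal (single quotes) rather than strict JSON. The helper name `parse_hyper_params` is hypothetical and not part of any SageMaker API.

```python
import ast
import json


def parse_hyper_params(raw: str) -> dict:
    """Parse a --hyper-params value given as JSON or as a Python dict literal."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: the value may be a Python repr such as "{'max_epochs': 200}".
        # ast.literal_eval only evaluates literals, so it does not run arbitrary code.
        # It raises ValueError or SyntaxError if the string is not a valid literal.
        parsed = ast.literal_eval(raw)
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a mapping, got {type(parsed).__name__}")
    return parsed
```

Printing `repr(args.hyper_params)` just before the `json.loads` call in `experiment.py` would show the exact string the script receives, which is the quickest way to tell whether the quotes were lost or replaced before the script started.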
Expected behavior:
Additional context:
I am mainly looking for help on how to run my model on SageMaker. Currently, my script requires both MXNet and PyTorch because I am using GluonTS. When using a PyTorch image, I run into this bug. When running an MXNet image, I run into a Horovod error even though the image I use (763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.8.0-gpu-py37-cu110-ubuntu16.04) should be Horovod compatible.
Any suggestion appreciated, thanks!
I am facing the same issue with 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
Experiencing a similar behaviour on 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.9.1-gpu-py3.8-cu111-ubuntu20.04.
Seems like the shell is not correctly configured in this image.
We're using the above image as a base image for a custom docker container.
In the Dockerfile we specify a default command for the final container via CMD some_command param1 param2 (shell form).
Due to the misconfigured shell in the AWS image, running docker run [args] $CONTAINER leads to erroneous execution of the configured command.
Using CMD ["some_command", "param1", "param2"] in the Dockerfile (exec form) avoids the erroneous execution, but the bash error messages are still printed.
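As a rough illustration of why the two CMD forms can behave differently: the shell form hands the whole command line to a shell, which splits it into words and strips quotes, while the exec form passes the argument list through unchanged. The sketch below uses Python's `shlex.split` as a stand-in for POSIX shell word splitting. It is an analogy only, not Docker's implementation, and the command line is a shortened, made-up version of the one in the original report.

```python
import json
import shlex

# Shell form: one string, which a shell splits into words (removing quotes).
shell_form = 'experiment.py --hyper-params {"max_epochs": 200}'
shell_argv = shlex.split(shell_form)

# Exec form: the argument list is passed through exactly as written.
exec_argv = ["experiment.py", "--hyper-params", '{"max_epochs": 200}']

print("shell form argv:", shell_argv)
print("exec form argv: ", exec_argv)

# The exec-form argument is still valid JSON.
print(json.loads(exec_argv[2]))

# The shell-form argument has lost its quotes and been split at the space.
try:
    json.loads(shell_argv[2])
except json.JSONDecodeError as err:
    print("JSON error:", err)
```

If the JSON argument in the original report went through a shell without extra quoting, this kind of splitting is one possible explanation for the decode error, though nothing in the thread confirms that this is what happened.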
Hi, we no longer support PyTorch 1.10 DLCs. We recommend upgrading to later PyTorch DLCs, see available_images.md for more information.
Feel free to reopen the ticket if the issue is still observed.