[bug] Using a Pytorch image seems to be causing an ArgParser bug -- "bash: cannot set terminal process group (-1): Inappropriate ioctl for device"

Open VictorJouault opened this issue 3 years ago • 2 comments

Concise Description: When using a PyTorch container (see below), I see strange behavior that seems to cause ArgParser issues later on.

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04

Current behavior:

When using a PyTorch container (see above), I see strange behavior that seems to cause ArgParser issues later on. At the very beginning of the job, the message below is printed (linked to this issue).

bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

This seems to break things when I use the ArgParser to pass the hyperparameters to my model. The ArgParser works with other images, but with PyTorch images it fails with the following error:

Traceback (most recent call last):
  File "experiment.py", line 358, in <module>
    args.hyper_params = json.loads(args.hyper_params)
  File "/opt/conda/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
2022-01-03 21:58:00,347 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2022-01-03 21:58:00,347 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/opt/conda/bin/python3.8 experiment.py --data-bucket sagemaker-us-east-1-XXX --data-prefix sample_dataset --estimator CustomEstimator --hyper-params {"prediction_length": 168, "context_length": 672, "trainer_kwargs": {"max_epochs": 200}} --job-config {}"
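
One possible reading of the traceback above, not confirmed anywhere in this thread, is that the JSON value reached json.loads with its double quotes already removed, as would happen if the command line were re-parsed by a shell. The sketch below uses Python's shlex.split as a stand-in for POSIX shell word splitting; the shortened command string is an illustration, not the actual SageMaker invocation. A value with single-quoted keys (for example a Python dict repr) would produce the same error message, so this is only one candidate explanation.

```python
import json
import shlex

# Command line modelled on the failure report above (hyperparameters shortened).
command = 'experiment.py --hyper-params {"prediction_length": 168, "context_length": 672}'

# shlex.split approximates POSIX shell word splitting: the double quotes are
# consumed and the JSON text is broken apart at each space.
tokens = shlex.split(command)
value = tokens[2]  # the single token that ends up right after --hyper-params

try:
    json.loads(value)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = str(exc)

print(tokens)
print(error_message)
```

With this input the token following --hyper-params is {prediction_length: and decoding it raises the same "Expecting property name enclosed in double quotes: line 1 column 2 (char 1)" message seen in the log.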

Expected behavior:

Additional context: I am mainly looking for help on how to run my model on SageMaker. Currently, my script requires both MXNet and PyTorch because I am using GluonTS. When using a PyTorch image, I run into this bug. When using an MXNet image, I run into a Horovod error even though the image I use (763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.8.0-gpu-py37-cu110-ubuntu16.04) should be Horovod compatible.

Any suggestion appreciated, thanks!

VictorJouault avatar Jan 03 '22 22:01 VictorJouault

I am facing the same issue with 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04

dustin-liu-bgl avatar May 25 '22 06:05 dustin-liu-bgl

Experiencing a similar behaviour on 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.9.1-gpu-py3.8-cu111-ubuntu20.04. Seems like the shell is not correctly configured in this image.

We're using the above image as the base image for a custom Docker container. In the Dockerfile we specify the final container's default command via CMD some_command param1 param2 (shell form). Due to the misconfigured shell in the AWS image, running docker run [args] $CONTAINER leads to erroneous executions of the configured command. Using CMD ["some_command", "param1", "param2"] in the Dockerfile (exec form) avoids the erroneous execution but still prints the bash error messages.
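
For context, Docker runs a shell-form CMD through /bin/sh -c, while the exec form starts the program directly with the listed arguments. A rough Python analogue of that difference is sketched below, assuming a POSIX system with /bin/sh available. The payload and the child program are made up for illustration, and nothing here reproduces the image's shell configuration.

```python
import shlex
import subprocess
import sys

# Hypothetical argument containing characters a shell treats specially.
payload = '{"max_epochs": 200}'
# Child program: echo back the first argument it received.
child = "import sys; print(sys.argv[1])"

# Exec-form analogue: an argument list is handed to the child unchanged.
exec_result = subprocess.run(
    [sys.executable, "-c", child, payload],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Shell-form analogue: one string is interpreted by /bin/sh -c, which consumes
# the unescaped double quotes in the payload and splits it at the space.
shell_command = f"{shlex.quote(sys.executable)} -c {shlex.quote(child)} {payload}"
shell_result = subprocess.run(
    shell_command, shell=True,
    capture_output=True, text=True, check=True,
).stdout.strip()

print(exec_result)
print(shell_result)
```

In the list-based call the child sees the payload exactly as written; in the shell-based call it sees only the first shell word with the quotes stripped. That is the general reason an exec-form command is less sensitive to how the shell behaves.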

lckr avatar Jul 12 '22 12:07 lckr

Hi, we no longer support PyTorch 1.10 DLCs. We recommend upgrading to later PyTorch DLCs, see available_images.md for more information.

Feel free to reopen the ticket if the issue is still observed.

sirutBuasai avatar Mar 27 '24 00:03 sirutBuasai