[bug] Detectron2 errors when installing on PyTorch DLC

Open austinmw opened this issue 2 years ago • 9 comments

Checklist

  • [X] I've prepended issue tag with type of change: [bug]
  • [X] (If applicable) I've attached the script to reproduce the bug
  • [X] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [X] (If applicable) I've documented below the tests I've run on the DLC image
  • [X] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description:

Importing Detectron2 fails after it is installed on top of the pytorch-training container. The failure appears to be related to smdebug.

How to reproduce:

> nvidia-docker run -it 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
> pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
> python -c "from detectron2 import model_zoo"

DLC image/dockerfile:

763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker

Current behavior:

Traceback:

root@fe0954d71a8e:/# python -c "from detectron2 import model_zoo"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/detectron2/model_zoo/__init__.py", line 8, in <module>
    from .model_zoo import get, get_config_file, get_checkpoint_url, get_config
  File "/opt/conda/lib/python3.8/site-packages/detectron2/model_zoo/model_zoo.py", line 9, in <module>
    from detectron2.modeling import build_model
  File "/opt/conda/lib/python3.8/site-packages/detectron2/modeling/__init__.py", line 2, in <module>
    from detectron2.layers import ShapeSpec
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/__init__.py", line 2, in <module>
    from .batch_norm import FrozenBatchNorm2d, get_norm, NaiveSyncBatchNorm, CycleBatchNormList
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/batch_norm.py", line 4, in <module>
    from fvcore.nn.distributed import differentiable_all_reduce
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/__init__.py", line 4, in <module>
    from .focal_loss import (
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 52, in <module>
    sigmoid_focal_loss_jit: "torch.jit.ScriptModule" = torch.jit.script(sigmoid_focal_loss)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
    fn = torch._C._jit_script_compile(
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py", line 838, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
    fn = torch._C._jit_script_compile(
RuntimeError:
undefined value has_torch_function_variadic:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
         >>> loss.backward()
    """
    if has_torch_function_variadic(input, target, weight, pos_weight):
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return handle_torch_function(
            binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
    targets = targets.float()
    p = torch.sigmoid(inputs)
    ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    p_t = p * targets + (1 - p) * (1 - targets)
    loss = ce_loss * ((1 - p_t) ** gamma)

Expected behavior:

No error on import
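
For anyone who wants to check this automatically (for example at the end of an image build), here is a minimal sketch of a smoke test. It runs the import from the reproduction steps in a fresh interpreter and reports whether it succeeded. The helper name `import_succeeds` and the timeout value are my own choices for illustration, not part of any DLC tooling.

```python
import subprocess
import sys


def import_succeeds(statement: str, timeout: float = 300.0) -> bool:
    """Run `statement` in a fresh Python interpreter; return True if it exits with 0."""
    result = subprocess.run(
        [sys.executable, "-c", statement],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # Print only the last stderr line, which is usually the exception message.
        lines = result.stderr.strip().splitlines()
        if lines:
            print(lines[-1], file=sys.stderr)
    return result.returncode == 0
```

Inside the container, `import_succeeds("from detectron2 import model_zoo")` should return True once the problem is fixed. Running the import in a subprocess keeps a failed import from leaving the calling interpreter in a half-initialized state.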

austinmw, Mar 24 '22 16:03

I'm getting the same error. @austinmw, did you ever resolve this issue?

d-v-dlee, May 09 '22 21:05

I had the same issue until I found this: https://github.com/aws-samples/amazon-sagemaker-pytorch-detectron2/issues/8
It seems that you need to extend the DLC with the official torch and torchvision packages.

salmenhsairi, May 09 '22 21:05

@salmenhsairi That is a known workaround, but really you shouldn't need to uninstall and reinstall torch.

austinmw, May 09 '22 21:05

@austinmw The reinstall is needed unless there's another way to upgrade the DLC's existing torch build, which is optimized, since detectron2 requires the complete one.

salmenhsairi, May 09 '22 22:05

@salmenhsairi There is not currently. My point is that it would be ideal for the SageMaker version of Torch to not be modified in a way that breaks compatibility with other libraries.

austinmw, May 09 '22 22:05

I'm using the huggingface container: 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10-transformers4.17-gpu-py38-cu113-ubuntu20.04

I extended the Huggingface container with the following commands:

RUN pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

RUN python -m pip install detectron2 -f \
  https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html

Interestingly, on the Huggingface container pip reports that the torch and torchvision packages are already up to date.

I got a similar error to Austin's:

RuntimeError: 
undefined value has_torch_function_variadic:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
         >>> loss.backward()
    """
    if has_torch_function_variadic(input, target, weight, pos_weight):
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return handle_torch_function(
            binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
    targets = targets.float()
    p = torch.sigmoid(inputs)
    ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")

Guess I'll try the PyTorch container instead...

d-v-dlee, May 09 '22 23:05

@d-v-dlee I don't think pip would report it as out of date, since the version still matches. The problem is that the build has been modified. You could uninstall and reinstall torch, though in your case there are already prebuilt Hugging Face containers that you can use.
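
Since the version string can't tell the two builds apart, one rough way to check is to look for the file named in the traceback. The sketch below assumes that `torch/utils/smdebug.py` is present only in the modified build, which is my reading of the traceback rather than something documented. Both helper names are made up for illustration.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_package_dir(name: str = "torch") -> Optional[Path]:
    """Locate a top-level package directory on disk without importing the package."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def has_smdebug_module(torch_dir: Path) -> bool:
    """Return True if utils/smdebug.py (the file named in the traceback) exists."""
    return (torch_dir / "utils" / "smdebug.py").is_file()
```

Calling `has_smdebug_module(find_package_dir("torch"))` in the container (after checking that the directory is not None) gives a quick hint about which build is installed, without triggering the import error itself.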

austinmw, May 10 '22 03:05

@d-v-dlee I am also using a Huggingface container, and the image below worked fine for me on an AWS ml.g4dn.xlarge instance. Try downloading torch from this link instead: https://download.pytorch.org/whl/torch_stable.html

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.9.1-transformers4.12.3-gpu-py38-cu111-ubuntu20.04

RUN pip uninstall torch -y
RUN pip uninstall torchvision -y

############# Detectron2 pre-built binaries Pytorch default install ############
RUN pip install --no-cache-dir --upgrade torch==1.9.1+cu111 torchvision==0.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html

############# Detectron2 section ##############
RUN pip install \
   --no-cache-dir pycocotools~=2.0.0 \
   --no-cache-dir https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/detectron2-0.6%2Bcu111-cp38-cp38-linux_x86_64.whl

ENV FORCE_CUDA="1"
# Build D2 only for Volta architecture - V100 chips (ml.p3 AWS instances)
# ENV TORCH_CUDA_ARCH_LIST="Volta"

# Set a fixed model cache directory. Detectron2 requirement
ENV FVCORE_CACHE="/tmp"
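
The detectron2 wheel URLs in this thread follow a `cu<CUDA>/torch<major.minor>` directory pattern, so the matching index can be derived from the torch and CUDA versions. This is a small sketch of that mapping. The function name is hypothetical, and not every version combination has prebuilt wheels, so treat the result as a candidate URL to verify.

```python
def detectron2_wheel_index(torch_version: str, cuda_version: str) -> str:
    """Build the detectron2 wheel index URL for a torch/CUDA pair.

    torch_version may carry a local suffix such as "+cu111"; only major.minor is used.
    """
    base = torch_version.split("+", 1)[0]          # "1.9.1+cu111" -> "1.9.1"
    major, minor = base.split(".")[:2]             # "1.9.1" -> "1", "9"
    cuda_tag = "cu" + cuda_version.replace(".", "")  # "11.1" -> "cu111"
    return (
        "https://dl.fbaipublicfiles.com/detectron2/wheels/"
        f"{cuda_tag}/torch{major}.{minor}/index.html"
    )
```

For torch 1.10.2 with CUDA 11.3 this yields the index URL used in the original reproduction steps, and the result can be passed to `pip install detectron2 -f <url>`.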

salmenhsairi, May 10 '22 09:05

Instead of uninstalling and reinstalling torch and torchvision, setting debugger_hook_config to False resolved the smdebug error.

This is with the latest Huggingface container (PyTorch 1.10 and CUDA 11.3):

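from sagemaker.huggingface import HuggingFace

# role, base_image_uri and hyperparameters are assumed to be defined earlier.
huggingface_estimator = HuggingFace(entry_point='train.py',
                                    source_dir='./scripts',
                                    instance_type='ml.p3.2xlarge',
                                    image_uri=base_image_uri,
                                    instance_count=1,
                                    role=role,
                                    transformers_version='4.17',
                                    pytorch_version='1.10',
                                    py_version='py38',
                                    # Disable the SageMaker Debugger (smdebug) hook
                                    debugger_hook_config=False,
                                    volume_size=50,
                                    hyperparameters=hyperparameters)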

d-v-dlee, May 11 '22 16:05