deep-learning-containers
[bug] Detectron2 errors when installing on PyTorch DLC
Checklist
- [X] I've prepended issue tag with type of change: [bug]
- [X] (If applicable) I've attached the script to reproduce the bug
- [X] (If applicable) I've documented below the DLC image/dockerfile this relates to
- [X] (If applicable) I've documented below the tests I've run on the DLC image
- [X] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
- [] I've built my own container based off DLC (and I've attached the code used to build my own image)
Concise Description:
Detectron2 raises an error on import when installed on top of the pytorch-training container. It appears to be related to smdebug.
How to reproduce:
> nvidia-docker run -it 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
> pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
> python -c "from detectron2 import model_zoo"
DLC image/dockerfile:
763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
Current behavior:
Traceback:
root@fe0954d71a8e:/# python -c "from detectron2 import model_zoo"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/detectron2/model_zoo/__init__.py", line 8, in <module>
    from .model_zoo import get, get_config_file, get_checkpoint_url, get_config
  File "/opt/conda/lib/python3.8/site-packages/detectron2/model_zoo/model_zoo.py", line 9, in <module>
    from detectron2.modeling import build_model
  File "/opt/conda/lib/python3.8/site-packages/detectron2/modeling/__init__.py", line 2, in <module>
    from detectron2.layers import ShapeSpec
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/__init__.py", line 2, in <module>
    from .batch_norm import FrozenBatchNorm2d, get_norm, NaiveSyncBatchNorm, CycleBatchNormList
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/batch_norm.py", line 4, in <module>
    from fvcore.nn.distributed import differentiable_all_reduce
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/__init__.py", line 4, in <module>
    from .focal_loss import (
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 52, in <module>
    sigmoid_focal_loss_jit: "torch.jit.ScriptModule" = torch.jit.script(sigmoid_focal_loss)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
    fn = torch._C._jit_script_compile(
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py", line 838, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
    fn = torch._C._jit_script_compile(
RuntimeError:
undefined value has_torch_function_variadic:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
    >>> loss.backward()
    """
    if has_torch_function_variadic(input, target, weight, pos_weight):
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return handle_torch_function(
            binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
    targets = targets.float()
    p = torch.sigmoid(inputs)
    ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    p_t = p * targets + (1 - p) * (1 - targets)
    loss = ce_loss * ((1 - p_t) ** gamma)
Expected behavior:
No error on import
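For anyone who wants to check their own image, here is a minimal sketch of an import smoke test in Python. It only wraps the same import as the repro command and reports whether the resulting traceback mentions smdebug. The helper names `check_import` and `mentions_smdebug` are made up for this example and are not part of detectron2, torch, or smdebug.

```python
import importlib
import traceback


def check_import(module_name):
    """Try to import a module and return (ok, details).

    details is the formatted traceback when the import fails, else "".
    """
    try:
        importlib.import_module(module_name)
    except Exception:
        # torch.jit raises RuntimeError here, so catch broadly, not just ImportError.
        return False, traceback.format_exc()
    return True, ""


def mentions_smdebug(details):
    """Heuristic: does the traceback text reference an smdebug file or module?"""
    return "smdebug" in details


# Example usage inside the container (hypothetical session):
#   ok, details = check_import("detectron2.model_zoo")
#   if not ok and mentions_smdebug(details):
#       print("import failed and the traceback points at smdebug")
```

A string match on the traceback is only a rough signal; it says the failure passed through an smdebug file, not why.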
I'm getting the same error. @austinmw, did you ever resolve this issue?
I had the same issue until I found this: https://github.com/aws-samples/amazon-sagemaker-pytorch-detectron2/issues/8. It seems you need to extend the DLC with the official torch and torchvision packages.
@salmenhsairi That is a known workaround, but really you shouldn't need to uninstall and reinstall torch.
@austinmw Is there another way to upgrade the DLC's existing, optimized torch build? Without one, the reinstall seems necessary, since detectron2 requires the complete package.
@salmenhsairi There is not currently. My point is that it would be ideal for the SageMaker version of Torch to not be modified in a way that breaks compatibility with other libraries.
I'm using the huggingface container: 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10-transformers4.17-gpu-py38-cu113-ubuntu20.04
I extended the Huggingface container with the following commands:
RUN pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
RUN python -m pip install detectron2 -f \
https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
The interesting thing is that for the Huggingface container, it says it already is up-to-date with Torch and Torchvision packages.
Got a similar error as Austin.
RuntimeError:
undefined value has_torch_function_variadic:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
>>> loss.backward()
"""
if has_torch_function_variadic(input, target, weight, pos_weight):
~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
return handle_torch_function(
binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
targets = targets.float()
p = torch.sigmoid(inputs)
ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
Guess I'll try the PyTorch container instead...
@d-v-dlee I don't think pip would report it as out of date, because the version still matches. The problem is that the package has been modified. You could uninstall and reinstall torch, though in your case there are already prebuilt Hugging Face containers that you can use.
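Since the version number alone won't reveal the modification, one rough way to check is to look for the file named in the traceback, `torch/utils/smdebug.py`. The sketch below does that without importing torch. It assumes the presence of that file is what distinguishes the modified build, which is inferred only from the traceback path in this thread, and `find_bundled_smdebug` is a name made up for this example.

```python
import importlib.util
from pathlib import Path


def find_bundled_smdebug(package="torch"):
    """Return the path to <package>/utils/smdebug.py if it exists, else None.

    find_spec locates a top-level package without importing it, so this
    works even when importing the package would fail.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / "utils" / "smdebug.py"
        if candidate.is_file():
            return candidate
    return None


# Example usage (hypothetical session inside the container):
#   path = find_bundled_smdebug()
#   print(path if path else "no torch/utils/smdebug.py found")
```

Treat the result as a hint only: a missing file does not prove the build is unmodified.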
@d-v-dlee I am also using a Hugging Face container, and the image below worked fine for me on an AWS ml.g4dn.xlarge instance. Try downloading torch from this link instead: https://download.pytorch.org/whl/torch_stable.html
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.9.1-transformers4.12.3-gpu-py38-cu111-ubuntu20.04
RUN pip uninstall torch -y
RUN pip uninstall torchvision -y
############# Detectron2 pre-built binaries Pytorch default install ############
RUN pip install --no-cache-dir --upgrade torch==1.9.1+cu111 torchvision==0.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
############# Detectron2 section ##############
RUN pip install \
--no-cache-dir pycocotools~=2.0.0 \
--no-cache-dir https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.9/detectron2-0.6%2Bcu111-cp38-cp38-linux_x86_64.whl
ENV FORCE_CUDA="1"
# Build D2 only for Volta architecture - V100 chips (ml.p3 AWS instances)
# ENV TORCH_CUDA_ARCH_LIST="Volta"
# Set a fixed model cache directory. Detectron2 requirement
ENV FVCORE_CACHE="/tmp"
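The torch pin and the detectron2 wheel URL have to agree on both the CUDA tag and the torch major.minor version (cu111 with torch1.9 above, cu113 with torch1.10 in the original report). As a small sketch, the index URL can be derived from a torch version string. This only follows the URL pattern seen in this thread and does not verify that a wheel is published for a given combination. The function name `detectron2_index_url` is made up for this example, and it assumes the version string carries a `+cuXYZ` suffix like the pins in the Dockerfile.

```python
import re

# URL pattern taken from the index links quoted in this thread.
_INDEX_TEMPLATE = "https://dl.fbaipublicfiles.com/detectron2/wheels/{cuda}/torch{major}.{minor}/index.html"


def detectron2_index_url(torch_version):
    """Build the detectron2 wheel index URL for a string like '1.10.2+cu113'."""
    match = re.fullmatch(r"(\d+)\.(\d+)(?:\.\d+)?\+(cu\d+)", torch_version)
    if match is None:
        raise ValueError(f"expected a version like '1.10.2+cu113', got {torch_version!r}")
    major, minor, cuda = match.groups()
    return _INDEX_TEMPLATE.format(cuda=cuda, major=major, minor=minor)


print(detectron2_index_url("1.10.2+cu113"))
```

The result can then be passed to `pip install detectron2 -f <url>`, as in the commands earlier in the thread.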
Instead of uninstalling and reinstalling torch and torchvision, setting debugger_hook_config to False resolved the smdebug error for me.
This is with the latest Huggingface container (pytorch 1.10 and cuda 11.3)
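from sagemaker.huggingface import HuggingFace

# role, base_image_uri, and hyperparameters are assumed to be defined elsewhere.
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    image_uri=base_image_uri,
    instance_count=1,
    role=role,
    transformers_version='4.17',
    pytorch_version='1.10',
    py_version='py38',
    debugger_hook_config=False,  # turn off the SageMaker Debugger (smdebug) hook
    volume_size=50,
    hyperparameters=hyperparameters,
)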