[bug] smdistributed is not included in HuggingFace training image

Open dbpprt opened this issue 1 year ago • 2 comments

Checklist

  • [x] I've prepended issue tag with type of change: [bug]
  • [ ] (If applicable) I've attached the script to reproduce the bug
  • [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [ ] (If applicable) I've documented below the tests I've run on the DLC image
  • [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description: smdistributed is not available.

ModuleNotFoundError: No module named 'smdistributed'
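No reproduction script was attached to this report. As a rough sketch, the missing module could be confirmed inside the container with a check like the one below. The helper name `module_available` is hypothetical, and the check only covers top-level module names.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    # find_spec returns None when no finder can locate the module.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # In the image from this report, this is expected to print False,
    # matching the ModuleNotFoundError above.
    print("smdistributed importable:", module_available("smdistributed"))
```

Run with the image's own Python interpreter so the result reflects the environment the training job actually uses.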

DLC image/dockerfile: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04

Current behavior:

Expected behavior:

Additional context: Installing it manually gives the following error:

ErrorMessage "ImportError: libsmddpcpp.so: cannot open shared object file: No such file or directory

Wheel installed from: https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl
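One way to narrow down the ImportError is to check whether `libsmddpcpp.so` exists anywhere in the environment after the manual install. The following is a minimal diagnostic sketch, not a fix. The helper `find_shared_object` is hypothetical, and searching only site-packages is an assumption: the file may live elsewhere or may not ship in the wheel at all.

```python
import sysconfig
from pathlib import Path


def find_shared_object(name, roots):
    """Recursively search the given directories for files named `name`."""
    hits = []
    for root in roots:
        root = Path(root)
        if root.is_dir():
            # rglob walks the whole tree below root.
            hits.extend(sorted(root.rglob(name)))
    return hits


if __name__ == "__main__":
    # Assumption: the wheel installs into the interpreter's site-packages.
    site_packages = sysconfig.get_paths()["purelib"]
    for path in find_shared_object("libsmddpcpp.so", [site_packages]):
        print(path)
```

If the file is found, the dynamic loader is probably not searching its directory. If nothing is printed, the file is likely absent from the searched location.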

dbpprt • Jun 05 '24 10:06