
[feature-request] [design clarification] Is it possible to have a single Docker image for both training and inference from MXNet 1.8 onwards?

Open MaheshGoudT opened this issue 5 years ago • 3 comments

Checklist

  • [x] I've prepended issue tag with type of change: [feature]
  • [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [x] (If applicable) I've documented the tests I've run on the DLC image
  • [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description: Up to MXNet 1.6, the same image could be used for both training and inference. Starting with MXNet 1.7, separate Docker images are recommended for training and inference, per https://github.com/aws/deep-learning-containers/blob/master/available_images.md. Based on my discussions with Sandeep Krishnamurthy within Amazon, I learned that "the inference image for MXNet 1.7 is optimized with MKL BLAS; Intel merged MKL's implementation of the BLAS operations into oneDNN, which is used by default by MXNet on CPU". I'd appreciate any help with the following questions:

  1. Is the reason for having separate training and inference images for MXNet 1.7 primarily the MKL BLAS optimization, or are there other reasons?

  2. From MXNet 1.8 onwards, should we expect separate images or a single image for training and inference?
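For what it's worth, one way to see how the two images' builds differ is to print MXNet's compile-time feature flags inside each container. A minimal sketch, assuming the mxnet.runtime Features API available in MXNet 1.5+ and that these builds expose feature names such as MKLDNN, BLAS_MKL and DIST_KVSTORE (not values confirmed in this issue):

```python
# Minimal sketch (not from this issue): query MXNet's compile-time feature
# flags inside each DLC image to compare the builds. Assumes the
# mxnet.runtime feature API (MXNet 1.5+) and that the listed feature names
# exist in these builds.
import mxnet as mx
from mxnet.runtime import Features

features = Features()
print("MXNet version:", mx.__version__)
for name in ("MKLDNN", "BLAS_MKL", "BLAS_OPEN", "DIST_KVSTORE", "CUDA", "CUDNN"):
    print(name, "->", features.is_enabled(name))
```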

DLC image/dockerfile: MXNet 1.7 (763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.7.0-gpu-py36-cu101-ubuntu16.04)

Is your feature request related to a problem? Please describe. When I tried to upgrade our pipelines to MXNet 1.7, our SageMaker training jobs stayed in "In progress" mode and never ended. We call sys.exit(0) at the end of the training job to signal a successful run to SageMaker. Although our training job ran to completion and produced all the required logs, the SageMaker training job remained "In progress" indefinitely because we were using the MXNet 1.7 GPU inference image (763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.7.0-gpu-py36-cu101-ubuntu16.04) for training on a CPU instance (ml.m5.4xlarge). Our pipelines are built such that only one image can be used across training and inference. We will add support for separate training/inference images on our end over time, but for now we plan to stay on MXNet 1.6, since that image supports both training and inference and has worked smoothly for us in the above setting. It would be great if there were a single image that supports both training and inference from MXNet 1.8 onwards.

Describe the solution you'd like If possible, a single image that supports both training and inference (as with MXNet 1.6).
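For illustration, pinning one explicit image on the training side with the SageMaker Python SDK might look roughly like the sketch below (v2-style generic Estimator; the image URI, role ARN and S3 paths are placeholders, not values from our pipelines):

```python
# Rough sketch: pin a single, explicit DLC image for a training job with the
# SageMaker Python SDK (v2 API). The image URI, role ARN and S3 channel are
# placeholders for illustration only.
from sagemaker.estimator import Estimator

image_uri = "<single MXNet DLC image URI from available_images.md>"   # placeholder
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"        # placeholder

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
)
estimator.fit({"train": "s3://example-bucket/train/"})  # placeholder S3 input
```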

MaheshGoudT avatar Nov 10 '20 07:11 MaheshGoudT

@saimidu @anankira FYI

samskalicky avatar Nov 10 '20 19:11 samskalicky

Hi @MaheshGoudT,

  1. The inference image for MXNet 1.7 is optimized with MKL BLAS, while the training image is not. The training image also has distributed-training optimizations for training speed. This is the main difference between the two images.

  2. Currently there is no plan to use the same image for both training and inference (this is a question for SageMaker). The training image should be able to run inference jobs (unless you are using SM's DSL, in which case there may be checks on the SM side that prevent this). The DLC images for MXNet 1.8 will be released soon; we can't comment on how SM will provide those to their customers.
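For example, a quick way to confirm that the training image can run inference from your own code (outside SageMaker hosting) might be a smoke test like this sketch, where the tiny network is just a stand-in for a real trained model:

```python
# Rough smoke test: run plain MXNet inference inside the *training* image to
# check that inference from user code works there, independent of SageMaker
# hosting. The tiny Dense block is a stand-in for a real trained model.
import mxnet as mx
from mxnet.gluon import nn

net = nn.Dense(2, in_units=4)
net.initialize()
net.save_parameters("/tmp/model.params")        # mimic a trained artifact

restored = nn.Dense(2, in_units=4)
restored.load_parameters("/tmp/model.params")
print(restored(mx.nd.random.uniform(shape=(1, 4))).asnumpy())
```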

access2rohit avatar Nov 17 '20 00:11 access2rohit

Wondering if we can loop in someone from the SM team to comment on whether we should expect different or the same images for training and inference with MXNet 1.8. The problem I mentioned in the description happened when we used the inference image for training with MXNet 1.7.

MaheshGoudT avatar Nov 17 '20 03:11 MaheshGoudT