MNIST model CPU training broken in TF v2.7 (conda_tensorflow2_p38 kernel on NBI ALv2 JLv3)

Open athewsey opened this issue 2 years ago • 0 comments

The current conda_tensorflow2_p38 kernel on the latest SageMaker Notebook Instance platform (notebook-al2-v2, as used in the CFn template) seems to break local CPU-only training for the MNIST migration challenge.

In this environment (TF v2.7.1, TF.Keras v2.7.0), tensorflow.keras.backend.image_data_format() returns channels_first, but training fails because MaxPoolingOp only supports channels_last (NHWC) on CPU, per the error message below:

InvalidArgumentError:  Default MaxPoolingOp only supports NHWC on device type CPU
	 [[node sequential/max_pooling2d/MaxPool
 (defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/layers/pooling.py:357)
]] [Op:__inference_train_function_862]

Errors may have originated from an input operation.
Input Source operations connected to node sequential/max_pooling2d/MaxPool:
In[0] sequential/conv2d_1/Relu (defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/backend.py:4867)

Overriding the image_data_format() check (in "Pre-Process the Data for our CNN") to prepare the data in a different shape does not work either, because the model is then incompatible with that data (it raises a ValueError in conv2d_2).
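
One possible workaround, not verified in this environment, would be to switch the Keras default to channels_last before the data is pre-processed and before any layers are created, so that the data and the model agree on the layout. The sketch below is only an illustration: to_channels_last and force_channels_last are hypothetical helper names that do not exist in the workshop notebook, and the (8, 1, 28, 28) batch is dummy data standing in for MNIST images. tf.keras.backend.set_image_data_format() is an existing Keras backend call, imported lazily here so the NumPy part runs on its own.

```python
import numpy as np


def to_channels_last(x, data_format):
    """Return a 4-D image batch in NHWC (channels_last) layout.

    Hypothetical helper, not part of the workshop notebook.
    """
    if x.ndim != 4:
        raise ValueError(f"expected a 4-D batch, got shape {x.shape}")
    if data_format == "channels_first":
        # NCHW -> NHWC: move the channel axis to the end.
        return np.transpose(x, (0, 2, 3, 1))
    if data_format == "channels_last":
        return x
    raise ValueError(f"unknown data format: {data_format!r}")


def force_channels_last():
    """Set the Keras default layout to channels_last and return the new setting.

    Assumption: this is called before the model's layers are constructed,
    since layers pick up the default data format when they are created.
    """
    import tensorflow as tf  # imported lazily; only needed for this step

    tf.keras.backend.set_image_data_format("channels_last")
    return tf.keras.backend.image_data_format()


# Dummy batch of 8 single-channel 28x28 images in channels_first layout.
batch_nchw = np.zeros((8, 1, 28, 28), dtype="float32")
batch_nhwc = to_channels_last(batch_nchw, "channels_first")
print(batch_nhwc.shape)  # (8, 28, 28, 1)
```

In the notebook, force_channels_last() would need to run ahead of the "Pre-Process the Data for our CNN" step, with the matching input_shape of (28, 28, 1) passed to the first Conv2D layer. Whether this avoids the ValueError in conv2d_2 has not been confirmed.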

This still seems to work fine in the current SMStudio kernel (TensorFlow v2.3.2, TF.Keras v2.4.0).

athewsey, Jun 23 '22 05:06