MLOps

Not sure if this is a TensorFlow issue or Docker issue

Open · chattertonc09 opened this issue · 2 comments

I'm getting a strange error on one of my embedding layers when using this with Keras.

```
restype:container
2019-08-14 21:00:10,145|azureml.core.authentication|DEBUG|Time to expire 604466.854539 seconds
2019-08-14 |azureml.history._tracking.PythonWorkingDirectory.workingdir|DEBUG|Calling pyfs
2019-08-14 21:00:29,324|azureml.history._tracking.PythonWorkingDirectory|INFO|Current working dir: /mnt/batch/tasks/....
2019-08-14 21:00:29,324|azureml.WorkingDirectoryCM|ERROR|<class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>: indices[8,0] = 565 is not in [0, 562)
	 [[node master_Embedding/GatherV2 (defined at /azureml-envs/azureml/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1211) ]]
```

```
InvalidArgumentError (see above for traceback): indices[8,0] = 565 is not in [0, 562)
	 [[node broker_master_Embedding/GatherV2 (defined at /azureml-envs/azureml_d582dd13e83051343c8ab0e51ab5a504/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1211) ]]
```
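For context, this `InvalidArgumentError` usually means the `Embedding` layer's `input_dim` is smaller than the largest token id fed to it: the layer can only gather rows in `[0, input_dim)`, and here index 565 falls outside `[0, 562)`. A minimal sketch of the invariant (plain Python, no TensorFlow, with made-up token ids), assuming the vocabulary therefore needs to be sized as `max(index) + 1`:

```python
def check_embedding_indices(indices, input_dim):
    """Mimic the bounds check behind Embedding/GatherV2:
    every index must lie in [0, input_dim)."""
    bad = [i for i in indices if not 0 <= i < input_dim]
    if bad:
        raise ValueError(
            f"indices {bad} not in [0, {input_dim}); "
            f"set input_dim to at least {max(indices) + 1}"
        )
    return True

# Token ids seen at training time (hypothetical).
train_ids = [3, 17, 561]
check_embedding_indices(train_ids, input_dim=562)   # OK

# An unseen id of 565 reproduces the failure mode above.
try:
    check_embedding_indices(train_ids + [565], input_dim=562)
except ValueError as e:
    print(e)
```

In practice this tends to happen when the tokenizer/encoder is fit on a different split than the one being scored, so validation data contains ids the embedding was never sized for.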

Any ideas?

The driver_log.txt shows:

```
WARNING - From /azureml-envs/azureml_d582dd13e83051343c8ab0e51ab5a504/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating: Use tf.cast instead.
Train on 72626 samples, validate on 4035 samples
Epoch 1/100
2019-08-14 21:00:15.382966: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-14 21:00:15.388250: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2596990000 Hz
2019-08-14 21:00:15.388560: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55dbbf606c20 executing computations on platform Host. Devices:
2019-08-14 21:00:15.388579: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
```

— chattertonc09, Aug 14 '19

Hi, which script are you trying to run in this repo? That will help us debug.

— jpe316, Aug 27 '19

I am using train.py, where I've added a Keras MLP neural network as a Python class. I got past this by looking at AmlPipelines.py: the train script was being run with a PythonScriptStep, and I changed it to use a TensorFlow Estimator and an EstimatorStep. The issue I get now is with running with GPU support. If I enable GPU support for 4 nodes and then use Keras's multi_gpu_model like this, it fails because it does not recognize all the available GPUs on the cluster. I'm not sure if this is because of the version of TensorFlow or of CUDA:

```python
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
```
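One thing worth checking: `multi_gpu_model` replicates the model across GPUs visible to a single process/machine, so a 4-node cluster where each node exposes one GPU will never satisfy `gpus=8` (or even `gpus=4`) inside one process. A small, hypothetical guard (plain Python; the function name is invented here) that clamps the request to what the current node actually sees, rather than hard-coding 8:

```python
def clamp_gpu_request(requested, visible):
    """Return a safe value for multi_gpu_model's `gpus` argument.

    multi_gpu_model raises if `gpus` exceeds the devices visible to
    this process, and needs at least 2 to be meaningful; fall back to
    single-device training otherwise.
    """
    if visible < 2:
        return 1  # skip multi_gpu_model entirely
    return min(requested, visible)

# On a node with 4 visible GPUs, an 8-GPU request is clamped to 4.
print(clamp_gpu_request(8, 4))   # 4
# On a 1-GPU (or CPU-only) node, train without replication.
print(clamp_gpu_request(8, 1))   # 1
```

If multi-node training is the actual goal, single-machine replication via `multi_gpu_model` won't get there regardless of the TensorFlow/CUDA versions; that needs a distributed strategy instead.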

— chattertonc09, Aug 27 '19