
ACCURACY ISSUE: NaN value is observed for training loss on GPU path

Open sriharikarnam opened this issue 7 years ago • 0 comments

Issue: An accuracy issue related to NaN loss is observed when executing the MXNet training application on GPU, whereas on CPU we get the expected output.

For example:

Training on CPU: executing the training application on CPU, i.e. launching the training with mx.cpu().

The output of the example adversary_generation.py:

[14:53:28] src/io/iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(100,1,28,28)
[14:53:30] src/io/iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(100,1,28,28)
Train Accuracy: 0.92    Train Loss: 0.28075
Train Accuracy: 0.98    Train Loss: 0.08431
Train Accuracy: 0.98    Train Loss: 0.05848
Train Accuracy: 0.99    Train Loss: 0.04576
('Val Batch Accuracy: ', 0.99)
('Val Batch Accuracy after pertubation: ', 0.02)

Training on GPU: similarly, executing the adversary application on GPU, by modifying the Python code to use mx.gpu(), we get the output below.

python adversary_generation.py

[15:21:34] src/io/iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(100,1,28,28)
[15:21:35] src/io/iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(100,1,28,28)
Train Accuracy: 0.10    Train Loss: nan
Train Accuracy: 0.10    Train Loss: nan
Train Accuracy: 0.10    Train Loss: nan
Train Accuracy: 0.10    Train Loss: nan
('Val Batch Accuracy: ', 0.13)
('Val Batch Accuracy after pertubation: ', 0.13)

From the above outputs we can see that the training accuracy and training loss update on CPU, whereas on GPU the training accuracy is stagnant and the training loss is NaN.
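A stagnant 0.10 accuracy is consistent with the NaN loss: with non-finite activations, the prediction over the 10 MNIST classes is effectively arbitrary, so accuracy sits at chance level (1/10). A minimal guard one could add to the training loop to fail fast instead of training through a NaN loss might look like this (check_loss is our name, not part of the example script):

```python
import math

def check_loss(loss, batch_idx):
    """Abort training early instead of iterating on a non-finite loss."""
    if not math.isfinite(loss):
        raise RuntimeError(
            "non-finite loss {} at batch {}".format(loss, batch_idx))
    return loss
```

In the training loop one would call check_loss(metric_value, i) right after computing the loss, so the first bad batch is reported instead of four epochs of nan.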

Observation:

While executing the adversary application we came across the model.forward() call.

On further debugging the application's Python code, we were able to trace the issue to forward() in the MXNet installation directory /usr/local/lib/python2.7/dist-packages/mxnet-0.9.5-py2.7.egg/mxnet/executor.py at line number 122. The output of the forward() function is NaN after executing.

The forward functionality is defined in the C++ source code in src/operator/convolution-inl.h of the MXNet code. Identifying what leads to NaN inside Forward() is complex. We suspect there might be a bad kernel causing the bug.
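One way to narrow down which operator produces the first NaN is to run the same batch on CPU and GPU, copy each intermediate output to the host, and report the first non-finite one. The sketch below assumes NumPy arrays (in MXNet one would obtain them via .asnumpy() on the executor outputs); first_nonfinite and the layer names are ours, purely illustrative:

```python
import numpy as np

def first_nonfinite(named_outputs):
    """Return the name of the first output containing NaN/Inf, else None.

    named_outputs: ordered (name, ndarray) pairs, in forward-pass order.
    """
    for name, arr in named_outputs:
        if not np.isfinite(arr).all():
            return name
    return None
```

Running this over the GPU outputs and comparing against the CPU run points at the first layer whose kernel misbehaves, rather than only observing NaN at the loss.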

The Forward() function definition:

virtual void Forward(const OpContext &ctx,
                     const std::vector<TBlob> &in_data,
                     const std::vector<OpReqType> &req,
                     const std::vector<TBlob> &out_data,
                     const std::vector<TBlob> &aux_args) {
  using namespace mshadow;
  using namespace mshadow::expr;
  CHECK_EQ(req[conv::kOut], kWriteTo);
  size_t expected = param_.no_bias ? 2 : 3;
  CHECK_EQ(in_data.size(), expected);
  CHECK_EQ(out_data.size(), 1U);
  CHECK_EQ(req[conv::kOut], kWriteTo);
  LayerSetUp(in_data[conv::kData].shape_, out_data[conv::kOut].shape_);
  Stream<xpu>* s = ctx.get_stream<xpu>();
  // allocate workspace for col_buffer
  Tensor<xpu, 1, DType> workspace = ctx.requested[conv::kTempSpace]
    .get_space_typed<xpu, 1, DType>(Shape1(col_buffer_size_), s);
  // calculate the shape of col_buffer
  TShape col_buffer_shape(num_spatial_axes_ + 1);
  col_buffer_shape[0] = conv_in_channels_ * param_.kernel.Size();
  for (index_t i = 1; i < col_buffer_shape.ndim(); ++i) {
    col_buffer_shape[i] = out_data[0].shape_[i+1];
  }
  // create a column buffer using workspace and col_buffer_shape
  TBlob col_buffer(workspace.dptr_, col_buffer_shape, xpu::kDevMask,
                   DataType<DType>::kFlag);

// initialize weight and col_buffer 3D tensors for using gemm
index_t M = conv_out_channels_ / group_;
index_t N = conv_out_spatial_dim_;
index_t K = kernel_dim_;
Tensor<xpu, 3, DType> weight_3d = in_data[conv::kWeight].get_with_shape<xpu, 3, DType>(
  Shape3(group_, M, K), s);
Tensor<xpu, 3, DType> col_buffer_3d = col_buffer.get_with_shape<xpu, 3, DType>(
  Shape3(group_, K, N), s);
Tensor<xpu, 4, DType> output_4d = out_data[conv::kOut].get_with_shape<xpu, 4, DType>(
  Shape4(num_, group_, M, N), s);
for (index_t n = 0; n < num_; ++n) {
  // transform image to col_buffer in order to use gemm
  im2col(s, in_data[conv::kData].dptr<DType>()+n*input_dim_, in_data[conv::kData].shape_,
         col_buffer.shape_, param_.kernel, param_.pad, param_.stride, param_.dilate,
         col_buffer.dptr<DType>());
  Tensor<xpu, 3, DType> output_3d = output_4d[n];
  for (index_t g = 0; g < group_; ++g) {
    ASSIGN_DISPATCH(output_3d[g], req[conv::kOut], dot(weight_3d[g], col_buffer_3d[g]));
  }
}
if (bias_term_) {
  Tensor<xpu, 1, DType> bias = in_data[conv::kBias].get<xpu, 1, DType>(s);
  Tensor<xpu, 3, DType> output_3d = out_data[conv::kOut].get_with_shape<xpu, 3, DType>(
    Shape3(num_, conv_out_channels_, conv_out_spatial_dim_), s);
  // has bias term, broadcast it to the same shape of output_3d in channel dim
  output_3d += mshadow::expr::broadcast<1>(bias, output_3d.shape_);
}

}
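The Forward() above is the classic im2col-plus-GEMM formulation of convolution: each image is unrolled into a column buffer, then multiplied against the reshaped weights per group. A minimal NumPy sketch of the same computation (single group, no bias, stride 1, no padding, no dilation; all names here are ours, not MXNet's) can serve as a host-side reference to diff suspect GPU outputs against:

```python
import numpy as np

def im2col(x, kh, kw):
    # x: (C, H, W) -> columns of shape (C*kh*kw, out_h*out_w),
    # mirroring the col_buffer filled by im2col() in Forward().
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w), dtype=x.dtype)
    row = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                cols[row] = x[c, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols

def conv_forward(x, weight):
    # weight: (M, C, kh, kw). The matmul below mirrors
    # dot(weight_3d[g], col_buffer_3d[g]) with group_ == 1.
    M, C, kh, kw = weight.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    col = im2col(x, kh, kw)            # (K, N) with K = C*kh*kw
    out = weight.reshape(M, -1) @ col  # GEMM: (M, K) x (K, N)
    return out.reshape(M, out_h, out_w)
```

Feeding the same input tensor through this reference and through the GPU operator makes it easy to see whether the NaN originates in the im2col step or in the GEMM.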

We have collected the list of all kernels being exercised in the example. The list of kernels executed was captured by exporting HIP_TRACE_API=2.

Please find the log of the list of kernels attached: kernels_list.txt
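For reference, the capture was along these lines (a sketch of an environment/config fragment; the exact tracing variables depend on the HIP/ROCm release in use):

```shell
# Enable HIP API tracing; level 2 also logs kernel dispatches on
# the hcc-era HIP used here.
export HIP_PLATFORM=hcc
export HIP_TRACE_API=2
# The trace goes to stderr; redirect it to a file for inspection.
python adversary_generation.py 2> kernels_list.txt
```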

Steps to Reproduce

  • $ git clone --recursive https://github.com/ROCmSoftwarePlatform/mxnet.git
  • $ cd mxnet
  • $ export HIP_PLATFORM=hcc
  • $ make -jN (where N = number of cores)
  • $ cd python
  • $ sudo python setup.py install
  • $ cd ../example/adversary
  • $ python adversary_generation.py

NOTE: to run on CPU modify the code to use mx.cpu(); to run on GPU modify it to use mx.gpu().

Please suggest how to identify the bad kernel(s) that might lead to NaN in the loss for this example.

sriharikarnam · Oct 23 '18 08:10