pytorch-cpp
Got different result after BN1
I converted a custom resnet18 model's weights to h5 and loaded them into the cpp version of the model, which I created based on your sample code. I also inspected the weights before and after loading, and they are equal. Then I did a forward pass with a dummy input. The output of the first CONV layer matched the pytorch version, while the output of the first BN layer did not. I can ensure the weights of the BN layer are the same.
It looks like there is a problem in the BN layer.
I was wondering whether it is related to a precision difference introduced when I converted the pth.tar to h5?
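(As an aside for anyone hitting a similar mismatch: a minimal Python sketch of how one might check whether a layer-output difference is just float precision, assuming the first-BN outputs from both models have been dumped to numpy arrays beforehand; the file names are placeholders.)

import numpy as np
import torch

# Hypothetical dumps of the first BN output from the PyTorch model and
# from the pytorch-cpp model, saved earlier as .npy files.
out_pytorch = np.load("bn1_out_pytorch.npy")
out_cpp = np.load("bn1_out_cpp.npy")

# A pure float32 precision difference is tiny (roughly 1e-6 level);
# a train/eval mismatch in BatchNorm is usually orders of magnitude larger.
print("max abs diff:", np.abs(out_pytorch - out_cpp).max())
print("allclose (atol=1e-5):",
      torch.allclose(torch.from_numpy(out_pytorch),
                     torch.from_numpy(out_cpp),
                     atol=1e-5))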
Hi @pharrellyhy
Did you check it using the last cell here: https://github.com/warmspringwinds/pytorch-cpp/blob/master/convert_weights.ipynb
The results will differ a bit.
Also make sure that you have switched your PyTorch model into eval mode before evaluating.
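(To illustrate why eval mode matters here, a minimal sketch using a toy Conv + BatchNorm stack rather than the actual resnet18: in train mode BatchNorm normalizes with per-batch statistics, in eval mode with its stored running statistics, so the outputs generally differ between the two modes.)

import torch
import torch.nn as nn

# Toy stand-in for the first conv + BN of the network.
model = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
)
dummy_input = torch.ones(20, 1, 112, 112)

model.train()                      # BN uses per-batch statistics
out_train = model(dummy_input)

model.eval()                       # BN uses its running statistics
with torch.no_grad():
    out_eval = model(dummy_input)

# These generally do not match, which is why a model accidentally left in
# train mode will not reproduce the exported (eval-mode) results.
print(torch.allclose(out_train, out_eval))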
Thanks, @warmspringwinds.
I forgot to switch to eval mode and now the results are equal. Cheers!
There is another problem with the cpp version: it is a few times slower than the pytorch version. The forward pass for the pytorch version with an input tensor of ones((20, 1, 112, 112)) takes about 5ms, while the cpp version takes about 35ms. Can you give me some advice on this part? Thanks!
@pharrellyhy Could you please provide me with the code that you use to benchmark both the pytorch and pytorch-cpp versions?
It's really important to do it right.
@pharrellyhy
Pytorch-cpp: https://github.com/warmspringwinds/pytorch-cpp/blob/master/examples/resnet_18_8s_benchmark.cpp#L55
Pytorch: https://github.com/warmspringwinds/pytorch-segmentation-detection/blob/master/pytorch_segmentation_detection/recipes/pascal_voc/segmentation/resnet_18_8s_benchmark.ipynb
cell number 4 in the above ipynb notebook
@warmspringwinds
@warmspringwinds Yes, the method to benchmark the cpp version is the same as yours. Here is the code snippet:
auto net = torch::resnet18();
net->load_weights("../checkpoints/resnet18_no_drop.h5");
net->cuda(); // put the net on CUDA after loading the weights

Tensor dummy_input = CUDA(kFloat).ones({20, 1, 112, 112});

high_resolution_clock::time_point t1;
high_resolution_clock::time_point t2;

// Warm-up pass (not counted towards the average).
cudaDeviceSynchronize();
t1 = high_resolution_clock::now();
auto result = net->net_forward(dummy_input);
cudaDeviceSynchronize();
t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();

// Now running in a loop and averaging the result.
int number_of_iterations = 20;
int overall_milliseconds_count = 0;

for (int i = 0; i < number_of_iterations; ++i) {
    t1 = high_resolution_clock::now();
    result = net->net_forward(dummy_input);
    cudaDeviceSynchronize();
    t2 = high_resolution_clock::now();
    duration = duration_cast<milliseconds>(t2 - t1).count();
    overall_milliseconds_count += duration;
}

cout << "Average execution time: "
     << overall_milliseconds_count / float(number_of_iterations)
     << " ms" << endl;

This gives a result of about 35ms.
For pytorch, I used LineProfiler. Here is the code snippet:
from line_profiler import LineProfiler

def profile(follow=[]):
    # Decorator that line-profiles the wrapped function (and any functions
    # listed in `follow`) and prints the stats after each call.
    def inner(fn):
        def profiled_fn(*args, **kwargs):
            try:
                profiler = LineProfiler()
                profiler.add_function(fn)
                for f in follow:
                    profiler.add_function(f)
                profiler.enable_by_count()
                return fn(*args, **kwargs)
            finally:
                profiler.print_stats()
        return profiled_fn
    return inner
And here is the result:
Timer unit: 1e-06 s
Total time: 17.57 s
Function: predict_whole_img_with_label at line 96
Line # Hits Time Per Hit % Time Line Contents
==============================================================
96 @profile()
97 def predict_whole_img_with_label(self, label_path, num_x_window,
98 num_y_window, stride, crop_size, output_path):
99 1 739.0 739.0 0.0 self.model.eval()
100
101 1 8222.0 8222.0 0.0 keypoints_frame = pd.read_csv(label_path, header=None)
102 1 24.0 24.0 0.0 dirname = os.path.dirname(label_path)
103
104 1 2.0 2.0 0.0 total_false_positives = 0
105 1 1.0 1.0 0.0 total_false_negatives = 0
106
107 # for i in tqdm(range(len(keypoints_frame)), desc='', ncols=80, leave=True):
108 366 867.0 2.4 0.0 for i in range(len(keypoints_frame)):
109 365 36940.0 101.2 0.2 img_path = os.path.join(dirname, keypoints_frame.iloc[i, 0])
110 365 10179.0 27.9 0.1 print('\nPredicting on:', img_path)
111
112 365 456691.0 1251.2 2.6 keypoints = keypoints_frame.iloc[i, 1:3].values
113 365 6373.0 17.5 0.0 np_keypoints = keypoints.astype('int').reshape(-1, 2).squeeze()
114 # label = keypoints_frame.iloc[i, 3].astype('int')
115
116 365 5167995.0 14158.9 29.4 img = mpimg.imread(img_path)
117
118 365 1283.0 3.5 0.0 batch_size = num_x_window * num_y_window
119 365 30505.0 83.6 0.2 inputs = torch.zeros((batch_size, 1, 112, 112))
120
121 365 1112.0 3.0 0.0 window_size = (crop_size, crop_size)
122 # make batch images from sliding window op
123 365 968.0 2.7 0.0 for i, cropped in enumerate(sliding_window(img,
124 7665 3152303.0 411.3 17.9 stride, window_size, is_grayscale=True)):
125 7300 337257.0 46.2 1.9 inputs[i] = torch.from_numpy(cropped)
126
127 # puts on CUDA
128 365 85317.0 233.7 0.5 inputs = inputs.float().to(DEVICE)
129
130 # forward pass
131 365 2134745.0 5848.6 12.1 loc_out, label_out = self.model(inputs)
You can see from the last line that the per-hit time is 5848us, which is 5.85ms. The batch size used here is 20, so the inputs have size (20, 1, 112, 112).
@pharrellyhy Before we go further, I can see that you have torch.zeros((batch_size, 1, 112, 112)). Most of my models work with an input dimension of 3, which stands for the RGB channels. How did you make resnet work with fewer channels?
@pharrellyhy And one more thing -- instead of doing the sliding window approach and benchmarking it there, isolate the code like in my ipynb example, with just one line for inference.
@warmspringwinds Yes, I use a custom resnet whose input is a grayscale image. Actually, the input is not generated directly by the sliding window; I concatenate all the cropped images into one 4D tensor, so the forward pass is indeed a single line of inference.
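(For context: one common way to get a resnet18 that takes single-channel input is to replace the stem convolution of the torchvision model; this is a generic sketch, not necessarily how the custom network in this thread was built.)

import torch
import torch.nn as nn
from torchvision import models

# Standard resnet18, with the 3-channel stem conv swapped for a 1-channel one.
# Recent torchvision uses adaptive average pooling, so 112x112 inputs are fine.
model = models.resnet18(num_classes=2)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

model.eval()
with torch.no_grad():
    out = model(torch.ones(20, 1, 112, 112))
print(out.shape)  # torch.Size([20, 2])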
@pharrellyhy Ok, please, could you benchmark your pytorch code like I did here: https://github.com/warmspringwinds/pytorch-segmentation-detection/blob/master/pytorch_segmentation_detection/recipes/pascal_voc/segmentation/resnet_18_8s_benchmark.ipynb (bottom)
Remove all the application-specific code and leave just a simple inference line, then benchmark it. I compared these timings recently for batch sizes of 1 and 2, and they were the same for the pytorch and cpp versions.
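(For reference, a minimal sketch of what such an isolated, synchronized timing could look like in PyTorch; the model construction below is just a grayscale-resnet18 stand-in, not code from the linked notebook.)

import time
import torch
import torch.nn as nn
from torchvision import models

# Stand-in for the custom grayscale resnet18.
model = models.resnet18(num_classes=2)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model = model.cuda().eval()

dummy_input = torch.ones(20, 1, 112, 112, device='cuda')

# Warm-up pass so that kernel selection/initialization is not measured.
with torch.no_grad():
    model(dummy_input)
torch.cuda.synchronize()

iterations = 20
start = time.time()
with torch.no_grad():
    for _ in range(iterations):
        model(dummy_input)
# Synchronize before stopping the clock, since CUDA launches are asynchronous.
torch.cuda.synchronize()
elapsed_ms = (time.time() - start) / iterations * 1000
print('Average execution time: {:.2f} ms'.format(elapsed_ms))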
@warmspringwinds Yes, I can do that. But the point is that if ATen handles CUDA operations properly, a batch size of 1 or 20 should give a similar result.
@pharrellyhy The timing for the pytorch and cpp versions should be similar -- you are right. Let's make sure that you set up the timing experiment properly in the pytorch case. If the timings still differ, I will dig into this.
@warmspringwinds Sure, I will let you know once I get the result. Thanks.
@warmspringwinds Here is the result I got.
dummy_input = torch.ones((1,1,112,112))
%%timeit
loc_out, label_out = model(dummy_input)
3.04 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
dummy_input = torch.ones((20,1,112,112))
%%timeit
loc_out, label_out = model(dummy_input)
9.45 ms ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@warmspringwinds I've checked the source code and saw there is a for loop over the batch in the CONV function: for (int elt = 0; elt < batchSize; elt++). I am not very familiar with CUDA, but I know that if we don't handle CUDA streams properly, the default stream executes sequentially.
@pharrellyhy I will check it for my resnet and a bigger batch size. There is a chance that this might have been fixed in a newer version of ATen.
@warmspringwinds Thanks. Probably... but I'm having a hard time compiling it. If you make any progress, please let me know. :)
@pharrellyhy will do. What version of pytorch did you use?
@warmspringwinds 0.4.0
@warmspringwinds The latest ATen APIs have changed a lot. We might need a big update if we switch to the latest ATen.
@warmspringwinds Any progress? :D