pytorch-cpp
Got different result after BN1
I converted a custom resnet18 model's weights to h5 and loaded them into the cpp version of the model, which I created based on your sample code. I also inspected the weights before and after loading, and they are equal. Then I did a forward pass with a dummy input. The output of the first CONV layer matched the pytorch version, while the output of the first BN layer did not. I can ensure the weights of the BN layer are the same.
It looks like there is a problem in the BN layer.
I was wondering whether it is related to a precision difference introduced when I converted the pth.tar to h5?
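(As an aside for anyone hitting a similar mismatch: a minimal Python sketch of how one might check whether a layer-output difference is just float precision, assuming the first-BN outputs from both models have been dumped to numpy arrays beforehand; the file names are placeholders.)

import numpy as np
import torch

# Hypothetical dumps of the first BN output from the PyTorch model and
# from the pytorch-cpp model, saved earlier as .npy files.
out_pytorch = np.load("bn1_out_pytorch.npy")
out_cpp = np.load("bn1_out_cpp.npy")

# A pure float32 precision difference is tiny (roughly 1e-6 level);
# a train/eval mismatch in BatchNorm is usually orders of magnitude larger.
print("max abs diff:", np.abs(out_pytorch - out_cpp).max())
print("allclose (atol=1e-5):",
      torch.allclose(torch.from_numpy(out_pytorch),
                     torch.from_numpy(out_cpp),
                     atol=1e-5))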
Hi @pharrellyhy
Did you check it using the last cell here: https://github.com/warmspringwinds/pytorch-cpp/blob/master/convert_weights.ipynb
The results will differ a bit.
Also make sure that you have switched your PyTorch model into eval mode before evaluating.
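(To illustrate why eval mode matters here, a minimal sketch using a toy Conv + BatchNorm stack rather than the actual resnet18: in train mode BatchNorm normalizes with per-batch statistics, in eval mode with its stored running statistics, so the outputs generally differ between the two modes.)

import torch
import torch.nn as nn

# Toy stand-in for the first conv + BN of the network.
model = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
)
dummy_input = torch.ones(20, 1, 112, 112)

model.train()                      # BN uses per-batch statistics
out_train = model(dummy_input)

model.eval()                       # BN uses its running statistics
with torch.no_grad():
    out_eval = model(dummy_input)

# These generally do not match, which is why a model accidentally left in
# train mode will not reproduce the exported (eval-mode) results.
print(torch.allclose(out_train, out_eval))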
Thanks, @warmspringwinds.
I forgot to switch to eval mode and now the results are equal. Cheers!
There is another problem with the cpp version: it is a few times slower than the pytorch version. The forward pass for the pytorch version with an input tensor of ones((20, 1, 112, 112)) takes about 5ms, while the cpp version takes about 35ms. Can you give me some advice on this part? Thanks!
@pharrellyhy Could you please provide me with the code that you use to benchmark both the pytorch and pytorch-cpp versions?
It's really important to do it right.
@pharrellyhy
Pytorch-cpp: https://github.com/warmspringwinds/pytorch-cpp/blob/master/examples/resnet_18_8s_benchmark.cpp#L55
Pytorch: https://github.com/warmspringwinds/pytorch-segmentation-detection/blob/master/pytorch_segmentation_detection/recipes/pascal_voc/segmentation/resnet_18_8s_benchmark.ipynb
cell number 4 in the above ipynb notebook
@warmspringwinds
@warmspringwinds Yes, the method to benchmark the cpp version is the same as yours. Here is the code snippet:
auto net = torch::resnet18();
net->load_weights("../checkpoints/resnet18_no_drop.h5");
net->cuda(); // put the net on CUDA after loading the weights

Tensor dummy_input = CUDA(kFloat).ones({20, 1, 112, 112});

high_resolution_clock::time_point t1;
high_resolution_clock::time_point t2;

// Warm-up pass (not counted towards the average).
cudaDeviceSynchronize();
t1 = high_resolution_clock::now();
auto result = net->net_forward(dummy_input);
cudaDeviceSynchronize();
t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();

// Now running in a loop and averaging the result.
int number_of_iterations = 20;
int overall_milliseconds_count = 0;

for (int i = 0; i < number_of_iterations; ++i) {
    t1 = high_resolution_clock::now();
    result = net->net_forward(dummy_input);
    cudaDeviceSynchronize();
    t2 = high_resolution_clock::now();
    duration = duration_cast<milliseconds>(t2 - t1).count();
    overall_milliseconds_count += duration;
}

cout << "Average execution time: "
     << overall_milliseconds_count / float(number_of_iterations)
     << " ms" << endl;

This gives a result of about 35ms.
For pytorch, I used LineProfiler. Here is the code snippet:
from line_profiler import LineProfiler

def profile(follow=[]):
    # Decorator that line-profiles the wrapped function (and any functions
    # listed in `follow`) and prints the stats after each call.
    def inner(fn):
        def profiled_fn(*args, **kwargs):
            try:
                profiler = LineProfiler()
                profiler.add_function(fn)
                for f in follow:
                    profiler.add_function(f)
                profiler.enable_by_count()
                return fn(*args, **kwargs)
            finally:
                profiler.print_stats()
        return profiled_fn
    return inner
And here is the result:
Timer unit: 1e-06 s
Total time: 17.57 s
Function: predict_whole_img_with_label at line 96
Line # Hits Time Per Hit % Time Line Contents
==============================================================
96 @profile()
97 def predict_whole_img_with_label(self, label_path, num_x_window,
98 num_y_window, stride, crop_size, output_path):
99 1 739.0 739.0 0.0 self.model.eval()
100
101 1 8222.0 8222.0 0.0 keypoints_frame = pd.read_csv(label_path, header=None)
102 1 24.0 24.0 0.0 dirname = os.path.dirname(label_path)
103
104 1 2.0 2.0 0.0 total_false_positives = 0
105 1 1.0 1.0 0.0 total_false_negatives = 0
106
107 # for i in tqdm(range(len(keypoints_frame)), desc='', ncols=80, leave=True):
108 366 867.0 2.4 0.0 for i in range(len(keypoints_frame)):
109 365 36940.0 101.2 0.2 img_path = os.path.join(dirname, keypoints_frame.iloc[i, 0])
110 365 10179.0 27.9 0.1 print('\nPredicting on:', img_path)
111
112 365 456691.0 1251.2 2.6 keypoints = keypoints_frame.iloc[i, 1:3].values
113 365 6373.0 17.5 0.0 np_keypoints = keypoints.astype('int').reshape(-1, 2).squeeze()
114 # label = keypoints_frame.iloc[i, 3].astype('int')
115
116 365 5167995.0 14158.9 29.4 img = mpimg.imread(img_path)
117
118 365 1283.0 3.5 0.0 batch_size = num_x_window * num_y_window
119 365 30505.0 83.6 0.2 inputs = torch.zeros((batch_size, 1, 112, 112))
120
121 365 1112.0 3.0 0.0 window_size = (crop_size, crop_size)
122 # make batch images from sliding window op
123 365 968.0 2.7 0.0 for i, cropped in enumerate(sliding_window(img,
124 7665 3152303.0 411.3 17.9 stride, window_size, is_grayscale=True)):
125 7300 337257.0 46.2 1.9 inputs[i] = torch.from_numpy(cropped)
126
127 # puts on CUDA
128 365 85317.0 233.7 0.5 inputs = inputs.float().to(DEVICE)
129
130 # forward pass
131 365 2134745.0 5848.6 12.1 loc_out, label_out = self.model(inputs)
You can see from the last line that the per-hit time is 5848us, which is 5.85ms. The batch size used here is 20, so the inputs have size (20, 1, 112, 112).
@pharrellyhy Before we go further, I can see that you have torch.zeros((batch_size, 1, 112, 112)). Most of my models work with an input dimension of 3, which stands for the RGB channels. How did you make resnet work with fewer channels?
@pharrellyhy And one more thing -- instead of doing the sliding window approach and benchmarking it there, isolate the code like in my ipynb example, with just one line for inference.
@warmspringwinds Yes, I use a custom resnet whose input is a grayscale image. Actually, the input is not generated directly by the sliding window; I concatenate all the cropped images into one 4D tensor, so the forward pass is indeed a single line of inference.
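(For context: one common way to get a resnet18 that takes single-channel input is to replace the stem convolution of the torchvision model; this is a generic sketch, not necessarily how the custom network in this thread was built.)

import torch
import torch.nn as nn
from torchvision import models

# Standard resnet18, with the 3-channel stem conv swapped for a 1-channel one.
# Recent torchvision uses adaptive average pooling, so 112x112 inputs are fine.
model = models.resnet18(num_classes=2)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

model.eval()
with torch.no_grad():
    out = model(torch.ones(20, 1, 112, 112))
print(out.shape)  # torch.Size([20, 2])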
@pharrellyhy Ok, please, could you benchmark your pytorch code like I did here: https://github.com/warmspringwinds/pytorch-segmentation-detection/blob/master/pytorch_segmentation_detection/recipes/pascal_voc/segmentation/resnet_18_8s_benchmark.ipynb (bottom)
Remove all the application-specific code and leave just a simple inference line, then benchmark it. I compared these timings recently for batch sizes of 1 and 2, and they were the same for the pytorch and cpp versions.
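(For reference, a minimal sketch of what such an isolated, synchronized timing could look like in PyTorch; the model construction below is just a grayscale-resnet18 stand-in, not code from the linked notebook.)

import time
import torch
import torch.nn as nn
from torchvision import models

# Stand-in for the custom grayscale resnet18.
model = models.resnet18(num_classes=2)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model = model.cuda().eval()

dummy_input = torch.ones(20, 1, 112, 112, device='cuda')

# Warm-up pass so that kernel selection/initialization is not measured.
with torch.no_grad():
    model(dummy_input)
torch.cuda.synchronize()

iterations = 20
start = time.time()
with torch.no_grad():
    for _ in range(iterations):
        model(dummy_input)
# Synchronize before stopping the clock, since CUDA launches are asynchronous.
torch.cuda.synchronize()
elapsed_ms = (time.time() - start) / iterations * 1000
print('Average execution time: {:.2f} ms'.format(elapsed_ms))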
@warmspringwinds Yes, I can do that. But the point is that if ATen handles CUDA operations properly, a batch size of 1 or 20 should give a similar result.
@pharrellyhy The timing for the pytorch and cpp versions should be similar -- you are right. Let's make sure that you set up the timing experiment properly in the pytorch case. If the timings still differ, I will dig into this.
@warmspringwinds Sure, I will let you know once I get the result. Thanks.
@warmspringwinds Here is the result I got.
dummy_input = torch.ones((1,1,112,112))
%%timeit
loc_out, label_out = model(dummy_input)
3.04 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
dummy_input = torch.ones((20,1,112,112))
%%timeit
loc_out, label_out = model(dummy_input)
9.45 ms ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@warmspringwinds I've checked the source code and saw there is a for loop over the batch in the CONV function: for (int elt = 0; elt < batchSize; elt++). I am not very familiar with CUDA, but I know that if we don't handle CUDA streams properly, the default stream executes sequentially.
@pharrellyhy I will check it for my resnet and a bigger batch size. There is a chance that this might have been fixed in a newer version of ATen.
@warmspringwinds Thanks. Probably... but I'm having a hard time compiling it. If you make any progress, please let me know. :)
@pharrellyhy will do. What version of pytorch did you use?
@warmspringwinds 0.4.0
@warmspringwinds The latest ATen APIs have changed a lot. We might need a big update if we switch to the latest ATen.
@warmspringwinds Any progress? :D