dlib-pytorch-benchmark
A very naive and simple benchmark between dlib master and PyTorch 1.4.1 in terms of space and time.
These benchmarks were run on an NVIDIA GeForce GTX 1080 Ti with CUDA 10.2.89 and cuDNN 7.6.5.32 on Arch Linux.
Model instantiation
This is probably a completely useless benchmark, but it's provided for completeness nonetheless.
PyTorch
from torchvision.models import resnet50

model = resnet50(pretrained=False)
dlib
// ResNet-50 with dlib::affine layers (batch normalization in inference mode)
resnet<dlib::affine>::n50 net;
1st inference
This is also not very meaningful, since most of the time is spent allocating memory on the GPU.
PyTorch
x = torch.zeros(32, 3, 224, 224)
x = x.cuda()
model = model.cuda()
# time measurement start
out = model(x)
# time measurement end
dlib
dlib::matrix<dlib::rgb_pixel> image(224, 224);
dlib::assign_all_pixels(image, dlib::rgb_pixel(0, 0, 0));
std::vector<dlib::matrix<dlib::rgb_pixel>> minibatch(512, image);
At this point, we could just call:
const auto out = net(minibatch, 512);
But that wouldn't be a fair comparison, since it would do some extra work (sketched below in PyTorch terms):
- apply softmax to the output of the net
- transfer the result from the device to the host
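In PyTorch terms, that extra work corresponds roughly to the snippet below (a sketch only, reusing out from the earlier PyTorch snippet; it isn't part of anything being measured):

```python
# roughly what dlib's convenience call would add on top of the raw forward pass
probs = torch.softmax(out, dim=1)  # softmax over the class scores
preds = probs.cpu()                # device-to-host transfer
```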
As a result, we need to forward a tensor that is already on the device. There are several ways of doing this; here's one:
dlib::resizable_tensor x;
net.to_tensor(minibatch.begin(), minibatch.end(), x);
x.device(); // make sure the tensor data is already on the device
// time measurement start
net.forward(x);
// time measurement end
Now dlib is doing exactly the same operations as PyTorch, as far as I know.
Next inferences
In my opinion, the most important benchmark is this one. It measures how the network performs once it has been "warmed up".
For this part, I decided not to count the CUDA synchronization time, only the inference time for a tensor that is already on the device.
PyTorch
In PyTorch, every time I forward the network, I make sure all the transfers between the host and the device have been finished:
for i in range(10):
    x = x.cpu().cuda()
    # time measurement start
    out = model(x)
    # time measurement end
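The comments above only mark where the clock starts and stops; the actual measurement code isn't shown. Here is a minimal sketch of one way to take these timings (my own illustration, not the repository's script), with explicit synchronization so the GPU has actually finished before the clock stops:

```python
import time

import torch

timings = []
for i in range(10):
    x = x.cpu().cuda()        # host/device round trip, as above
    torch.cuda.synchronize()  # make sure the transfer is done before timing
    start = time.perf_counter()
    out = model(x)
    torch.cuda.synchronize()  # wait until the forward pass has finished
    timings.append((time.perf_counter() - start) * 1000)  # ms

print(f"average inference time: {sum(timings) / len(timings):.3f} ms")
```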
dlib
For dlib I followed a similar pattern:
for (int i = 0; i < 10; ++i)
{
    // mirror the host/device round trip done in the PyTorch version
    x.host();
    x.device();
    // time measurement start
    net.subnet().forward(x);
    // time measurement end
}
Results
The first table shows the VRAM usage in MiB and the average timings in ms for different batch sizes, for an input tensor of shape N×3×224×224. The Factor columns are the dlib value divided by the PyTorch value, so values below 1 favor dlib.
batch size | dlib memory (MiB) | PyTorch memory (MiB) | Factor | dlib time (ms) | PyTorch time (ms) | Factor |
---|---|---|---|---|---|---|
1 | 638 | 721 | 0.885 | 6.886 | 10.048 | 0.685 |
2 | 710 | 719 | 0.987 | 7.845 | 11.449 | 0.685 |
4 | 836 | 739 | 1.131 | 11.373 | 14.095 | 0.807 |
8 | 1074 | 775 | 1.386 | 17.504 | 19.303 | 0.907 |
16 | 1512 | 889 | 1.701 | 31.288 | 30.628 | 1.022 |
32 | 2510 | 1219 | 2.059 | 60.348 | 56.571 | 1.067 |
64 | 4342 | 1699 | 2.556 | 117.544 | 105.139 | 1.118 |
128 | 7976 | 2313 | 3.448 | 224.402 | 202.120 | 1.110 |
Results for the complete training cycle (transfer + forward + backward + loss + optimize):
batch size | dlib memory (MiB) | PyTorch memory (MiB) | Factor | dlib time (ms) | PyTorch time (ms) | Factor |
---|---|---|---|---|---|---|
1 | 973 | 991 | 0.982 | 39.292 | 47.571 | 0.826 |
2 | 1248 | 1119 | 1.115 | 29.308 | 51.219 | 0.572 |
4 | 1708 | 1281 | 1.333 | 40.95 | 60.329 | 0.679 |
8 | 2548 | 1645 | 1.549 | 65.193 | 78.995 | 0.825 |
16 | 4096 | 2389 | 1.715 | 113.596 | 116.117 | 0.978 |
32 | 7240 | 4061 | 1.783 | 218.968 | 203.942 | 1.074 |
Conclusions
Regarding the inference time, dlib seems to be substantially faster for small batch sizes (up to 8 samples), taking 10-30% less time than PyTorch. As the batch size increases, the difference between the two toolkits becomes minor, and PyTorch ends up being slightly faster for the larger batch sizes.
For the training time, dlib also seems to be faster than PyTorch by a substantial amount for small batch sizes.
As for the memory usage, PyTorch modules don't keep the intermediate results of a forward pass around: by default, one can't access the intermediate values of the model after calling it. On the dlib side, we can call subnet() on our net and access the outputs (and the gradients, if we performed a backward pass) of any layer, which makes it very easy to extract attention maps and perform Grad-CAM visualizations. That can explain the differences in memory usage between the two toolkits.
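For comparison, getting at intermediate activations in PyTorch typically means registering forward hooks by hand; here is a minimal sketch (my own illustration, using torchvision's ResNet-50 and its layer4 stage as an example):

```python
import torch
from torchvision.models import resnet50

model = resnet50(pretrained=False).cuda().eval()
activations = {}

def save_activation(name):
    # build a hook that stores the layer's output under the given name
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# grab the output of the last residual stage (the usual target for Grad-CAM)
model.layer4.register_forward_hook(save_activation("layer4"))

with torch.no_grad():
    out = model(torch.zeros(1, 3, 224, 224).cuda())

print(activations["layer4"].shape)  # torch.Size([1, 2048, 7, 7])
```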
That said, I did observe that PyTorch's memory usage peaks at 2929 MiB and 3843 MiB for batch sizes of 1 and 128, respectively. This is caused by the torch.backends.cudnn.benchmark = True setting.
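For reference, that flag is set once near the top of the script, before the first forward pass; a minimal sketch of the presumed setup (the actual benchmark script may differ):

```python
import torch

# let cuDNN try several convolution algorithms and cache the fastest one for
# the fixed input shape: faster steady-state kernels at the cost of extra memory
torch.backends.cudnn.benchmark = True
```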