Ideas for reducing allocations?
I've been looking for low-hanging fruit to reduce allocations, but haven't made much of a breakthrough on the total allocated size. #23 has some ideas that halve the number of allocations, but they don't make a dent in the amount of memory allocated.
Looking at a forward pass of YOLO.v3_COCO(), on master:
julia> @btime yolomod(batch)
1.674 s (20354 allocations: 1.29 GiB)
on #23:
1.794 s (9732 allocations: 1.29 GiB)
Timing each layer with TimerOutputs gives:
 ──────────────────────────────────────────────────────────────────────
                               Time                   Allocations
                       ──────────────────────   ───────────────────────
   Tot / % measured:        1504s / 2.84%           45.9GiB / 63.8%

 Section       ncalls     time   %tot      avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────
 chain 1           23    6.52s  15.2%    283ms   5.03GiB  17.2%   224MiB
 chain 2           23    4.85s  11.3%    211ms   4.45GiB  15.2%   198MiB
 chain 30          23    3.66s  8.56%    159ms   2.75GiB  9.40%   122MiB
 chain 4           23    2.76s  6.45%    120ms   2.22GiB  7.60%  99.0MiB
 chain 3           23    2.42s  5.65%    105ms   1.66GiB  5.67%  73.9MiB
 chain 26          23    1.77s  4.13%   76.8ms    661MiB  2.21%  28.7MiB
 chain 6           23    1.65s  3.87%   71.9ms    850MiB  2.84%  37.0MiB
 chain 28          23    1.51s  3.54%   65.7ms   0.98GiB  3.33%  43.4MiB
 chain 7           23    1.36s  3.18%   59.1ms    850MiB  2.84%  37.0MiB
 chain 12          23    1.32s  3.08%   57.3ms    850MiB  2.84%  37.0MiB
 chain 10          23    1.24s  2.90%   53.9ms    850MiB  2.84%  37.0MiB
 chain 5           23    1.21s  2.84%   52.8ms    850MiB  2.84%  37.0MiB
 chain 8           23    1.19s  2.78%   51.7ms    850MiB  2.84%  37.0MiB
 chain 11          23    1.10s  2.57%   47.8ms    850MiB  2.84%  37.0MiB
 chain 9           23    1.09s  2.55%   47.3ms    850MiB  2.84%  37.0MiB
 chain 15          23    855ms  2.00%   37.2ms    425MiB  1.42%  18.5MiB
 chain 13          23    754ms  1.76%   32.8ms    289MiB  0.96%  12.5MiB
 chain 14          23    732ms  1.71%   31.8ms    425MiB  1.42%  18.5MiB
 chain 20          23    683ms  1.60%   29.7ms    425MiB  1.42%  18.5MiB
 chain 16          23    666ms  1.56%   29.0ms    425MiB  1.42%  18.5MiB
 chain 19          23    648ms  1.52%   28.2ms    425MiB  1.42%  18.5MiB
 chain 21          23    640ms  1.50%   27.8ms    425MiB  1.42%  18.5MiB
 chain 18          23    631ms  1.48%   27.4ms    425MiB  1.42%  18.5MiB
 chain 17          23    626ms  1.46%   27.2ms    425MiB  1.42%  18.5MiB
 chain 29          23    568ms  1.33%   24.7ms    364MiB  1.22%  15.8MiB
 chain 27          23    528ms  1.23%   22.9ms    171MiB  0.57%  7.43MiB
 chain 24          23    478ms  1.12%   20.8ms    213MiB  0.71%  9.25MiB
 chain 23          23    476ms  1.11%   20.7ms    213MiB  0.71%  9.25MiB
 chain 25          23    468ms  1.09%   20.3ms    213MiB  0.71%  9.25MiB
 chain 22          23    359ms  0.84%   15.6ms    144MiB  0.48%  6.28MiB
 ──────────────────────────────────────────────────────────────────────
This can be run on #23 with:
using ObjectDetector
yolomod = YOLO.v3_COCO()
batch = emptybatch(yolomod)
res = yolomod(batch, detectThresh=0.2, overlapThresh=0.8) #run this a few times
display(YOLO.to)
Master:
julia> ObjectDetector.benchmark()
┌──────────────────┬─────────┬───────────────┬──────┬──────────────┬────────────────┐
│ Model │ loaded? │ load time (s) │ ran? │ run time (s) │ run time (fps) │
├──────────────────┼─────────┼───────────────┼──────┼──────────────┼────────────────┤
│ v2_tiny_416_COCO │ true │ 0.364 │ true │ 0.2055 │ 4.9 │
│ v3_tiny_416_COCO │ true │ 0.349 │ true │ 0.1802 │ 5.5 │
│ v3_320_COCO │ true │ 2.468 │ true │ 1.6058 │ 0.6 │
│ v3_416_COCO │ true │ 2.693 │ true │ 1.7486 │ 0.6 │
│ v3_608_COCO │ true │ 2.938 │ true │ 1.8223 │ 0.5 │
└──────────────────┴─────────┴───────────────┴──────┴──────────────┴────────────────┘
#23
┌──────────────────┬─────────┬───────────────┬──────┬──────────────┬────────────────┐
│ Model │ loaded? │ load time (s) │ ran? │ run time (s) │ run time (fps) │
├──────────────────┼─────────┼───────────────┼──────┼──────────────┼────────────────┤
│ v2_tiny_416_COCO │ true │ 0.39 │ true │ 0.201 │ 5.0 │
│ v3_tiny_416_COCO │ true │ 1.158 │ true │ 0.2385 │ 4.2 │
│ v3_320_COCO │ true │ 2.682 │ true │ 1.7288 │ 0.6 │
│ v3_416_COCO │ true │ 2.571 │ true │ 1.6893 │ 0.6 │
│ v3_608_COCO │ true │ 2.928 │ true │ 1.5145 │ 0.7 │
└──────────────────┴─────────┴───────────────┴──────┴──────────────┴────────────────┘
I don't think the total size of the allocations is as problematic as the fact that new arrays are allocated at all, when we should be able to do almost everything in-place. https://github.com/r3tex/ObjectDetector.jl/blob/46ce9aaa3a8e1d57dfa65d5f283e7e05691d21ca/src/yolo/yolo.jl#L542
Here we should write yolo.W[0] .= img, for example. I saw you changed that in a couple of places, but I'm fairly certain Flux does a lot of moving around internally, probably in the Conv layers and so on.
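As a minimal sketch of the difference (hypothetical buffer names, not the actual yolo.W layout): broadcasting with .= writes into the existing buffer, whereas plain assignment rebinds the name to a freshly allocated array.

```julia
# Hypothetical preallocated input buffer, standing in for yolo.W[0].
W = zeros(Float32, 416, 416, 3, 1)
img = rand(Float32, 416, 416, 3, 1)

p = pointer(W)
W .= img                 # in-place: fills the existing buffer, no new array
@assert W == img
@assert pointer(W) == p  # still the same memory as before the copy
```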
We could probably add a preallocated matrix in each yolo.out for the output weights we create on line 557. I don't know whether pushing those arrays to outweights just passes references, or whether that also allocates new memory.
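On the reference question: in Julia, push!-ing an array into a Vector stores a reference, not a copy, so the array's data is never duplicated (only the outer vector may reallocate as it grows). A quick sketch with stand-in names:

```julia
buf = Float32[1.0, 2.0, 3.0]
outweights = Vector{Vector{Float32}}()
push!(outweights, buf)             # stores a reference to buf; no data copy
buf[1] = 99f0                      # mutate the original buffer
@assert outweights[1] === buf      # same object, not a copy
@assert outweights[1][1] == 99f0   # mutation visible through the vector
```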
Finally, the last parts should be optimized for the GPU. For instance, the keepdetections function is really suboptimal. I'm a bit embarrassed that I never managed to write a better one, but a single kernel with cumsum and everything else would need nontrivial tree-like algorithms to scale across the number of CUDA threads.
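For reference, the cumsum approach mentioned above can be sketched on the CPU (hypothetical shapes, not the actual keepdetections code; a real CUDA kernel would compute the prefix sum with a parallel scan): the running sum of the keep mask assigns each surviving column its output slot, so the final scatter step parallelizes trivially.

```julia
A = rand(Float32, 4, 8)       # detections stored one per column
keep = A[1, :] .> 0.5f0       # keep mask from some hypothetical score row
slots = cumsum(keep)          # prefix sum: output slot for each kept column
out = similar(A, 4, count(keep))
for j in eachindex(keep)      # scatter: each kept column to its slot
    keep[j] && (out[:, slots[j]] = A[:, j])
end
@assert out == A[:, keep]     # same result as plain logical indexing
```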
Everything after that should be on GPU as well.
Good tips!
I preallocated yolo.W[0] and made the suggested change in #23. It doesn't look like a substantial change, but it's good to do.
I've also preallocated outweights, which reduced GPU allocation from 9 MB to 2 MB, but made no dent in the 1.29 GiB allocated on the CPU.
I just added a different keepdetections approach for Cu. What do you think?
https://github.com/r3tex/ObjectDetector.jl/blob/cf2c149c68bb3dbd5b0f57965c14caa18c115c51/src/yolo/yolo.jl#L491-L494
We get the same times in quick tests. Ideally we'd use a repeat that doesn't copy to generate the logical index (added here), but I couldn't figure one out.
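One way to sidestep the materialized repeat entirely (a sketch with hypothetical shapes, not the code at that link) is to index columns with a per-column Bool mask: logical indexing along the second dimension selects whole columns, so the repeated matrix never needs to exist.

```julia
A = rand(Float32, 85, 100)    # one detection per column
conf = @view A[5, :]          # hypothetical objectness row, no copy
keep = conf .> 0.2f0          # one Bool per column; no repeat(...) required
kept = A[:, keep]             # logical indexing selects the kept columns
@assert size(kept) == (85, count(keep))
```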
Btw, up until line 577 it takes about 7 ms, whereas the full run is 31 ms.
Is that whole CPU allocation coming from the last part after keepdetections?
Your version of that function is certainly more readable. I would have expected it to be slower since you're creating a new array, but since mine has two kernels and the cumsum is suboptimal, let's keep your simpler version.