Ideas for reducing allocations?
I've been looking for low-hanging fruit to reduce allocations, but haven't made much of a breakthrough on the total allocated size. #23 has some ideas that halve the number of allocations, but they don't make a dent in the amount of memory allocated.
Looking at a forward pass of YOLO.v3_COCO(), on master:
julia> @btime yolomod(batch)
1.674 s (20354 allocations: 1.29 GiB)
on #23:
1.794 s (9732 allocations: 1.29 GiB)
Timing each layer with TimerOutputs gives:
 ──────────────────────────────────────────────────────────────────────
                               Time                   Allocations
                       ──────────────────────   ───────────────────────
   Tot / % measured:        1504s / 2.84%           45.9GiB / 63.8%

 Section       ncalls     time   %tot      avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────
 chain 1           23    6.52s  15.2%    283ms   5.03GiB  17.2%   224MiB
 chain 2           23    4.85s  11.3%    211ms   4.45GiB  15.2%   198MiB
 chain 30          23    3.66s  8.56%    159ms   2.75GiB  9.40%   122MiB
 chain 4           23    2.76s  6.45%    120ms   2.22GiB  7.60%  99.0MiB
 chain 3           23    2.42s  5.65%    105ms   1.66GiB  5.67%  73.9MiB
 chain 26          23    1.77s  4.13%   76.8ms    661MiB  2.21%  28.7MiB
 chain 6           23    1.65s  3.87%   71.9ms    850MiB  2.84%  37.0MiB
 chain 28          23    1.51s  3.54%   65.7ms   0.98GiB  3.33%  43.4MiB
 chain 7           23    1.36s  3.18%   59.1ms    850MiB  2.84%  37.0MiB
 chain 12          23    1.32s  3.08%   57.3ms    850MiB  2.84%  37.0MiB
 chain 10          23    1.24s  2.90%   53.9ms    850MiB  2.84%  37.0MiB
 chain 5           23    1.21s  2.84%   52.8ms    850MiB  2.84%  37.0MiB
 chain 8           23    1.19s  2.78%   51.7ms    850MiB  2.84%  37.0MiB
 chain 11          23    1.10s  2.57%   47.8ms    850MiB  2.84%  37.0MiB
 chain 9           23    1.09s  2.55%   47.3ms    850MiB  2.84%  37.0MiB
 chain 15          23    855ms  2.00%   37.2ms    425MiB  1.42%  18.5MiB
 chain 13          23    754ms  1.76%   32.8ms    289MiB  0.96%  12.5MiB
 chain 14          23    732ms  1.71%   31.8ms    425MiB  1.42%  18.5MiB
 chain 20          23    683ms  1.60%   29.7ms    425MiB  1.42%  18.5MiB
 chain 16          23    666ms  1.56%   29.0ms    425MiB  1.42%  18.5MiB
 chain 19          23    648ms  1.52%   28.2ms    425MiB  1.42%  18.5MiB
 chain 21          23    640ms  1.50%   27.8ms    425MiB  1.42%  18.5MiB
 chain 18          23    631ms  1.48%   27.4ms    425MiB  1.42%  18.5MiB
 chain 17          23    626ms  1.46%   27.2ms    425MiB  1.42%  18.5MiB
 chain 29          23    568ms  1.33%   24.7ms    364MiB  1.22%  15.8MiB
 chain 27          23    528ms  1.23%   22.9ms    171MiB  0.57%  7.43MiB
 chain 24          23    478ms  1.12%   20.8ms    213MiB  0.71%  9.25MiB
 chain 23          23    476ms  1.11%   20.7ms    213MiB  0.71%  9.25MiB
 chain 25          23    468ms  1.09%   20.3ms    213MiB  0.71%  9.25MiB
 chain 22          23    359ms  0.84%   15.6ms    144MiB  0.48%  6.28MiB
 ──────────────────────────────────────────────────────────────────────
This can be run on #23 with:
using ObjectDetector
yolomod = YOLO.v3_COCO()
batch = emptybatch(yolomod)
res = yolomod(batch, detectThresh=0.2, overlapThresh=0.8) #run this a few times
display(YOLO.to)
Master:
julia> ObjectDetector.benchmark()
┌──────────────────┬─────────┬───────────────┬──────┬──────────────┬────────────────┐
│ Model │ loaded? │ load time (s) │ ran? │ run time (s) │ run time (fps) │
├──────────────────┼─────────┼───────────────┼──────┼──────────────┼────────────────┤
│ v2_tiny_416_COCO │ true │ 0.364 │ true │ 0.2055 │ 4.9 │
│ v3_tiny_416_COCO │ true │ 0.349 │ true │ 0.1802 │ 5.5 │
│ v3_320_COCO │ true │ 2.468 │ true │ 1.6058 │ 0.6 │
│ v3_416_COCO │ true │ 2.693 │ true │ 1.7486 │ 0.6 │
│ v3_608_COCO │ true │ 2.938 │ true │ 1.8223 │ 0.5 │
└──────────────────┴─────────┴───────────────┴──────┴──────────────┴────────────────┘
#23
┌──────────────────┬─────────┬───────────────┬──────┬──────────────┬────────────────┐
│ Model │ loaded? │ load time (s) │ ran? │ run time (s) │ run time (fps) │
├──────────────────┼─────────┼───────────────┼──────┼──────────────┼────────────────┤
│ v2_tiny_416_COCO │ true │ 0.39 │ true │ 0.201 │ 5.0 │
│ v3_tiny_416_COCO │ true │ 1.158 │ true │ 0.2385 │ 4.2 │
│ v3_320_COCO │ true │ 2.682 │ true │ 1.7288 │ 0.6 │
│ v3_416_COCO │ true │ 2.571 │ true │ 1.6893 │ 0.6 │
│ v3_608_COCO │ true │ 2.928 │ true │ 1.5145 │ 0.7 │
└──────────────────┴─────────┴───────────────┴──────┴──────────────┴────────────────┘
I don't think the total size of the allocations is as problematic as the fact that new arrays are allocated at all, when we should be able to do almost everything in-place. https://github.com/r3tex/ObjectDetector.jl/blob/46ce9aaa3a8e1d57dfa65d5f283e7e05691d21ca/src/yolo/yolo.jl#L542
Here we should write yolo.W[0] .= img, for example. I saw you changed that in a couple of places, but I'm fairly certain Flux does a lot of moving around internally, probably in the Conv layers and so on.
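As a minimal sketch of the difference (hypothetical buffer names, not the actual yolo.W layout): broadcasting with .= writes into the existing buffer, whereas plain assignment rebinds the name to a freshly allocated array.

```julia
# Hypothetical preallocated input buffer, standing in for yolo.W[0].
W = zeros(Float32, 416, 416, 3, 1)
img = rand(Float32, 416, 416, 3, 1)

p = pointer(W)
W .= img                 # in-place: fills the existing buffer, no new array
@assert W == img
@assert pointer(W) == p  # still the same memory as before the copy
```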
We could probably add a preallocated matrix in each yolo.out for the output weights we create on line 557. I don't know whether pushing those arrays to outweights just passes references, or whether that also allocates new memory.
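On the reference question: in Julia, push!-ing an array into a Vector stores a reference, not a copy, so the array's data is never duplicated (only the outer vector may reallocate as it grows). A quick sketch with stand-in names:

```julia
buf = Float32[1.0, 2.0, 3.0]
outweights = Vector{Vector{Float32}}()
push!(outweights, buf)             # stores a reference to buf; no data copy
buf[1] = 99f0                      # mutate the original buffer
@assert outweights[1] === buf      # same object, not a copy
@assert outweights[1][1] == 99f0   # mutation visible through the vector
```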
Finally, the last parts should be optimized for the GPU. For instance, the keepdetections function is really suboptimal. I'm a bit embarrassed that I never managed to write a better one, but a single kernel with cumsum and everything else would need nontrivial tree-like algorithms to scale across the number of CUDA threads.
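For reference, the cumsum approach mentioned above can be sketched on the CPU (hypothetical shapes, not the actual keepdetections code; a real CUDA kernel would compute the prefix sum with a parallel scan): the running sum of the keep mask assigns each surviving column its output slot, so the final scatter step parallelizes trivially.

```julia
A = rand(Float32, 4, 8)       # detections stored one per column
keep = A[1, :] .> 0.5f0       # keep mask from some hypothetical score row
slots = cumsum(keep)          # prefix sum: output slot for each kept column
out = similar(A, 4, count(keep))
for j in eachindex(keep)      # scatter: each kept column to its slot
    keep[j] && (out[:, slots[j]] = A[:, j])
end
@assert out == A[:, keep]     # same result as plain logical indexing
```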
Everything after that should be on GPU as well.
Good tips!
I preallocated yolo.W[0] and made the suggested change in #23. It doesn't look like a substantial change, but it's good to do.
I've also preallocated outweights, which reduced GPU allocation from 9 MB to 2 MB, but made no dent in the 1.29 GiB allocated on the CPU.
I just added a different keepdetections approach for Cu. What do you think?
https://github.com/r3tex/ObjectDetector.jl/blob/cf2c149c68bb3dbd5b0f57965c14caa18c115c51/src/yolo/yolo.jl#L491-L494
We get the same times in quick tests. Ideally we'd use a repeat that doesn't copy to generate the logical index (added here), but I couldn't figure one out.
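One way to sidestep the materialized repeat entirely (a sketch with hypothetical shapes, not the code at that link) is to index columns with a per-column Bool mask: logical indexing along the second dimension selects whole columns, so the repeated matrix never needs to exist.

```julia
A = rand(Float32, 85, 100)    # one detection per column
conf = @view A[5, :]          # hypothetical objectness row, no copy
keep = conf .> 0.2f0          # one Bool per column; no repeat(...) required
kept = A[:, keep]             # logical indexing selects the kept columns
@assert size(kept) == (85, count(keep))
```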
Btw, up until line 577 it takes about 7 ms, whereas the full run is 31 ms.
Is that whole CPU allocation coming from the last part after keepdetections?
Your version of that function is certainly more readable. I would have expected it to be slower since you're creating a new array, but since mine has two kernels and the cumsum is suboptimal, let's keep your simpler version.