nntrainer
Offload values to disk
During an iteration of forward/backprop, we know that certain values will not be used again for a long time. Such values can be offloaded to disk and cached back in right before use.
In forwarding, a layer `l`'s inputs and outputs will not be used again until this layer is backwarded or a new forward operation starts. So the graph only needs to keep `m` layers in memory out of the total `n` layers (where `m <= n`). For MobileNetV2, which has over 150 layers, it is theoretically easy to offload a layer's inputs/outputs/weights to disk and bring them back when that layer is processed.
The purpose of this optimization is to reduce peak memory consumption at the cost of disk and memory utilization (this assumes that the disk and memory are not the bottleneck, but the processor is).
credits: @jijoongmoon
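As a rough illustration of the idea, here is a minimal sketch (not nntrainer code; `CachedTensor`, `offload()`, `onload()`, and the file layout are hypothetical): an intermediate tensor is written to a backing file and its host buffer released, then read back right before the layer that needs it runs.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical tensor wrapper whose host buffer can be swapped out to disk.
struct CachedTensor {
  std::vector<float> data; // host buffer; empty while offloaded
  std::string swap_path;   // backing file on disk

  // Write the buffer to disk and release the host memory.
  void offload() {
    std::ofstream out(swap_path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char *>(data.data()),
              data.size() * sizeof(float));
    if (!out)
      throw std::runtime_error("offload failed: " + swap_path);
    offloaded_size = data.size();
    data.clear();
    data.shrink_to_fit(); // actually give the memory back
  }

  // Read the buffer back from disk before the layer uses it again.
  void onload() {
    data.resize(offloaded_size);
    std::ifstream in(swap_path, std::ios::binary);
    in.read(reinterpret_cast<char *>(data.data()),
            data.size() * sizeof(float));
    if (!in)
      throw std::runtime_error("onload failed: " + swap_path);
  }

private:
  std::size_t offloaded_size = 0;
};
```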
resnet 18 test (number of params: 11,230,948)
unit: microseconds, 10 iterations

batch size: 10

| key | avg | min | max | sum |
| --- | --- | --- | --- | --- |
| forward + backward | 1084403 | 1031972 | 1185450 | 10844036 |
| offload | 351008 | 316849 | 384459 | 3510083 |
| onload | 11183 | 10496 | 13402 | 111836 |

batch size: 32

| key | avg | min | max | sum |
| --- | --- | --- | --- | --- |
| forward + backward | 3331549 | 3167296 | 3536707 | 33315490 |
| offload | 897888 | 861941 | 948899 | 8978880 |
| onload | 29808 | 25877 | 41906 | 298087 |

batch size: 128

| key | avg | min | max | sum |
| --- | --- | --- | --- | --- |
| forward + backward | 13362108 | 12923511 | 13621666 | 133621085 |
| offload | 3299996 | 3209499 | 3502185 | 32999966 |
| onload | 111764 | 94542 | 139351 | 1117649 |

batch size: 10 + ssd

| key | avg | min | max | sum |
| --- | --- | --- | --- | --- |
| forward + backward | 1101922 | 1001577 | 1183922 | 11019223 |
| offload | 164734 | 38927 | 231023 | 1647340 |
| onload | 11395 | 10172 | 16392 | 113957 |
Is offloading/onloading working in parallel to forward+backward? It would be useful to see this impact.
No, it was not done in parallel.
Roughly:

```
forward() && backward()
saveModel() // with saving inputs as well
readModel() // with reading inputs as well
```
So there will be more things to consider, like how much we are offloading at a time.
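For reference, the per-phase numbers above could be collected with a simple timing wrapper like the one below (a sketch only; `forwardBackward`, `saveModel`, and `readModel` are placeholders, not actual nntrainer APIs).

```cpp
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

// Run a callable and return its wall-clock duration in microseconds.
template <typename Fn>
long long timed_us(Fn &&fn) {
  auto start = Clock::now();
  fn();
  auto end = Clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
    .count();
}

void run_iterations(int iterations) {
  for (int i = 0; i < iterations; ++i) {
    long long fb = timed_us([] { /* forwardBackward(); */ });
    long long off = timed_us([] { /* saveModel(); with inputs as well */ });
    long long on = timed_us([] { /* readModel(); with inputs as well */ });
    std::printf("iter %d: fwd+bwd %lld us, offload %lld us, onload %lld us\n",
                i, fb, off, on);
  }
}
```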
A formal implementation of offloading would involve three threads: loading, working, and saving. As all three tasks would put load on the DRAM, the timings might change. Anyway, this can be done later. The initial estimates provide a good enough comparison, good job.
Offloading design:
- The model would start with N layers/operations in memory, with all the tensors required to execute each of those layers (if forwarding, all the tensors for its forwarding; if backwarding, the corresponding tensors for backwarding).
- Once a layer is executed (let's call it `L_0`), `L_1` (the next layer) will start executing. At this point, `L_0` will be offloaded to the disk. At the same time, `L_{N+1}` will start loading into memory. By design, there will always be exactly one layer being offloaded, one being loaded, and one being executed (synchronization issues are discussed below). This also means that the memory consumed at any given point is the memory of `N+2` layers. The peak memory will be that of the window of `N+2` layers with maximum memory.
- The layer being executed need not wait for the offloading/onloading to finish. Offloading need not wait for the onloading operations to finish. However, onloading has to wait for offloading to finish to guarantee the minimum memory requirements (see the sketch after this list).
- `N` (layers/operations in memory) is a hyperparameter. `N` also acts as a jitter buffer to ensure the smooth execution of the model. Some layers are computation heavy (like convolution) while others can be memory heavy (like ReLU). Keeping `N` layers in memory aims to avoid stalling execution.
  - Design 1: start with `N` fixed and set by the developer.
  - Design 2: allow `N` to vary dynamically every few iterations. If layer execution has to wait for data to be loaded, `N` must be increased. If no layers are waiting for execution but both the offloading and onloading threads are waiting, then `N` can be decreased.
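Below is a minimal sketch of this pipeline, under stated assumptions: the `Layer` methods and `run()` are hypothetical, and a single in-flight offload/onload pair stands in for the separate offloading and onloading threads. After layer `i` executes, it is offloaded in the background and layer `i+N` is prefetched once that offload completes, which keeps roughly `N+2` layers resident and enforces the onload-after-offload rule.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical layer with compute and swap hooks.
struct Layer {
  void execute() { /* hypothetical: run the layer's forward/backward step */ }
  void offload() { /* hypothetical: write tensors to disk, free host memory */ }
  void onload()  { /* hypothetical: read tensors back from disk */ }
};

void run(std::vector<Layer> &layers, std::size_t N) {
  // Layers [0, N) are assumed to already be resident in memory.
  std::future<void> pending; // previous offload + onload pair

  for (std::size_t i = 0; i < layers.size(); ++i) {
    layers[i].execute(); // the executing layer never waits on the pipeline

    // Simplification: allow at most one offload/onload pair in flight,
    // matching "exactly one layer offloaded, loaded, and executed".
    if (pending.valid())
      pending.wait();

    pending = std::async(std::launch::async, [&layers, i, N] {
      layers[i].offload();        // free layer i first...
      if (i + N < layers.size())
        layers[i + N].onload();   // ...then prefetch layer i+N
    });
  }
  if (pending.valid())
    pending.wait();
}
```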
With the above design, there are 4 states for a layer:
- READY_FOR_EXEC: layer is in memory and is ready for execution
- WAIT_FOR_EXEC: layer is being loaded to memory
- WAIT_FOR_OFF: layer is being offloaded to disk
- SWAPPED: layer is on disk
Note that layers in states 1, 2, and 3 consume host memory. The proposed design tries to keep the maximum number of layers in state 1 and the minimum number of layers in states 2 and 3 (this depends on the execution vs disk speed). This ensures maximum efficiency in the utilization of the memory.
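Purely as an illustration (not an existing nntrainer type), these states could be modeled as an enum together with the transitions the design implies:

```cpp
// Hypothetical per-layer state, as described above.
enum class LayerState {
  READY_FOR_EXEC, // in memory, ready for execution
  WAIT_FOR_EXEC,  // being loaded from disk into memory
  WAIT_FOR_OFF,   // being offloaded to disk
  SWAPPED         // resident on disk only
};

// Returns true if the transition is one the design allows:
// offload path: READY_FOR_EXEC -> WAIT_FOR_OFF -> SWAPPED
// onload path:  SWAPPED -> WAIT_FOR_EXEC -> READY_FOR_EXEC
inline bool valid_transition(LayerState from, LayerState to) {
  switch (from) {
  case LayerState::READY_FOR_EXEC: return to == LayerState::WAIT_FOR_OFF;
  case LayerState::WAIT_FOR_OFF:   return to == LayerState::SWAPPED;
  case LayerState::SWAPPED:        return to == LayerState::WAIT_FOR_EXEC;
  case LayerState::WAIT_FOR_EXEC:  return to == LayerState::READY_FOR_EXEC;
  }
  return false;
}
```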