nntrainer
Offload values to disk
During an iteration of forward/backprop, we know that certain values will not be used again for a long time. Such values can be offloaded to disk and cached back in right before use.
In forwarding, a layer `l`'s inputs and outputs will not be used again until this layer is backwarded or a new forward operation starts. So the graph only needs to keep `m` layers in memory out of the total `n` layers (where `m <= n`). For MobileNetV2, which has over 150 layers, it is theoretically easy to offload a layer's inputs/outputs/weights to disk and bring them back when that layer is processed.
The purpose of this optimization is to reduce peak memory consumption at the cost of disk and memory utilization (this assumes that the disk and memory are not the bottleneck, but the processor is).
credits: @jijoongmoon
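As a rough illustration of the idea, here is a minimal sketch (not nntrainer code; `CachedTensor`, `offload()`, `onload()`, and the file layout are hypothetical): an intermediate tensor is written to a backing file and its host buffer released, then read back right before the layer that needs it runs.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical tensor wrapper whose host buffer can be swapped out to disk.
struct CachedTensor {
  std::vector<float> data; // host buffer; empty while offloaded
  std::string swap_path;   // backing file on disk

  // Write the buffer to disk and release the host memory.
  void offload() {
    std::ofstream out(swap_path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char *>(data.data()),
              data.size() * sizeof(float));
    if (!out)
      throw std::runtime_error("offload failed: " + swap_path);
    offloaded_size = data.size();
    data.clear();
    data.shrink_to_fit(); // actually give the memory back
  }

  // Read the buffer back from disk before the layer uses it again.
  void onload() {
    data.resize(offloaded_size);
    std::ifstream in(swap_path, std::ios::binary);
    in.read(reinterpret_cast<char *>(data.data()),
            data.size() * sizeof(float));
    if (!in)
      throw std::runtime_error("onload failed: " + swap_path);
  }

private:
  std::size_t offloaded_size = 0;
};
```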
resnet 18 test (number of params: 11,230,948)
unit: microseconds, 10 iterations

batch size: 10

| key | avg | min | max | sum |
| --- | --- | --- | --- | --- |
| forward + backward | 1084403 | 1031972 | 1185450 | 10844036 |
| offload | 351008 | 316849 | 384459 | 3510083 |
| onload | 11183 | 10496 | 13402 | 111836 |

batch size: 32

| key | avg | min | max | sum |
| --- | --- | --- | --- | --- |
| forward + backward | 3331549 | 3167296 | 3536707 | 33315490 |
| offload | 897888 | 861941 | 948899 | 8978880 |
| onload | 29808 | 25877 | 41906 | 298087 |

batch size: 128

| key | avg | min | max | sum |
| --- | --- | --- | --- | --- |
| forward + backward | 13362108 | 12923511 | 13621666 | 133621085 |
| offload | 3299996 | 3209499 | 3502185 | 32999966 |
| onload | 111764 | 94542 | 139351 | 1117649 |

batch size: 10 + ssd

| key | avg | min | max | sum |
| --- | --- | --- | --- | --- |
| forward + backward | 1101922 | 1001577 | 1183922 | 11019223 |
| offload | 164734 | 38927 | 231023 | 1647340 |
| onload | 11395 | 10172 | 16392 | 113957 |
Is offloading/onloading working in parallel to forward+backward? It would be useful to see this impact.
No, it was not done in parallel.
Roughly:

```
forward() && backward()
saveModel() // with saving inputs as well
readModel() // with reading inputs as well
```
So there will be more things to consider, like how much we are offloading at a time.
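For reference, the per-phase numbers above could be collected with a simple timing wrapper like the one below (a sketch only; `forwardBackward`, `saveModel`, and `readModel` are placeholders, not actual nntrainer APIs).

```cpp
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

// Run a callable and return its wall-clock duration in microseconds.
template <typename Fn>
long long timed_us(Fn &&fn) {
  auto start = Clock::now();
  fn();
  auto end = Clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
    .count();
}

void run_iterations(int iterations) {
  for (int i = 0; i < iterations; ++i) {
    long long fb = timed_us([] { /* forwardBackward(); */ });
    long long off = timed_us([] { /* saveModel(); with inputs as well */ });
    long long on = timed_us([] { /* readModel(); with inputs as well */ });
    std::printf("iter %d: fwd+bwd %lld us, offload %lld us, onload %lld us\n",
                i, fb, off, on);
  }
}
```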
A formal implementation of offloading would involve three threads: loading, working, and saving. As all three tasks would put load on the DRAM, the timings might change. Anyway, this can be done later. The initial estimates provide a good enough comparison, good job.
Offloading design:
- The model would start with N layers/operations in memory, with all the tensors required to execute each of those layers (if forwarding, all the tensors for its forwarding; if backwarding, the corresponding tensors for backwarding).
- Once a layer is executed (let's call it `L_0`), `L_1` (the next layer) will start executing. At this point, `L_0` will be offloaded to the disk. At the same time, `L_{N+1}` will start loading into memory. By design, there will always be exactly one layer being offloaded, one being loaded, and one being executed (synchronization issues are discussed below). This also means that the memory consumed at any given point is the memory of `N+2` layers. The peak memory will be that of the window of `N+2` layers with maximum memory.
- The layer being executed need not wait for the offloading/onloading to finish. Offloading need not wait for the onloading operations to finish. However, onloading has to wait for offloading to finish to guarantee the minimum memory requirements (see the sketch after this list).
- `N` (layers/operations in memory) is a hyperparameter. `N` also acts as a jitter buffer to ensure the smooth execution of the model. Some layers are computation heavy (like convolution) while others can be memory heavy (like ReLU). Keeping `N` layers in memory aims to avoid stalling execution.
  - Design 1: start with `N` fixed and set by the developer.
  - Design 2: allow `N` to vary dynamically every few iterations. If layer execution has to wait for data to be loaded, `N` must be increased. If no layers are waiting for execution but both the offloading and onloading threads are waiting, then `N` can be decreased.
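Below is a minimal sketch of this pipeline, under stated assumptions: the `Layer` methods and `run()` are hypothetical, and a single in-flight offload/onload pair stands in for the separate offloading and onloading threads. After layer `i` executes, it is offloaded in the background and layer `i+N` is prefetched once that offload completes, which keeps roughly `N+2` layers resident and enforces the onload-after-offload rule.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical layer with compute and swap hooks.
struct Layer {
  void execute() { /* hypothetical: run the layer's forward/backward step */ }
  void offload() { /* hypothetical: write tensors to disk, free host memory */ }
  void onload()  { /* hypothetical: read tensors back from disk */ }
};

void run(std::vector<Layer> &layers, std::size_t N) {
  // Layers [0, N) are assumed to already be resident in memory.
  std::future<void> pending; // previous offload + onload pair

  for (std::size_t i = 0; i < layers.size(); ++i) {
    layers[i].execute(); // the executing layer never waits on the pipeline

    // Simplification: allow at most one offload/onload pair in flight,
    // matching "exactly one layer offloaded, loaded, and executed".
    if (pending.valid())
      pending.wait();

    pending = std::async(std::launch::async, [&layers, i, N] {
      layers[i].offload();        // free layer i first...
      if (i + N < layers.size())
        layers[i + N].onload();   // ...then prefetch layer i+N
    });
  }
  if (pending.valid())
    pending.wait();
}
```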
With the above design, there are 4 states for a layer:
- READY_FOR_EXEC: layer is in memory and is ready for execution
- WAIT_FOR_EXEC: layer is being loaded to memory
- WAIT_FOR_OFF: layer is being offloaded to disk
- SWAPPED: layer is on disk
Note that layers in states 1, 2, and 3 consume host memory. The proposed design tries to keep the maximum number of layers in state 1 and the minimum number of layers in states 2 and 3 (this depends on the execution vs disk speed). This ensures maximum efficiency in the utilization of the memory.
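Purely as an illustration (not an existing nntrainer type), these states could be modeled as an enum together with the transitions the design implies:

```cpp
// Hypothetical per-layer state, as described above.
enum class LayerState {
  READY_FOR_EXEC, // in memory, ready for execution
  WAIT_FOR_EXEC,  // being loaded from disk into memory
  WAIT_FOR_OFF,   // being offloaded to disk
  SWAPPED         // resident on disk only
};

// Returns true if the transition is one the design allows:
// offload path: READY_FOR_EXEC -> WAIT_FOR_OFF -> SWAPPED
// onload path:  SWAPPED -> WAIT_FOR_EXEC -> READY_FOR_EXEC
inline bool valid_transition(LayerState from, LayerState to) {
  switch (from) {
  case LayerState::READY_FOR_EXEC: return to == LayerState::WAIT_FOR_OFF;
  case LayerState::WAIT_FOR_OFF:   return to == LayerState::SWAPPED;
  case LayerState::SWAPPED:        return to == LayerState::WAIT_FOR_EXEC;
  case LayerState::WAIT_FOR_EXEC:  return to == LayerState::READY_FOR_EXEC;
  }
  return false;
}
```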