
Offload values to disk

Open kparichay opened this issue 3 years ago • 6 comments

During an iteration of forwarding/backprop, we know that certain values are not going to be used for a while. Such values can be offloaded to disk and loaded back right before use.

In forwarding, a layer l's inputs and outputs will not be used again until this layer is backwarded or a new forward operation starts. So, the graph only needs to keep m layers in memory out of the total n layers (where m <= n). For MobileNetV2, which has over 150 layers, it is theoretically easy to offload layers' inputs/outputs/weights to disk and load them back when they are needed.

The purpose of this optimization is to reduce the peak memory consumption at the cost of additional disk and memory traffic (this assumes that disk and memory are not the bottleneck, but the processor is).
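To make the idea concrete, here is a minimal sketch of offloading/onloading a single buffer, assuming raw binary files and `std::vector<float>` buffers; this is not nntrainer's actual tensor API:

```cpp
// Minimal sketch: dump a tensor's buffer to a per-tensor file when it will
// not be needed for a while, and read it back right before use.
// File layout and buffer type are assumptions for illustration.
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

void offload_to_disk(const std::vector<float> &buf, const std::string &path) {
  std::ofstream out(path, std::ios::binary | std::ios::trunc);
  out.write(reinterpret_cast<const char *>(buf.data()),
            buf.size() * sizeof(float));
}

std::vector<float> onload_from_disk(const std::string &path, std::size_t len) {
  std::vector<float> buf(len);
  std::ifstream in(path, std::ios::binary);
  in.read(reinterpret_cast<char *>(buf.data()), len * sizeof(float));
  return buf;
}
```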

credits: @jijoongmoon

kparichay · Dec 07 '20 01:12

:octocat: cibot: Thank you for posting issue #789. The person in charge will reply soon.

taos-ci · Dec 07 '20 01:12

resnet 18 test (number of params 11,230,948)

unit: microseconds

10 iterations

batch size: 10

| key | avg | min | max | sum |
|---|---:|---:|---:|---:|
| forward + backward | 1084403 | 1031972 | 1185450 | 10844036 |
| offload | 351008 | 316849 | 384459 | 3510083 |
| onload | 11183 | 10496 | 13402 | 111836 |

batch size: 32

| key | avg | min | max | sum |
|---|---:|---:|---:|---:|
| forward + backward | 3331549 | 3167296 | 3536707 | 33315490 |
| offload | 897888 | 861941 | 948899 | 8978880 |
| onload | 29808 | 25877 | 41906 | 298087 |

batch size: 128

| key | avg | min | max | sum |
|---|---:|---:|---:|---:|
| forward + backward | 13362108 | 12923511 | 13621666 | 133621085 |
| offload | 3299996 | 3209499 | 3502185 | 32999966 |
| onload | 111764 | 94542 | 139351 | 1117649 |

batch size: 10 + ssd

| key | avg | min | max | sum |
|---|---:|---:|---:|---:|
| forward + backward | 1101922 | 1001577 | 1183922 | 11019223 |
| offload | 164734 | 38927 | 231023 | 1647340 |
| onload | 11395 | 10172 | 16392 | 113957 |

zhoonit · Jul 01 '21 08:07

Is offloading/onloading running in parallel with forward+backward? It would be useful to see its impact.

kparichay · Jul 01 '21 08:07

No, it was not done in parallel.

Roughly:

```cpp
forward() && backward();
saveModel(); // with saving inputs as well
readModel(); // with reading inputs as well
```

So there will be more things to consider, like how much we offload at a time.
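For concreteness, the per-iteration measurement could be structured like the sketch below; `time_us` and the placeholder lambdas are assumptions, not the actual code used to produce the numbers above:

```cpp
// Sketch of the sequential measurement described above (microseconds per
// iteration). The lambda bodies are placeholders for the actual nntrainer
// calls: forward()/backward(), saveModel(), readModel().
#include <chrono>
#include <cstdio>

template <typename Fn> long long time_us(Fn &&fn) {
  auto begin = std::chrono::steady_clock::now();
  fn();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end - begin)
      .count();
}

void measure_iteration() {
  long long fwd_bwd = time_us([] { /* forward() && backward() */ });
  long long offload = time_us([] { /* saveModel(), including inputs */ });
  long long onload  = time_us([] { /* readModel(), including inputs */ });
  std::printf("fwd+bwd: %lld us, offload: %lld us, onload: %lld us\n",
              fwd_bwd, offload, onload);
}
```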

zhoonit · Jul 01 '21 08:07

A formal implementation of offloading would involve three threads: loading, working, and saving. As all three tasks would put load on the DRAM, the timings might change. Anyway, this can be done later; the initial estimates provide a good enough comparison, good job.
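For illustration only, a minimal sketch of such a three-thread split using blocking queues; layer indices stand in for real layer data, the memory-window synchronization is omitted, and this is an assumption rather than nntrainer code:

```cpp
// Minimal sketch of the loading/working/saving split with std::thread and
// mutex-protected queues. Illustration only.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

template <typename T> class BlockingQueue {
  std::queue<T> q;
  std::mutex m;
  std::condition_variable cv;

public:
  void push(T v) {
    { std::lock_guard<std::mutex> lk(m); q.push(std::move(v)); }
    cv.notify_one();
  }
  T pop() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return !q.empty(); });
    T v = std::move(q.front());
    q.pop();
    return v;
  }
};

int main() {
  const int num_layers = 8;
  BlockingQueue<int> ready; // loaded layers, waiting for execution
  BlockingQueue<int> done;  // executed layers, waiting to be offloaded

  std::thread loader([&] {
    for (int l = 0; l < num_layers; ++l)
      ready.push(l);          // read layer l's tensors from disk
  });
  std::thread worker([&] {
    for (int i = 0; i < num_layers; ++i)
      done.push(ready.pop()); // run forward/backward for the loaded layer
  });
  std::thread saver([&] {
    for (int i = 0; i < num_layers; ++i)
      done.pop();             // write the executed layer's tensors to disk
  });

  loader.join();
  worker.join();
  saver.join();
  return 0;
}
```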

kparichay · Jul 02 '21 03:07

Offloading design:

  • The model would start with N layers/operations in memory, along with all the tensors required to execute each of those layers (if forwarding, the tensors for its forwarding; if backwarding, the corresponding tensors for backwarding).
  • Once a layer is executed (call it L_0), L_1 (the next layer) will start executing. At this point, L_0 will be offloaded to disk and, at the same time, L_N+1 will start loading into memory. By design, there will always be exactly one layer being offloaded, one being loaded, and one being executed (synchronization issues are discussed below). This also means that the memory consumed at any given point is the memory of N+2 layers; the peak memory will be that of the window of N+2 layers with the maximum memory.
  • The layer being executed need not wait for the offloading/onloading to finish, and offloading need not wait for onloading to finish. However, onloading has to wait for offloading to finish to guarantee the minimum memory requirement.
  • N (the number of layers/operations kept in memory) is a hyperparameter. N also acts as a jitter buffer to ensure smooth execution of the model: some layers are computation heavy (like convolution) while others are memory heavy (like ReLU), and keeping N layers in memory aims to keep execution from stalling.
    • Design 1: start with a fixed N set by the developer.
    • Design 2: allow N to vary dynamically every few iterations. If layer execution has to wait for data to be loaded, N must be increased; if no layer ever waits for execution while both the offloading and onloading threads are idle, N can be decreased (a sketch follows this list).
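As referenced in Design 2 above, a minimal illustration of how N could be adjusted between iterations; the counters, bounds, and adjustment step are assumptions:

```cpp
// Sketch of Design 2: tune the in-memory window N every few iterations based
// on observed stalls. Thresholds and step size are assumptions.
struct WindowTuner {
  unsigned n;            /**< layers kept in memory */
  unsigned min_n, max_n; /**< bounds set by the developer */

  void adjust(unsigned exec_stalls, unsigned io_idle_count) {
    if (exec_stalls > 0 && n < max_n)
      ++n; /**< execution waited for data to load: grow the window */
    else if (exec_stalls == 0 && io_idle_count > 0 && n > min_n)
      --n; /**< no stalls and both I/O threads were idle: shrink the window */
  }
};
```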

With the above design, there are 4 states for a layer:

  1. READY_FOR_EXEC: layer is in memory and is ready for execution
  2. WAIT_FOR_EXEC: layer is being loaded to memory
  3. WAIT_FOR_OFF: layer is being offloaded to disk
  4. SWAPPED: layer is on disk

Note that layers in states 1, 2, and 3 consume memory (only state 4 lives entirely on disk). The proposed design tries to keep the maximum number of layers in state 1 and the minimum number of layers in states 2 and 3 (this depends on the execution vs. disk speed), which ensures maximum efficiency in the utilization of memory.
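For illustration, the states and their steady-state transitions could be represented as below; the enum and helper are assumptions, not the actual nntrainer implementation:

```cpp
// Sketch of the proposed layer states. Names follow the list above; the
// transition helper only encodes the steady-state cycle.
enum class LayerState {
  READY_FOR_EXEC, /**< in memory, ready for execution */
  WAIT_FOR_EXEC,  /**< being loaded from disk into memory */
  WAIT_FOR_OFF,   /**< being offloaded from memory to disk */
  SWAPPED         /**< resident on disk only */
};

/** Steady-state cycle: SWAPPED -> WAIT_FOR_EXEC (onload starts)
 *  -> READY_FOR_EXEC (onload done) -> WAIT_FOR_OFF (after execution)
 *  -> SWAPPED (offload done). */
LayerState next_state(LayerState s) {
  switch (s) {
  case LayerState::SWAPPED:        return LayerState::WAIT_FOR_EXEC;
  case LayerState::WAIT_FOR_EXEC:  return LayerState::READY_FOR_EXEC;
  case LayerState::READY_FOR_EXEC: return LayerState::WAIT_FOR_OFF;
  case LayerState::WAIT_FOR_OFF:   return LayerState::SWAPPED;
  }
  return s;
}
```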

kparichay · Sep 30 '21 01:09