
Support memory swap

Open · jihochu opened this issue · 4 comments

This issue proposes memory swap to reduce memory usage. Unused tensor data is swapped out to external storage, and swapped back in when the data is needed. It does not enforce a fixed maximum memory size; instead, it keeps only the memory that is strictly required, determined by pre-calculation.
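
As a rough illustration of the swap-out/swap-in flow (this is not nntrainer's actual API; every name below is hypothetical), a buffer's data is written to a backing file and its RAM released on swap-out, then read back on demand on swap-in:

```cpp
// Hypothetical sketch only: none of these names come from nntrainer.
// Swap-out writes the buffer to a backing file and frees the RAM;
// swap-in reads it back when the data is needed again.
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

struct SwappableBuffer {
  std::vector<float> data; // in-memory tensor data (empty while swapped out)
  std::string swap_path;   // external storage backing this buffer
  std::size_t count = 0;   // element count, remembered across swaps

  void swapOut() {
    std::ofstream out(swap_path, std::ios::binary);
    out.write(reinterpret_cast<const char *>(data.data()),
              static_cast<std::streamsize>(data.size() * sizeof(float)));
    count = data.size();
    data.clear();
    data.shrink_to_fit(); // actually release the memory
  }

  void swapIn() {
    data.resize(count);
    std::ifstream in(swap_path, std::ios::binary);
    in.read(reinterpret_cast<char *>(data.data()),
            static_cast<std::streamsize>(count * sizeof(float)));
  }
};
```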

Swap-out and swap-in points can be pre-defined using the execution order, which is already implemented, so no resources are spent choosing a victim for swap-out (common cache algorithms such as LRU or LFU are unnecessary). The original memory pool already optimizes memory usage based on execution order, and that plan does not change while the training phases run. A new cache pool is introduced, inherited from the memory pool so it can reuse the optimized memory information. All allocated memory is linked to and managed by the cache pool, and an allocated memory location can be reused for a new swap-in allocation.
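
A minimal sketch of this idea, assuming a design where each cached entry carries the execution-order window computed by the existing planner. The names (CachePool, CacheElem, onExecOrder) are illustrative, not the actual nntrainer classes, and the buffer type reuses SwappableBuffer from the sketch above:

```cpp
// Hypothetical sketch only: CachePool / CacheElem are illustrative names,
// and SwappableBuffer comes from the sketch above. Each entry keeps the
// execution-order window taken from the existing memory plan, so the
// swap-out point is known in advance and no LRU/LFU victim search is needed.
#include <map>
#include <memory>

struct CacheElem {
  std::shared_ptr<SwappableBuffer> buf;
  unsigned int first_order = 0; // execution order of the first use
  unsigned int last_order = 0;  // execution order of the last use
  bool active = false;          // currently resident in memory?
};

class CachePool /* : public MemoryPool in the real design */ {
public:
  // Register a buffer with the execution-order window from the planner.
  void add(unsigned int id, CacheElem elem) { elems[id] = std::move(elem); }

  // Called as training advances through execution orders.
  void onExecOrder(unsigned int order) {
    for (auto &[id, e] : elems) {
      (void)id;
      if (!e.active && order >= e.first_order && order <= e.last_order) {
        e.buf->swapIn();
        e.active = true;
      } else if (e.active && order > e.last_order) {
        e.buf->swapOut();
        e.active = false;
      }
    }
  }

private:
  std::map<unsigned int, CacheElem> elems;
};
```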

For this draft, swap management is applied only to tensor data (#1965). The trial test results are as follows:

| MODEL | MIN | AVG | MAX |
| --- | --- | --- | --- |
| MNIST (orig) | 17,112K | 17,112K | 17,112K |
| MNIST (swap) | 16,114K | 16,201K | 16,308K |
| - | - | -5.32% | - |
| Resnet (orig) | 231,728K | 231,728K | 231,728K |
| Resnet (swap) | 157,048K | 195,842K | 232,268K |
| - | - | -15.48% | - |

The results show that:

  1. The peak memory usage is the same as, or similar to, the non-swap case.
  2. The original optimized memory plan uses almost all of the allocated memory at least once.
  3. Deallocating memory does not return all of the real memory to the system, due to the kernel/allocator policy (see the note after this list).
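
For point 3, a general (non-nntrainer) illustration: with glibc, free() often keeps pages in the allocator's free lists instead of returning them to the kernel, so RSS does not drop by the full freed amount; malloc_trim() can hint glibc to hand free pages back:

```cpp
// Generic glibc illustration, unrelated to the patch itself: freeing heap
// memory does not necessarily shrink RSS, because the allocator may keep
// the pages in its free lists. malloc_trim() asks glibc to return free
// pages to the kernel. Block size is kept below the typical mmap threshold
// (128 KiB) so the allocations come from the main heap arena.
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <malloc.h> // glibc-specific: malloc_trim

int main() {
  constexpr std::size_t kBlocks = 4096;
  constexpr std::size_t kBlockSize = 64 * 1024; // 64 KiB each, 256 MiB total
  static void *blocks[kBlocks];

  for (std::size_t i = 0; i < kBlocks; ++i) {
    blocks[i] = std::malloc(kBlockSize);
    std::memset(blocks[i], 1, kBlockSize); // touch pages so they count in RSS
  }
  for (std::size_t i = 0; i < kBlocks; ++i)
    std::free(blocks[i]);

  // At this point RSS is often still high, depending on allocator policy.
  malloc_trim(0); // hint glibc to return unused heap pages to the kernel
  return 0;
}
```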

jihochu commented on Jul 19 '22


2nd Revision (#1965 is updated). It now uses the exact execution orders to determine the proper timing for swap-out. In the 1st version, data was kept alive until its usage window was over. However, the execution order can be used for finer-grained timing control: at every execution order, unnecessary data is evicted and only the necessary data is loaded.
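
A hedged sketch of this finer-grained policy, using illustrative names only (the per-order plan and the SwappableBuffer type are assumptions carried over from the earlier sketch, not the actual implementation):

```cpp
// Hypothetical sketch only, reusing SwappableBuffer from the earlier sketch:
// every execution order declares exactly which buffer ids it needs; anything
// resident but not in that set is evicted, and only the needed ids are loaded.
#include <map>
#include <set>

using BufferMap = std::map<unsigned int, SwappableBuffer>;
// execution order -> ids of the buffers that order actually touches
using OrderPlan = std::map<unsigned int, std::set<unsigned int>>;

void prepareOrder(BufferMap &bufs, const OrderPlan &plan,
                  std::set<unsigned int> &resident, unsigned int order) {
  const std::set<unsigned int> &needed = plan.at(order);

  // Evict whatever is resident but not needed at this exact order.
  for (auto it = resident.begin(); it != resident.end();) {
    if (needed.count(*it) == 0) {
      bufs.at(*it).swapOut();
      it = resident.erase(it);
    } else {
      ++it;
    }
  }

  // Load only what this order needs.
  for (unsigned int id : needed) {
    if (resident.insert(id).second) // newly resident -> bring data back
      bufs.at(id).swapIn();
  }
}
```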

Compared with the 1st version, swapping is now applied to both tensors and weights. Peak memory is reduced significantly, but swapping the weights had no effect on peak memory; the detailed reason still needs to be investigated. Detailed memory usage is presented below:

| MODEL | MIN | AVG | MAX | AVG (tensor+weight) | MAX (tensor+weight) |
| --- | --- | --- | --- | --- | --- |
| MNIST (orig) | 16,060K | 16,060K | 16,060K | 1,609K | 1,609K |
| MNIST (swap) | 13,828K | 14,107K | 15,372K | 214K | 920K |
| - | -13.8% | -12.1% | -4.2% | -86.6% | -42.8% |
| Resnet (orig) | 230,648K | 230,648K | 230,648K | 201,589K | 201,589K |
| Resnet (swap) | 15,928K | 35,902K | 189,632K | 8,648K | 175,392K |
| - | -93.0% | -84.4% | -17.7% | -95.7% | -12.9% |

jihochu commented on Aug 17 '22

3rd Revision (#1965, #1987 updated)

Applied some optimizations:

  • flush initialized memory (see the sketch after this list)
  • optimize flush timing
  • remove unnecessary logs
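
One possible reading of the flush optimization, purely as an assumption on my part: once swapped-out data has been written to the backing file, the file's page-cache pages can be dropped so the data stops occupying RAM. A Linux/POSIX-specific sketch:

```cpp
// Pure assumption about what "flush" could mean here, not taken from the
// patch: after swapped-out data has been written to the backing file, drop
// the file's page-cache pages so the data no longer counts against RSS.
#include <fcntl.h>  // posix_fadvise, POSIX_FADV_DONTNEED
#include <unistd.h> // fsync

void flushSwapFile(int fd) {
  fsync(fd);                                    // make sure data reaches storage
  posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); // drop cached pages of the file
}
```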

Fixed some bugs:

  • Fix to support running multiple trainings
  • Fix flush timing

| MODEL | MIN | AVG | MAX | AVG (tensor+weight) | MAX (tensor+weight) |
| --- | --- | --- | --- | --- | --- |
| MNIST (orig) | 16,852K | 16,852K | 16,852K | 1,707K | 1,707K |
| MNIST (swap) | 15,548K | 15,620K | 15,692K | 221K | 942K |
| - | -7.7% | -7.3% | -6.8% | -87.0% | -44.8% |
| Resnet (orig) | 231,800K | 231,800K | 231,800K | 206,427K | 206,427K |
| Resnet (swap) | 20,512K | 30,055K | 69,372K | 3,103K | 38,576K |
| - | -91.1% | -87.0% | -70.0% | -98.4% | -81.3% |
| VGG16 (orig) | 320,524K | 387,881K | 389,524K | - | - |
| VGG16 (swap) | 16,976K | 53,323K | 119,560K | - | - |
| - | -94.7% | -86.2% | -69.3% | - | - |

jihochu commented on Aug 25 '22

This is great work! With this swapping we can save much more memory now. We can train a Resnet-like model under 100 MB of memory!

jijoongmoon commented on Aug 25 '22

I'll revisit this issue later together with preload (#2034) and disk I/O performance (issue not yet opened).

jihochu commented on Nov 08 '22