Support memory swap
This issue proposes memory swap to reduce memory usage. Unused tensor data is swapped out to external storage and swapped back in when the data is needed. It does not impose a fixed cap on memory usage; instead, only the memory that is actually required is kept resident, based on pre-calculation.
Swap-out and swap-in points can be pre-defined using the execution order, which is already implemented, so the resources needed to choose a victim for swap-out are reduced (a common cache algorithm such as LRU or LFU is unnecessary). The original memory pool already optimizes memory usage based on the execution order, and the plan does not change while the training phase runs. A new cache pool is introduced, inherited from the memory pool, to reuse this optimized memory information. All allocated memory is linked and managed by the cache pool, and an allocated memory location can be reused for a new swap-in allocation.
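To illustrate the idea, here is a minimal, self-contained C++ sketch of a cache pool that inherits from a plain memory pool and backs each buffer with a slot in a swap file. All class and member names here (`MemoryPool`, `CachePool`, `swapOut`, `swapIn`, the swap-file path) are illustrative assumptions for this sketch, not the actual nntrainer API.

```cpp
#include <cstddef>
#include <cstdio>
#include <map>
#include <vector>

// Plain pool: hands out heap buffers identified by an integer id.
class MemoryPool {
public:
  virtual ~MemoryPool() = default;

  virtual void *allocate(unsigned id, size_t bytes) {
    auto &buf = buffers_[id];
    buf.resize(bytes);
    return buf.data();
  }

  virtual void deallocate(unsigned id) { buffers_.erase(id); }

protected:
  std::map<unsigned, std::vector<char>> buffers_;
};

// Cache pool: reuses the pool's bookkeeping, but every buffer also has a
// slot in a backing file, so inactive data can leave RAM and come back.
class CachePool : public MemoryPool {
public:
  explicit CachePool(const char *swap_path)
    : file_(std::fopen(swap_path, "w+b")) {}
  ~CachePool() override {
    if (file_)
      std::fclose(file_);
  }

  // Swap-out: write the buffer to its file slot and release the RAM.
  void swapOut(unsigned id) {
    auto it = buffers_.find(id);
    if (it == buffers_.end() || it->second.empty())
      return;
    Slot &slot = slots_[id];
    slot.size = it->second.size();
    if (slot.offset < 0) { // first eviction: append a new slot
      std::fseek(file_, 0, SEEK_END);
      slot.offset = std::ftell(file_);
    }
    std::fseek(file_, slot.offset, SEEK_SET);
    if (std::fwrite(it->second.data(), 1, slot.size, file_) != slot.size)
      return; // a real implementation would handle the short write
    std::vector<char>().swap(it->second); // actually free the memory
  }

  // Swap-in: make the buffer resident again; reload from the file if it
  // was evicted earlier, otherwise just hand back the live buffer.
  void *swapIn(unsigned id) {
    auto it = buffers_.find(id);
    if (it != buffers_.end() && !it->second.empty())
      return it->second.data(); // already resident
    Slot &slot = slots_[id];
    void *ptr = MemoryPool::allocate(id, slot.size);
    if (slot.offset >= 0) {
      std::fseek(file_, slot.offset, SEEK_SET);
      if (std::fread(ptr, 1, slot.size, file_) != slot.size) {
        // a real implementation would handle the short read
      }
    }
    return ptr;
  }

private:
  struct Slot {
    long offset = -1; // byte offset in the swap file, -1 = never evicted
    size_t size = 0;  // buffer size in bytes
  };
  std::FILE *file_;
  std::map<unsigned, Slot> slots_;
};
```

In the design described above, the cache pool would additionally reuse the execution-order-based plan from the memory pool; the simple file-slot layout here is just the most basic possible backing store.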
For the draft, swap management is applied only to tensors (#1965). The trial test results are as follows:
MODEL | MIN | AVG | MAX |
---|---|---|---|
MNIST(orig) | 17,112K | 17,112K | 17,112K |
MNIST(swap) | 16,114K | 16,201K | 16,308K |
- | - | -5.32% | - |
Resnet(orig) | 231,728K | 231,728K | 231,728K |
Resnet(swap) | 157,048K | 195,842K | 232,268K |
- | - | -15.48% | - |
The results show that:
- Peak memory usage is the same as or similar to the non-swap case.
- The original optimized memory plan touches almost all of the allocated memory at least once.
- Deallocating memory does not return all of the physical memory, due to the kernel's policy.
2nd Revision (#1965 is updated)
This revision uses exact execution orders to obtain the proper timing at which data has to be swapped out. In the 1st version, data was kept alive until its usage was over. Execution orders, however, allow finer-grained timing control: at every execution order, unnecessary data is evicted and only the necessary data is loaded.
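Continuing the sketch above, the per-execution-order policy could look like the loop below. The `usage` map (tensor id → set of execution orders that touch it) is an assumption standing in for the planner's information; it is not nntrainer's real data structure.

```cpp
#include <set>
#include <unordered_map>

// At each execution order: evict everything not used at this step and
// make sure everything that is used is resident. (Sketch only.)
void prepareStep(CachePool &pool,
                 const std::unordered_map<unsigned, std::set<unsigned>> &usage,
                 unsigned exec_order) {
  for (const auto &entry : usage) {
    unsigned tensor_id = entry.first;
    const std::set<unsigned> &orders = entry.second;
    if (orders.count(exec_order) > 0)
      pool.swapIn(tensor_id);  // needed now: load it
    else
      pool.swapOut(tensor_id); // not needed at this order: evict it
  }
  // ... run the layer scheduled at exec_order ...
}
```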
Compared to the 1st version, swap is now applied to both tensors and weights. Peak memory is reduced significantly, but the weights had no effect on peak memory; the detailed reason needs to be investigated. Detailed memory usage is presented below:
MODEL | MIN | AVG | MAX | AVG(tensor+weight) | MAX(tensor+weight) |
---|---|---|---|---|---|
MNIST(orig) | 16,060K | 16,060K | 16,060K | 1,609K | 1,609K |
MNIST(swap) | 13,828K | 14,107K | 15,372K | 214K | 920K |
- | -13.8% | -12.1% | -4.2% | -86.6% | -42.8% |
Resnet(orig) | 230,648K | 230,648K | 230,648K | 201,589K | 201,589K |
Resnet(swap) | 15,928K | 35,902K | 189,632K | 8,648K | 175,392K |
- | -93.0% | -84.4% | -17.7% | -95.7% | -12.9% |
3rd Revision (#1965, #1987 updated)
Applied some optimizations:
- flush initialized memory (see the sketch after these lists)
- optimize flush timing
- remove unnecessary logs
Fixed some bugs:
- Fix for supporting multiple training runs
- Fix for flush timing
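One possible reading of "flush initialized memory" (this is my interpretation, not taken from the patch): once a weight has been filled by its initializer, it can be written to the swap file immediately, so the initialized data does not stay resident until its first use. Reusing the hypothetical `CachePool` sketch from above:

```cpp
#include <algorithm>

// Hypothetical helper: initialize a weight buffer, then flush it to the
// swap file right away so the RAM is released before training starts.
void initializeAndFlush(CachePool &pool, unsigned weight_id, size_t bytes) {
  float *data = static_cast<float *>(pool.allocate(weight_id, bytes));
  std::fill(data, data + bytes / sizeof(float), 0.01f); // e.g. constant init
  pool.swapOut(weight_id); // evict immediately; swapIn() reloads it on first use
}
```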
MODEL | MIN | AVG | MAX | AVG(tensor+weight) | MAX(tensor+weight) |
---|---|---|---|---|---|
MNIST(orig) | 16,852K | 16,852K | 16,852K | 1,707K | 1,707K |
MNIST(swap) | 15,548K | 15,620K | 15,692K | 221K | 942K |
- | -7.7% | -7.3% | -6.8% | -87.0% | -44.8% |
Resnet(orig) | 231,800K | 231,800K | 231,800K | 206,427K | 206,427K |
Resnet(swap) | 20,512K | 30,055K | 69,372K | 3,103K | 38,576K |
- | -91.1% | -87.0% | -70.0% | -98.4% | -81.3% |
VGG16(orig) | 320,524K | 387,881K | 389,524K | - | - |
VGG16(swap) | 16,976K | 53,323K | 119,560K | - | - |
- | -94.7% | -86.2% | -69.3% | - | - |
This is great work! Using this swapping, we can save much more memory now. We can train a Resnet-like model under 100 MB of memory!!!
I'll revisit this issue later with preload (#2034) and disk I/O performance (issue not yet opened).