
[BUG]: Small memory saving on NVMe

MikeChenfu opened this issue 1 year ago · 6 comments

🐛 Describe the bug

Hello, I am using NVMe to try the 66B OPT model with ZeRO-3 CPU offload mode, but it seems to save only about 10% of CPU memory: it takes about 1.06 TB of CPU memory with NVMe and 1.17 TB without NVMe. I'd appreciate it if anyone has an idea about this.

Here is the code snippet.

# GeminiAdamOptimizer comes from ColossalAI; the import path may vary by version.
optimizer = GeminiAdamOptimizer(
    model,
    lr=args.learning_rate,
    initial_scale=2**14,        # initial loss scale for mixed-precision training
    gpu_margin_mem_ratio=0.0,   # keep no optimizer states in spare GPU memory
    nvme_offload_fraction=1.0,  # offload all optimizer states to NVMe
    nvme_offload_dir='./nvme')
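
For reference, here is a minimal sketch (not from the original setup) of how the CPU-memory numbers above could be measured, assuming psutil is installed; build_model_and_optimizer is a hypothetical placeholder for the model and optimizer construction shown above:

import psutil

def log_host_memory(tag):
    # Report this process's resident set size and the total system memory in use.
    rss = psutil.Process().memory_info().rss
    used = psutil.virtual_memory().used
    print(f"[{tag}] process RSS: {rss / 1e9:.2f} GB, system used: {used / 1e9:.2f} GB")

log_host_memory("before init")
model, optimizer = build_model_and_optimizer()  # hypothetical helper standing in for the setup above
log_host_memory("after init")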

Environment

#### Installation Report ####

------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.12.1
CUDA version: 11.3
CUDA version required by PyTorch: 11.3

Note:
1. The table above checks the versions of the libraries/tools in the current environment
2. If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it

------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A

Note:
1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime

------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A

MikeChenfu avatar Mar 09 '23 02:03 MikeChenfu

If parameters are mostly kept in main memory, this mode is actually aimed at minimizing GPU memory usage. Could you please benchmark the GPU memory savings? If you'd like, we could include your results in our benchmarking results, which will be made public.
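
As a rough sketch (my own suggestion, not an official ColossalAI utility), the peak GPU memory could be measured with PyTorch's built-in counters; train_one_step is a hypothetical stand-in for one training iteration:

import torch

torch.cuda.reset_peak_memory_stats()
train_one_step(model, optimizer, batch)  # hypothetical training step
peak = torch.cuda.max_memory_allocated()
print(f"peak GPU memory allocated: {peak / 1e9:.2f} GB")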

JThh avatar Mar 10 '23 06:03 JThh

Thanks @JThh for the reply. Currently we do not have GPU memory savings data, but we can run more tests for that. Do you have any ideas regarding the CPU memory saving? Can we have an option to offload more parameters to SSD via NVMe? I saw that DeepSpeed is able to save more CPU memory. Thanks again!
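
For comparison, a rough sketch of the kind of DeepSpeed ZeRO-3 configuration being referred to, which offloads both parameters and optimizer states to NVMe; the path and batch size below are placeholders, not values from this issue:

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    "train_micro_batch_size_per_gpu": 1,
}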

MikeChenfu avatar Mar 10 '23 23:03 MikeChenfu

Thanks @MikeChenfu for your follow-up questions. This benchmark may give an indication of how useful NVMe offloading can be, but as for how much further we can push it, would @oahzxl have any good ideas?

JThh avatar Mar 11 '23 07:03 JThh

Sorry, I don't have any clue about it.

oahzxl avatar Mar 13 '23 08:03 oahzxl

How many nodes did you use?

ver217 avatar Mar 17 '23 03:03 ver217

@ver217 I trained OPT-66B on a single A100.

MikeChenfu avatar Mar 17 '23 05:03 MikeChenfu

Hi @MikeChenfu, OPT-66B is much larger than the 80 GB memory capacity of a single A100. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 27 '23 10:04 binmakeswell