ColossalAI
[BUG]: Small memory saving on NVMe
🐛 Describe the bug
Hello, I am using NVMe to try the 66B OPT model with ZeRO-3 offload in CPU mode, but it only seems to save about 10% of CPU memory. It takes about 1.06 TB of CPU memory with NVMe and 1.17 TB without it. I would appreciate it if anyone has an idea about this.
Here is the code snippet.
optimizer = GeminiAdamOptimizer(model,
                                lr=args.learning_rate,
                                initial_scale=2**14,
                                gpu_margin_mem_ratio=0.0,
                                nvme_offload_fraction=1.0,
                                nvme_offload_dir='./nvme')
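For reference, one simple way to quantify the host-memory difference between the two runs is to sample the process resident set size around optimizer construction. The sketch below is only an illustration: it uses psutil and a hypothetical host_rss_gb helper, neither of which appears in the original report, and it assumes a single-process, single-GPU run.

import psutil

def host_rss_gb() -> float:
    # Resident set size of the current process, in GiB.
    return psutil.Process().memory_info().rss / 1024**3

print(f"host RSS before optimizer: {host_rss_gb():.1f} GiB")
# ... build the model and GeminiAdamOptimizer exactly as in the snippet above ...
print(f"host RSS after optimizer:  {host_rss_gb():.1f} GiB")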
Environment
#### Installation Report ####
------------ Environment ------------
Colossal-AI version: 0.2.5
PyTorch version: 1.12.1
CUDA version: 11.3
CUDA version required by PyTorch: 11.3
Note:
1. The table above checks the versions of the libraries/tools in the current environment
2. If the CUDA version is N/A, you can set the CUDA_HOME environment variable to locate it
------------ CUDA Extensions AOT Compilation ------------
Found AOT CUDA Extension: ✓
PyTorch version used for AOT compilation: N/A
CUDA version used for AOT compilation: N/A
Note:
1. AOT (ahead-of-time) compilation of the CUDA kernels occurs during installation when the environment variable CUDA_EXT=1 is set
2. If AOT compilation is not enabled, stay calm as the CUDA kernels can still be built during runtime
------------ Compatibility ------------
PyTorch version match: N/A
System and PyTorch CUDA version match: ✓
System and Colossal-AI CUDA version match: N/A
If parameters are mostly kept in main memory, this mode is primarily aimed at minimizing GPU memory usage. Could you please benchmark the GPU memory savings? If you'd like, we may include your results as part of our benchmarking outcomes, which will be made public.
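As a sketch of what such a GPU-memory benchmark could look like (the helper name and the single-GPU assumption are mine, not from the thread), PyTorch's built-in peak-memory counter can be reset before a training step and read afterwards:

import torch

def report_peak_gpu_memory(tag: str) -> None:
    # Peak GPU memory allocated by PyTorch since the last reset, in GiB.
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] peak GPU memory allocated: {peak_gib:.2f} GiB")

torch.cuda.reset_peak_memory_stats()
# ... run one forward/backward/optimizer step here ...
report_peak_gpu_memory("opt-66b, NVMe offload on")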
Thanks @JThh for the reply. We do not currently have GPU memory savings data, but we can run more tests for that. Do you have any ideas regarding the CPU memory savings? Can we have an option to offload more parameters to the SSD via NVMe? I saw that DeepSpeed is able to save more CPU memory. Thanks again!
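For comparison only: the DeepSpeed behaviour mentioned above comes from ZeRO-Infinity, where the config can direct both optimizer states and parameters to NVMe. The sketch below is a minimal, illustrative config; the NVMe path and batch size are placeholders and are not taken from this thread.

# Illustrative ZeRO-3 config with both parameters and optimizer states on NVMe.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}
# The dict would then be passed to deepspeed.initialize(model=model, config=ds_config, ...).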
Thanks @MikeChenfu for your follow-up questions. This benchmark might give some indication of how useful NVMe can be. But regarding the maximum savings we can reach, would @oahzxl have any good ideas on this?
Sorry, I don't have any clue about it.
How many nodes did you use?
@ver217 I trained OPT-66B on a single A100.
Hi @MikeChenfu, 66B parameters are far more than a single A100's 80 GB of memory can hold. This issue was closed due to inactivity. Thanks.
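A back-of-the-envelope estimate (using the standard mixed-precision Adam accounting of roughly 2 bytes of fp16 weights, 2 bytes of fp16 gradients and 12 bytes of fp32 optimizer states per parameter) shows both why 66B parameters cannot fit on a single 80 GB A100 and why the host-memory footprint lands near the ~1 TB reported above:

# Rough memory estimate for training a 66B-parameter model with mixed-precision Adam.
params = 66e9
fp16_weights_gb = params * 2 / 1e9   # ~132 GB
fp16_grads_gb   = params * 2 / 1e9   # ~132 GB
adam_states_gb  = params * 12 / 1e9  # fp32 master weights + momentum + variance, ~792 GB
total_gb = fp16_weights_gb + fp16_grads_gb + adam_states_gb
print(f"estimated total: {total_gb:.0f} GB")  # ~1056 GB, in the same ballpark as the figures above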