GLM
Hardware requirements for GLM-chinese-10B
I have a 4 × V100 (4 × 32 GB) server, but I hit OOM when I tried to finetune the GLM-chinese-10B model. What are the minimal hardware requirements?
For finetuning, the optimizer states consume a lot of memory. You can enable ZeRO-Offload (https://www.deepspeed.ai/tutorials/zero-offload/) to offload the optimizer states to CPU memory. By default, we already enable that by setting "cpu_offload": true
in config_tasks/config_blocklm_10B.json. Can you check whether the config file you are using also enables it?
Without CPU offload, finetuning requires at least 16 V100 GPUs.
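For reference, the legacy ZeRO-2 `cpu_offload` flag lives inside the `zero_optimization` section of the DeepSpeed config. The following is only a sketch of what the relevant part of config_tasks/config_blocklm_10B.json is expected to look like; the actual file may contain additional keys.

```json
{
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": false,
    "overlap_comm": true
  }
}
```

In newer DeepSpeed releases the boolean `cpu_offload` flag is superseded by an `offload_optimizer` block with `"device": "cpu"`.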
157 GB memory, 112 GB swap, 4 × V100 GPUs (32 GB × 4).
To fine-tune the GLM-chinese-10B model with this hardware, what configuration should be used for stage 3? Is the following configuration feasible?
```json
"stage": 3,
"contiguous_gradients": false,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 1e7,
"stage3_param_persistence_threshold": 1e5,
"reduce_bucket_size": 5e7,
"sub_group_size": 1e9,
"offload_optimizer": {
  "device": "cpu"
},
"offload_param": {
  "device": "cpu"
}
```
In that case, I think either the CPU memory or the GPU memory is not enough to accommodate the optimizer states.
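The arithmetic behind that answer can be sketched. With mixed-precision Adam, the usual ZeRO accounting is roughly 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state (master weights + momentum + variance) per parameter; these are standard estimates, not measurements of GLM itself. For a 10B-parameter model the offloaded optimizer state alone is close to the 157 GB of host RAM reported above:

```python
# Back-of-envelope memory accounting for fine-tuning a 10B-parameter
# model with mixed-precision Adam. Per-parameter byte counts are the
# standard ZeRO estimates, not measurements of GLM itself.
GB = 1024 ** 3
n_params = 10e9

fp16_weights = n_params * 2 / GB   # model weights kept on GPU
fp16_grads = n_params * 2 / GB     # gradients
adam_states = n_params * 12 / GB   # fp32 master weights + momentum + variance

print(f"fp16 weights:     {fp16_weights:5.1f} GB")
print(f"fp16 gradients:   {fp16_grads:5.1f} GB")
print(f"optimizer states: {adam_states:5.1f} GB")  # what ZeRO-Offload moves to CPU
```

With roughly 112 GB of optimizer state offloaded to the host, 157 GB of RAM plus swap is already tight once activations, buffers, and the framework's own working copies are counted, which is consistent with the OOM reported above.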
May I ask: with "cpu_offload": true set, roughly how much GPU memory is needed? Can it run on 8 × 16 GB V100s?
Set stage=2 and add 300 GB of CPU memory; 8 × 16 GB should be able to run it.
stage=2, 300 GB RAM, 4 × 40 GB GPUs (~33 GB actually used) runs fine.
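A minimal sketch of a stage-2 DeepSpeed config matching that recipe, written with the newer `offload_optimizer` syntax; the key names follow current DeepSpeed documentation, and whether GLM's launch scripts accept this form (rather than the legacy "cpu_offload" flag) is an assumption.

```json
{
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": false,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```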
stage=2, 350 GB RAM, cpu_offload set to true, 8 × 16 GB: it doesn't run for me, the GPU memory isn't quite enough.
ZeRO-2 + cpu_offload=True + batch=1 + fp16, with 4 × V100 (32 GB) + 5 × 3090 (24 GB) = 248 GB total GPU memory, still hits OOM. Do you have any test experience or results with mixed GPU types, or with multiple 3090s? Looking forward to your reply.
How much CPU memory do you have?
ZeRO-2 + cpu_offload=True + batch=1 + fp16, with V100 (32 GB) × 2 + 3090 (24 GB) × 4 = 160 GB total GPU memory.
Memory before launch: total 376 GB, used 21 GB, free 214 GB.
During "building GLM model ...", RAM usage climbed from 21 GB to 97 GB; after "DeepSpeed is enabled." it fell back to 26 GB.
The OOM appears right after the "CPU Offload: True" message.