GLM
Hardware requirements for GLM-chinese-10B
I have a 4 × V100 (4 × 32 GB) server, but I hit OOM when I tried to finetune the GLM-chinese-10B model. What are the minimal hardware requirements?
For finetuning, the optimizer states consume a lot of memory. You can enable ZeRO-Offload (https://www.deepspeed.ai/tutorials/zero-offload/) to offload the optimizer states to CPU memory. By default, we already enable that by setting "cpu_offload": true
in config_tasks/config_blocklm_10B.json. Can you check whether the config file you are using also enables it?
Without CPU offload, finetuning requires at least 16 V100 GPUs.
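For reference, the legacy ZeRO-2 `cpu_offload` flag lives inside the `zero_optimization` section of the DeepSpeed config. The following is only a sketch of what the relevant part of config_tasks/config_blocklm_10B.json is expected to look like; the actual file may contain additional keys.

```json
{
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": false,
    "overlap_comm": true
  }
}
```

In newer DeepSpeed releases the boolean `cpu_offload` flag is superseded by an `offload_optimizer` block with `"device": "cpu"`.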
157 GB memory, 112 GB swap, 4 × V100 GPUs (32 GB × 4).
To fine-tune the GLM-chinese-10B model with this hardware, what configuration should be used for stage 3? Is the following configuration feasible?
```json
"stage": 3,
"contiguous_gradients": false,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 1e7,
"stage3_param_persistence_threshold": 1e5,
"reduce_bucket_size": 5e7,
"sub_group_size": 1e9,
"offload_optimizer": {
  "device": "cpu"
},
"offload_param": {
  "device": "cpu"
}
```
In that case, I think either the CPU memory or the GPU memory is not enough to accommodate the optimizer states.
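The arithmetic behind that answer can be sketched. With mixed-precision Adam, the usual ZeRO accounting is roughly 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state (master weights + momentum + variance) per parameter; these are standard estimates, not measurements of GLM itself. For a 10B-parameter model the offloaded optimizer state alone is close to the 157 GB of host RAM reported above:

```python
# Back-of-envelope memory accounting for fine-tuning a 10B-parameter
# model with mixed-precision Adam. Per-parameter byte counts are the
# standard ZeRO estimates, not measurements of GLM itself.
GB = 1024 ** 3
n_params = 10e9

fp16_weights = n_params * 2 / GB   # model weights kept on GPU
fp16_grads = n_params * 2 / GB     # gradients
adam_states = n_params * 12 / GB   # fp32 master weights + momentum + variance

print(f"fp16 weights:     {fp16_weights:5.1f} GB")
print(f"fp16 gradients:   {fp16_grads:5.1f} GB")
print(f"optimizer states: {adam_states:5.1f} GB")  # what ZeRO-Offload moves to CPU
```

With roughly 112 GB of optimizer state offloaded to the host, 157 GB of RAM plus swap is already tight once activations, buffers, and the framework's own working copies are counted, which is consistent with the OOM reported above.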
May I ask: with "cpu_offload": true set, roughly how much GPU memory is needed? Can it run on 8 × 16 GB V100s?
Set stage=2 and add 300 GB of CPU memory; 8 × 16 GB should be able to run it.
stage=2, 300 GB RAM, 4 × 40 GB GPUs (~33 GB actually used) runs fine.
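A minimal sketch of a stage-2 DeepSpeed config matching that recipe, written with the newer `offload_optimizer` syntax; the key names follow current DeepSpeed documentation, and whether GLM's launch scripts accept this form (rather than the legacy "cpu_offload" flag) is an assumption.

```json
{
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": false,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```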
stage=2, 350 GB RAM, cpu_offload set to true, 8 × 16 GB: it doesn't run for me, the GPU memory isn't quite enough.
ZeRO-2 + cpu_offload=True + batch=1 + fp16, with 4 × V100 (32 GB) + 5 × 3090 (24 GB) = 248 GB total GPU memory, still hits OOM. Do you have any test experience or results with mixed GPU types, or with multiple 3090s? Looking forward to your reply.
How much CPU memory do you have?
ZeRO-2 + cpu_offload=True + batch=1 + fp16, with V100 (32 GB) × 2 + 3090 (24 GB) × 4 = 160 GB total GPU memory.
Memory before launch: total 376 GB, used 21 GB, free 214 GB.
During "building GLM model ...", RAM usage climbed from 21 GB to 97 GB; after "DeepSpeed is enabled." it fell back to 26 GB.
The OOM appears right after the "CPU Offload: True" message.