
Hardware requirements for GLM-chinese-10B

Open shaomai00 opened this issue 2 years ago • 9 comments

I have a 4 × V100 server (4 × 32 GB), but hit OOM when I tried to finetune the GLM-chinese-10B model. What are the minimal hardware requirements?

shaomai00 avatar Nov 21 '22 09:11 shaomai00

For finetuning, the optimizer states consume a lot of memory. You can enable ZeRO-Offload (https://www.deepspeed.ai/tutorials/zero-offload/) to offload the optimizer states to CPU memory. By default, we already enable this by setting "cpu_offload": true in config_tasks/config_blocklm_10B.json. Can you check whether the config file you are using also enables it? Without CPU offload, finetuning requires at least 16 V100 GPUs.
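Why the optimizer states dominate: with mixed-precision Adam, training keeps roughly 16 bytes of state per parameter, of which the 12 fp32 bytes (master weights, momentum, variance) are what ZeRO-Offload moves to CPU RAM. A rough sketch of that arithmetic (function name is hypothetical; activation memory is ignored):

```python
# Back-of-envelope memory math for finetuning a 10B-parameter model with
# mixed-precision Adam. Byte counts follow the ZeRO paper's accounting;
# this is an illustrative sketch, not the repo's actual measurement.

def training_state_gib(n_params, bytes_per_param=16):
    """2 (fp16 weights) + 2 (fp16 grads) + 4 (fp32 master weights)
    + 4 (fp32 momentum) + 4 (fp32 variance) = 16 bytes/param."""
    return n_params * bytes_per_param / 1024**3

total_gib = training_state_gib(10e9)         # ~149 GiB of weights + states
cpu_side_gib = training_state_gib(10e9, 12)  # ~112 GiB ZeRO-Offload can move to CPU
print(f"total ~{total_gib:.0f} GiB, offloadable ~{cpu_side_gib:.0f} GiB")
```

This is why 4 × 32 GB GPUs OOM without offload, and why a large CPU RAM pool (several hundred GB, as reported later in this thread) is needed with it.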

duzx16 avatar Nov 22 '22 07:11 duzx16


157 GB RAM, 112 GB swap, 4 × V100 GPUs (4 × 32 GB)

To finetune the GLM-chinese-10B model on this hardware, what ZeRO stage-3 configuration should be used? Is the following configuration feasible?

"stage": 3,
    "contiguous_gradients": false,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persistence_threshold": 1e5,
    "reduce_bucket_size": 5e7,
    "sub_group_size": 1e9,
    "offload_optimizer": {
      "device": "cpu"
    },
    "offload_param": {
      "device": "cpu"
    }

Ant0082 avatar Jan 04 '23 10:01 Ant0082


In that case, I think either the CPU memory or the GPU memory is not enough to accommodate the optimizer state.

duzx16 avatar Jan 04 '23 12:01 duzx16

A question: if "cpu_offload": true is set, roughly how much GPU memory is needed? Can it run on 8 × 16 GB V100s?

jiangix-paper avatar Feb 13 '23 03:02 jiangix-paper


With stage=2 and an extra 300 GB of RAM, 8 × 16 GB should be able to run.

stage=2, 300 GB RAM, 4 × 40 GB GPUs (~33 GB actually used) runs fine.
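The stage=2 + offload setup suggested above corresponds roughly to a DeepSpeed fragment like the following (a sketch assuming the standard `zero_optimization` keys; note that older DeepSpeed versions used a flat "cpu_offload": true flag instead of the "offload_optimizer" block):

```json
"zero_optimization": {
  "stage": 2,
  "contiguous_gradients": false,
  "offload_optimizer": {
    "device": "cpu"
  }
}
```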

Ant0082 avatar Feb 14 '23 02:02 Ant0082


With stage=2, 350 GB of RAM, and cpu_offload set to true on 8 × 16 GB, it won't run for me; GPU memory isn't quite enough.

eehover avatar Apr 02 '23 08:04 eehover


ZeRO-2 + cpu_offload=True + batch=1 + fp16, with 4 × V100 (32 GB) + 5 × 3090 (24 GB) = 248 GB of GPU memory: I still hit OOM. Do you have any test experience/results with mixed GPU types, or with multiple 3090s? Looking forward to your reply.

pilipala818 avatar Apr 04 '23 09:04 pilipala818

How much CPU RAM do you have?

Ant0082 avatar Apr 04 '23 09:04 Ant0082


ZeRO-2 + cpu_offload=True + batch=1 + fp16, with 2 × V100 (32 GB) + 4 × 3090 (24 GB) = 160 GB of GPU memory.

RAM before launch: total 376 GB, used 21 GB, free 214 GB.

During "building GLM model ...", RAM usage rises from 21 GB to 97 GB; after "DeepSpeed is enabled." it falls back to 26 GB.

Right after the "CPU Offload: True" message appears, OOM occurs.

pilipala818 avatar Apr 06 '23 02:04 pilipala818