[rank4]: Input should be a valid integer, got a number with a fractional part [type=int_from_float, input_value=15099494.4, input_type=float]
Using the officially provided scripts and dataset, I first ran python pre_tokenize_glm4.py and then python sort_and_group.py --group_size 8 --train_file /home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/datasets, which produced attention_masks_pack.json, inputs_pack.npy, and related files. When I run the training script ./glm4_longwriter.sh, I hit a ValidationError related to the DeepSpeedZeroConfig: the input type for stage3_prefetch_bucket_size is invalid, an integer is expected but a float was received.
Training log:
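(Side note: the 15099494.4 in the error seems to come from the HF Trainer resolving "auto" as 0.9 * hidden_size * hidden_size, which for hidden_size=4096 is exactly 15099494.4. A minimal sketch of coercing that value to an int before DeepSpeed's pydantic validation runs; the stage3.json filename and hidden_size are assumptions on my part:)

```python
# Sketch only: force stage3_prefetch_bucket_size to a plain integer so that newer
# DeepSpeed versions (which validate it as an int via pydantic) accept the config.
import json

CONFIG_PATH = "stage3.json"   # assumed filename of the ZeRO-3 config
HIDDEN_SIZE = 4096            # GLM-4-9B hidden size (assumption)

with open(CONFIG_PATH) as f:
    ds_config = json.load(f)

zero_cfg = ds_config["zero_optimization"]
if zero_cfg.get("stage3_prefetch_bucket_size") == "auto":
    # "auto" resolves to 0.9 * hidden_size * hidden_size = 15099494.4; truncate to int.
    zero_cfg["stage3_prefetch_bucket_size"] = int(0.9 * HIDDEN_SIZE * HIDDEN_SIZE)

with open(CONFIG_PATH, "w") as f:
    json.dump(ds_config, f, indent=2)
```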
[2024-08-26 09:58:48,719] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-08-26 09:58:49,793] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 09:58:50,631] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:50,737] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:50,784] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:50,799] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:51,320] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 09:58:52,754] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:52,859] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:53,039] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:58:53,301] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 09:59:10,505] [INFO] [partition_parameters.py:345:exit] finished initializing model - num_params = 283, num_elems = 9.40B
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.15s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.18s/it]
loading data...
loading data...
loading data...
loading data...
loading data...
loading data...
loading data...
loading data...
finish loading data
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.158402919769287 seconds
[rank4]: Traceback (most recent call last):
[rank4]: File "/home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/main.py", line 130, in
Also, if I change "stage3_prefetch_bucket_size": "auto" in stage3.json to "stage3_prefetch_bucket_size": 15099494, running the script produces the following error:
[2024-08-26 10:00:37,155] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,222] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,235] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,236] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,301] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,331] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,358] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:37,386] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-26 10:00:38,665] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,716] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,783] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,791] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,810] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,868] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,891] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:38,891] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-08-26 10:00:39,001] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-26 10:00:40,846] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:40,934] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,119] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,127] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,138] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,236] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,240] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:00:41,249] [INFO] [config.py:733:init] Config mesh_device None world_size = 8
[2024-08-26 10:01:00,375] [INFO] [partition_parameters.py:345:exit] finished initializing model - num_params = 283, num_elems = 9.40B
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.19s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:12<00:00, 1.21s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.19s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.19s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.19s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.19s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:12<00:00, 1.23s/it]
loading data...
loading data...
loading data...
loading data...
loading data...
loading data...
loading data...
loading data...
finish loading data
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.942378044128418 seconds
finish loading data
finish loading data
finish loading data
finish loading data
finish loading data
finish loading data
finish loading data
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.7372360229492188 seconds
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.803518056869507 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.814899444580078 seconds
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.760847568511963 seconds
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/hnjj/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/hnjj/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.765498161315918 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.718514919281006 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.747807502746582 seconds
Parameter Offload: Total persistent parameters: 516096 in 121 params
wandb: W&B API key is configured. Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.17.7
wandb: Run data is saved locally in /home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/wandb/run-20240826_100248-p37tgoc6
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run glm4_longwriter_szf
wandb: ⭐️ View project at https://wandb.ai/beijingdaxue/huggingface
wandb: 🚀 View run at https://wandb.ai/beijingdaxue/huggingface/runs/p37tgoc6
0%| | 0/2752 [00:00<?, ?it/s][rank7]: Traceback (most recent call last):
[rank7]: File "/home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/main.py", line 130, in
I also ran into this:
  File "/root/miniconda3/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1855, in forward
    loss = self.module(*inputs, **kwargs)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 994, in forward
    transformer_outputs = self.transformer(
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 882, in forward
    full_attention_mask = self.get_masks(input_ids, past_key_values, padding_mask=attention_mask)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 784, in get_masks
    full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1)
RuntimeError: The size of tensor a (32768) must match the size of tensor b (6) at non-singleton dimension 1
Yes, I'm stuck at this step as well; my error is currently the same as yours.
(T_T)
The GLM-4-9B training code we currently provide requires a transformers==4.33.0 environment; newer transformers versions may cause errors. To support packing training, please replace the model's original modeling_chatglm.py with the modeling_chatglm.py provided under patch/.
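A rough sketch of the replacement step (the model directory path below is just an example, adjust it to your setup):

```python
# Sketch only: overwrite the modeling file shipped with the weights with the
# patched version from this repo. The model directory path is a placeholder.
import shutil

PATCHED_FILE = "patch/modeling_chatglm.py"   # patched file provided under patch/
MODEL_DIR = "/path/to/glm-4-9b"              # your local model directory (placeholder)

shutil.copy2(PATCHED_FILE, f"{MODEL_DIR}/modeling_chatglm.py")
print("replaced", f"{MODEL_DIR}/modeling_chatglm.py")
```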
I have now switched to 4.33.0 and replaced modeling_chatglm.py, but I get the following error:
[rank7]: Traceback (most recent call last):
[rank7]: File "/home/hnjj/diskdata/yuanshi/media/szf/llm/glm_longwrite/LongWriter/train/main.py", line 130, in
It looks like the replacement did not actually take effect. The modeling_chatglm.py we use for training does not contain this line: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 416, in __init__ [rank7]: self.core_attention = CORE_ATTENTION_CLASSES[config._attn_implementation](config, self.layer_number). That line only exists in the original code from the Hugging Face Hub.
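If you want to confirm which file is actually being picked up, a quick check like the sketch below (paths are examples) can tell you whether the original Hub version is still sitting in the Hugging Face dynamic-module cache:

```python
# Sketch only: look for the marker line that exists only in the original Hub
# version of modeling_chatglm.py, in both the local model dir and the HF
# dynamic-module cache. Paths are placeholders.
from pathlib import Path

candidates = [
    Path("/path/to/glm-4-9b/modeling_chatglm.py"),  # local model dir (placeholder)
    *Path.home().glob(".cache/huggingface/modules/transformers_modules/**/modeling_chatglm.py"),
]

for path in candidates:
    if path.is_file():
        is_original = "CORE_ATTENTION_CLASSES" in path.read_text()
        print(path, "-> original Hub file" if is_original else "-> patched file")
```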
Does training support glm-4-9b-chat, as opposed to glm-4-9b?
We recommend starting from the glm-4-9b (base) model and doing mixed training (general SFT data + LongWriter-6k data). Training directly from glm-4-9b-chat gives significantly worse results.
It looks like the replacement did not actually take effect. The modeling_chatglm.py we use for training does not contain this line: File "/home/hnjj/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 416, in __init__ [rank7]: self.core_attention = CORE_ATTENTION_CLASSES[config._attn_implementation](config, self.layer_number). That line only exists in the original code from the Hugging Face Hub.
I tried it and that is indeed the case: even after replacing the original file, running the training script still uses the original modeling_chatglm.py.
You need to pass trust_remote_code=True when loading the model.
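For example, roughly (the local model path is a placeholder):

```python
# Sketch only: loading with trust_remote_code=True so that the modeling_chatglm.py
# next to the weights (the patched one) is used. The path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/path/to/glm-4-9b"  # local directory containing the patched modeling_chatglm.py

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True)
```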
Traceback (most recent call last):
File "/gemini/code/train/main.py", line 130, in
I switched to the glm-4-9b model and replaced modeling_chatglm.py as well, but now I get a new error.
@sunzhufeng12345 @badarrrr Please check whether the FAQ in our README resolves the issues you ran into. Sorry for keeping you waiting.