CPM-1-Finetune-Text-Generation
Unable to load the CPM-distill model when fine-tuning it with this project
Since I don't have enough GPU memory, I planned to fine-tune CPM-distill (the officially released distilled version of the CPM-LM (2.6B) model). When using this project, however, the model fails to load with a size-mismatch error. Could you help me figure out the cause?
(liubiao2) kingsoft@k8s-w-10-13-84-7:~/liubiao2/smartWriter/CPM-1-Finetune$ bash scripts/novel/finetune_novel_fp32.sh
using world size: 1 and model-parallel size: 1
> using dynamic loss scaling
> initializing model parallel with size 1
[2021-12-07 10:08:46,970] [INFO] [checkpointing.py:795:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2021-12-07 10:08:46,970] [INFO] [checkpointing.py:234:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 26051 and data parallel seed: 23333
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33461/33461 [00:01<00:00, 19888.74it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10526/10526 [00:00<00:00, 31213.74it/s]
building GPT2 model ...
> number of parameters on model parallel rank 0: 1023544320
50 98
Optimizer = FusedAdam
learning rate decaying linear
DeepSpeed is enabled.
[2021-12-07 10:09:04,694] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.5.8, git-hash=unknown, git-branch=unknown
[2021-12-07 10:09:04,702] [INFO] [logging.py:69:log_dist] [Rank 0] initializing deepspeed groups using mpu
[2021-12-07 10:09:04,702] [INFO] [logging.py:69:log_dist] [Rank 0] Initializing deepspeed groups with model parallel size 1, expert parallel size 1, and data parallel size 1
[2021-12-07 10:09:04,702] [INFO] [logging.py:69:log_dist] [Rank 0] creating expert data parallel process group with ranks: [0]
[2021-12-07 10:09:04,702] [INFO] [logging.py:69:log_dist] [Rank 0] creating expert parallel process group with ranks: [0]
[2021-12-07 10:09:04,707] [INFO] [engine.py:279:__init__] DeepSpeed Flops Profiler Enabled: False
[2021-12-07 10:09:04,707] [INFO] [engine.py:1095:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-12-07 10:09:04,707] [INFO] [engine.py:1100:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-12-07 10:09:04,711] [INFO] [engine.py:1117:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
[2021-12-07 10:09:04,712] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2021-12-07 10:09:04,712] [INFO] [engine.py:808:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-12-07 10:09:04,712] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = <learning_rates.AnnealingLR object at 0x7fa961ee9278>
[2021-12-07 10:09:04,712] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2021-12-07 10:09:04,712] [INFO] [config.py:1059:print] DeepSpeedEngine configuration:
[2021-12-07 10:09:04,712] [INFO] [config.py:1063:print] activation_checkpointing_config {
"partition_activations": true,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2021-12-07 10:09:04,712] [INFO] [config.py:1063:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-12-07 10:09:04,712] [INFO] [config.py:1063:print] amp_enabled .................. False
[2021-12-07 10:09:04,712] [INFO] [config.py:1063:print] amp_params ................... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": null,
"exps_dir": null,
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] bfloat16_enabled ............. False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] checkpoint_tag_validation_enabled True
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] checkpoint_tag_validation_fail False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] communication_data_type ...... None
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] curriculum_enabled ........... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] curriculum_params ............ False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] dataloader_drop_last ......... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] disable_allgather ............ False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] dump_state ................... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] dynamic_loss_scale_args ...... None
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_enabled ........... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_gas_boundary_resolution 1
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_layer_name ........ bert.encoder.layer
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_layer_num ......... 0
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_max_iter .......... 100
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_stability ......... 1e-06
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_tol ............... 0.01
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] eigenvalue_verbose ........... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] elasticity_enabled ........... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] fp16_enabled ................. False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] fp16_master_weights_and_gradients False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] fp16_mixed_quantize .......... False
[2021-12-07 10:09:04,713] [INFO] [config.py:1063:print] global_rank .................. 0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] gradient_accumulation_steps .. 4
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] gradient_clipping ............ 1.0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] gradient_predivide_factor .... 1.0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] initial_dynamic_scale ........ 4294967296
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] loss_scale ................... 0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] memory_breakdown ............. False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] optimizer_legacy_fusion ...... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] optimizer_name ............... None
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] optimizer_params ............. None
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] pld_enabled .................. False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] pld_params ................... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] prescale_gradients ........... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_change_rate ......... 0.001
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_groups .............. 1
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_offset .............. 1000
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_period .............. 1000
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_rounding ............ 0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_start_bits .......... 16
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_target_bits ......... 8
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_training_enabled .... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_type ................ 0
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] quantize_verbose ............. False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] scheduler_name ............... None
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] scheduler_params ............. None
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] sparse_attention ............. None
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] sparse_gradients_enabled ..... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] steps_per_print .............. 100000000
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] tensorboard_enabled .......... False
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-12-07 10:09:04,714] [INFO] [config.py:1063:print] tensorboard_output_path ......
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] train_batch_size ............. 4
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] train_micro_batch_size_per_gpu 1
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] use_quantizer_kernel ......... False
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] wall_clock_breakdown ......... False
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] world_size ................... 1
[2021-12-07 10:09:04,715] [INFO] [config.py:1063:print] zero_allow_untested_optimizer False
[2021-12-07 10:09:04,734] [INFO] [config.py:1063:print] zero_config .................. {
"stage": 0,
"contiguous_gradients": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": false,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+09,
"prefetch_bucket_size": 5.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_fp16_weights_on_model_save": false,
"ignore_unused_parameters": true,
"round_robin_gradients": false,
"legacy_stage1": false
}
[2021-12-07 10:09:04,734] [INFO] [config.py:1063:print] zero_enabled ................. False
[2021-12-07 10:09:04,734] [INFO] [config.py:1063:print] zero_optimization_stage ...... 0
[2021-12-07 10:09:04,734] [INFO] [config.py:1071:print] json = {
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 4,
"steps_per_print": 1.000000e+08,
"gradient_clipping": 1.0,
"activation_checkpointing": {
"partition_activations": true,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": false
}
Using /home/kingsoft/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/kingsoft/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3976883888244629 seconds
[2021-12-07 10:09:06,672] [INFO] [state_dict_factory.py:109:get_merge_state_dicts] mp_rank: 0, ckpt_list: ['/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill/310000/mp_rank_00_model_states.pt', '/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill/310000/mp_rank_01_model_states.pt']
[2021-12-07 10:09:06,780] [INFO] [state_dict_factory.py:322:merge_state_dict] checkpoint version: 0
Traceback (most recent call last):
File "finetune_text_generation.py", line 324, in <module>
main()
File "finetune_text_generation.py", line 208, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/utils.py", line 510, in setup_model_and_optimizer
args.iteration = load_checkpoint(model, optimizer, lr_scheduler, args)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/utils.py", line 281, in load_checkpoint
checkpoint_name, sd = model.load_checkpoint(args.load, iteration, load_module_strict=False, load_optimizer_states=False, load_lr_scheduler_states=False)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 2414, in load_checkpoint
load_module_only=load_module_only)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 2459, in _load_checkpoint
strict=load_module_strict)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 2302, in load_module_state_dict
self.module.load_state_dict(state_dict, strict=strict)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/model/distributed.py", line 90, in load_state_dict
self.module.load_state_dict(state_dict, strict=strict)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for GPT2Model:
size mismatch for word_embeddings.weight: copying a param with shape torch.Size([30000, 768]) from checkpoint, the shape in current model is torch.Size([30000, 2560]).
size mismatch for position_embeddings.weight: copying a param with shape torch.Size([1024, 768]) from checkpoint, the shape in current model is torch.Size([1024, 2560]).
size mismatch for transformer.layers.0.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.0.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.0.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.0.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.0.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.0.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.0.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.1.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.1.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.1.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.1.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.1.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.1.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.2.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.2.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.2.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.2.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.2.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.2.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.3.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.3.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.3.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.3.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.3.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.3.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.4.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.4.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.4.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.4.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.4.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.4.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.5.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.5.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.5.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.5.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.5.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.5.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.6.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.6.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.6.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.6.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.6.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.6.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.6.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.7.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.7.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.7.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.7.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.7.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.7.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.7.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.8.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.8.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.8.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.8.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.8.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.8.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.8.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.9.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.9.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.9.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.9.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.9.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.9.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.9.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.10.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.10.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.10.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.10.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.10.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.10.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.10.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.attention.query_key_value.weight: copying a param with shape torch.Size([2304, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.11.attention.query_key_value.bias: copying a param with shape torch.Size([2304]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.11.attention.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.11.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.11.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.11.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.11.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.11.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.final_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.final_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
Traceback (most recent call last):
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/kingsoft/anaconda3/envs/liubiao2/bin/python3', '-u', 'finetune_text_generation.py', '--local_rank=0', '--do_train', '--do_eval', '--data_dir', './data/novel/preprocessed_id/', '--model-parallel-size', '1', '--num-layers', '12', '--hidden-size', '2560', '--load', '/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--tokenizer-type', 'GPT2BPETokenizer', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--lr', '0.00001', '--warmup', '0.1', '--batch-size', '1', '--deepspeed', '--deepspeed_config', '/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/scripts/novel/../ds_config/ds_finetune_large_fp32.json', '--log-interval', '10', '--eval-interval', '50', '--seed', '23333', '--results_dir', 'results/', '--model_name', 'finetune-novel', '--epoch', '10', '--checkpoint-activations', '--deepspeed-activation-checkpointing']' returned non-zero exit status 1.
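(For reference: every mismatch above says the checkpoint tensors have hidden size 768 while the freshly built model expects 2560, i.e. the script is constructing a 2.6B-style model around a distilled 768-hidden checkpoint. A minimal sketch to confirm what is actually stored in the two shards, assuming the usual DeepSpeed layout where the model weights sit under the 'module' key; adjust the key if your files differ:)

import torch

ckpt_dir = "/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill/310000"
for shard in ("mp_rank_00_model_states.pt", "mp_rank_01_model_states.pt"):
    sd = torch.load(f"{ckpt_dir}/{shard}", map_location="cpu")
    module = sd.get("module", sd)  # fall back to the raw dict if there is no 'module' key
    print(shard)
    for name in ("word_embeddings.weight", "position_embeddings.weight"):
        if name in module:
            print(" ", name, tuple(module[name].shape))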
You can refer to https://github.com/TsinghuaAI/CPM-1-Distill to adjust the configuration.
Hi, the error still occurs. Could you point out where the configuration is wrong? Thanks.
finetune_novel_fp32.sh configuration:
#!/bin/bash
DATA_DIR="./data/novel/preprocessed_id/"
CHECKPOINT_PATH="/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill"
RESULTS_DIR="results/"
MODEL_NAME="finetune-novel"
TOKENIZER_PATH="bpe_3w_new/"
MPSIZE=1
NLAYERS=12
NHIDDEN=768
NATT=12
MAXSEQLEN=1024
CUR_PATH=$(realpath $0)
CUR_DIR=$(dirname ${CUR_PATH})
DS_CONFIG="${CUR_DIR}/../ds_config/ds_finetune_large_fp32.json"
python3 -m torch.distributed.launch --master_port ${1-1122} --nproc_per_node 1 finetune_text_generation.py \
--do_train \
--do_eval \
--data_dir ${DATA_DIR} \
--model-parallel-size ${MPSIZE} \
--num-layers ${NLAYERS} \
--hidden-size ${NHIDDEN} \
--load ${CHECKPOINT_PATH} \
--num-attention-heads ${NATT} \
--seq-length ${MAXSEQLEN} \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--tokenizer-path ${TOKENIZER_PATH} \
--vocab-size 30000 \
--lr 0.00001 \
--warmup 0.1 \
--batch-size 1 \
--deepspeed \
--deepspeed_config ${DS_CONFIG} \
--log-interval 10 \
--eval-interval 50 \
--seed 23333 \
--results_dir ${RESULTS_DIR} \
--model_name ${MODEL_NAME} \
--epoch 10 \
--checkpoint-activations \
--deepspeed-activation-checkpointing
Error message:
Traceback (most recent call last):
File "finetune_text_generation.py", line 324, in <module>
main()
File "finetune_text_generation.py", line 238, in main
output = model(**batch)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1606, in forward
loss = self.module(*inputs, **kwargs)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/model/distributed.py", line 78, in forward
return self.module(*inputs, **kwargs)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/model/gpt2_modeling.py", line 97, in forward
transformer_output = self.transformer(embeddings, attention_mask)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/mpu/transformer.py", line 412, in forward
hidden_states, attention_mask)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 743, in checkpoint
CheckpointFunction.apply(function, all_outputs, *args)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 582, in forward
outputs = run_function(*inputs_cuda)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/mpu/transformer.py", line 402, in custom_forward
x_ = layer(x_, inputs[1])
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/mpu/transformer.py", line 288, in forward
attention_output = self.attention(layernorm_output, ltor_mask)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/mpu/transformer.py", line 132, in forward
attention_scores = torch.mul(attention_scores, ltor_mask) - \
RuntimeError: The size of tensor a (1024) must match the size of tensor b (1048576) at non-singleton dimension 3
Traceback (most recent call last):
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/kingsoft/anaconda3/envs/liubiao2/bin/python3', '-u', 'finetune_text_generation.py', '--local_rank=0', '--do_train', '--do_eval', '--data_dir', './data/novel/preprocessed_id/', '--model-parallel-size', '1', '--num-layers', '12', '--hidden-size', '768', '--load', '/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill', '--num-attention-heads', '12', '--seq-length', '1024', '--max-position-embeddings', '1024', '--tokenizer-type', 'GPT2BPETokenizer', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--lr', '0.00001', '--warmup', '0.1', '--batch-size', '1', '--deepspeed', '--deepspeed_config', '/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/scripts/novel/../ds_config/ds_finetune_large_fp32.json', '--log-interval', '10', '--eval-interval', '50', '--seed', '23333', '--results_dir', 'results/', '--model_name', 'finetune-novel', '--epoch', '10', '--checkpoint-activations', '--deepspeed-activation-checkpointing']' returned non-zero exit status 1.
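(Side note: 1048576 = 1024 × 1024, so the ltor_mask reaching the attention op appears to carry seq_len² entries in its last dimension rather than being shaped (1, 1, seq_len, seq_len). The sketch below only illustrates the conventional left-to-right mask shape for comparison; it is not the repository's exact code:)

import torch

seq_len = 1024
# Lower-triangular causal mask, broadcastable against (batch, heads, seq, seq) attention scores.
ltor_mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)
print(ltor_mask.shape)  # torch.Size([1, 1, 1024, 1024])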
Is the DeepSpeed configuration correct?
The DeepSpeed version follows the original project's setup (deepspeed==0.3.15), but I still haven't found the cause.
I'm not sure whether it is related to running on a single GPU; inexplicable errors keep showing up.
ds_finetune_large_fp32.json
The file was configured based on https://github.com/TsinghuaAI/CPM-1-Distill/blob/main/configs/deepspeed/ds_zero2_config_small.json
Thanks, I'll give it a try.
Hi, could you try running it on your side? I keep hitting errors here, and in principle this shouldn't have much to do with the config file. Sorry for the trouble, and thanks.
finetune_novel_fp32.sh file:
#!/bin/bash
DATA_DIR="./data/novel/preprocessed_id/"
CHECKPOINT_PATH="./path_v2/to/CPM-distill"
RESULTS_DIR="results/"
MODEL_NAME="finetune-novel"
TOKENIZER_PATH="bpe_3w_new/"
MPSIZE=1
NLAYERS=6
NHIDDEN=2560
NATT=32
MAXSEQLEN=1024
CUR_PATH=$(realpath $0)
CUR_DIR=$(dirname ${CUR_PATH})
DS_CONFIG="${CUR_DIR}/../ds_config/ds_zero2_config_small.json"
python3 -m torch.distributed.launch --master_port ${1-1122} --nproc_per_node 1 finetune_text_generation.py \
--do_train \
--do_eval \
--data_dir ${DATA_DIR} \
--model-parallel-size ${MPSIZE} \
--num-layers ${NLAYERS} \
--hidden-size ${NHIDDEN} \
--load ${CHECKPOINT_PATH} \
--num-attention-heads ${NATT} \
--seq-length ${MAXSEQLEN} \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--tokenizer-path ${TOKENIZER_PATH} \
--vocab-size 30000 \
--lr 0.00001 \
--warmup 0.1 \
--batch-size 1 \
--deepspeed \
--deepspeed_config ${DS_CONFIG} \
--log-interval 10 \
--eval-interval 50 \
--seed 23333 \
--results_dir ${RESULTS_DIR} \
--model_name ${MODEL_NAME} \
--epoch 10 \
--checkpoint-activations \
--deepspeed-activation-checkpointing
Error message:
(liubiao2) kingsoft@k8s-w-10-13-84-7:~/liubiao2/smartWriter/CPM-1-Finetune$ bash scripts/novel/finetune_novel_fp32.sh
using world size: 1 and model-parallel size: 1
> using dynamic loss scaling
> initializing model parallel with size 1
[2021-12-09 10:11:40,637] [INFO] [checkpointing.py:734:_configure_using_config_file] {'partition_activations': True, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2021-12-09 10:11:40,637] [INFO] [checkpointing.py:231:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 26051 and data parallel seed: 23333
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6777/6777 [00:00<00:00, 15329.86it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 385/385 [00:00<00:00, 35938.91it/s]
building GPT2 model ...
> number of parameters on model parallel rank 0: 551485440
26 50
Optimizer = FusedAdam
learning rate decaying linear
DeepSpeed is enabled.
[2021-12-09 10:11:48,665] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15, git-hash=unknown, git-branch=unknown
[2021-12-09 10:11:48,674] [INFO] [engine.py:605:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-12-09 10:11:48,674] [INFO] [engine.py:609:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-12-09 10:11:48,674] [INFO] [engine.py:619:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
Checking ZeRO support for optimizer=FusedAdam type=<class 'apex.optimizers.fused_adam.FusedAdam'>
[2021-12-09 10:11:48,674] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2021-12-09 10:11:48,674] [INFO] [stage2.py:101:__init__] Reduce bucket size 500000000
[2021-12-09 10:11:48,674] [INFO] [stage2.py:102:__init__] Allgather bucket size 500000000
[2021-12-09 10:11:48,674] [INFO] [stage2.py:103:__init__] CPU Offload: False
Using /home/kingsoft/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/kingsoft/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.28600168228149414 seconds
[2021-12-09 10:11:50,219] [INFO] [stage2.py:375:__init__] optimizer state initialized
[2021-12-09 10:11:50,219] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2021-12-09 10:11:50,219] [INFO] [engine.py:455:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-12-09 10:11:50,219] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <learning_rates.AnnealingLR object at 0x7efcedbc7978>
[2021-12-09 10:11:50,219] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2021-12-09 10:11:50,219] [INFO] [config.py:741:print] DeepSpeedEngine configuration:
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] activation_checkpointing_config {
"partition_activations": true,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] allreduce_always_fp32 ........ False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] amp_enabled .................. False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] amp_params ................... False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] checkpoint_tag_validation_enabled True
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] checkpoint_tag_validation_fail False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] disable_allgather ............ False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] dump_state ................... False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] dynamic_loss_scale_args ...... {'init_scale': 262144, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] elasticity_enabled ........... False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 3,
"detailed": true
}
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] fp16_enabled ................. True
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] global_rank .................. 0
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] gradient_accumulation_steps .. 2
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] gradient_clipping ............ 1.0
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] gradient_predivide_factor .... 1.0
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] initial_dynamic_scale ........ 262144
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] loss_scale ................... 0
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] memory_breakdown ............. False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] optimizer_legacy_fusion ...... False
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] optimizer_name ............... None
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] optimizer_params ............. None
[2021-12-09 10:11:50,220] [INFO] [config.py:745:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] pld_enabled .................. False
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] pld_params ................... False
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] prescale_gradients ........... False
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] scheduler_name ............... None
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] scheduler_params ............. None
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] sparse_attention ............. None
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] sparse_gradients_enabled ..... False
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] steps_per_print .............. 100
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] tensorboard_enabled .......... False
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] tensorboard_output_path ......
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] train_batch_size ............. 2
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] train_micro_batch_size_per_gpu 1
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] wall_clock_breakdown ......... True
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] world_size ................... 1
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] zero_allow_untested_optimizer True
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] zero_config .................. {
"stage": 2,
"contiguous_gradients": false,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": false,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+12,
"prefetch_bucket_size": 5.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_fp16_weights_on_model_save": false
}
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] zero_enabled ................. True
[2021-12-09 10:11:50,221] [INFO] [config.py:745:print] zero_optimization_stage ...... 2
[2021-12-09 10:11:50,221] [INFO] [config.py:752:print] json = {
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 2,
"steps_per_print": 100,
"zero_optimization": {
"stage": 2
},
"zero_allow_untested_optimizer": true,
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 18,
"hysteresis": 2,
"min_loss_scale": 1
},
"activation_checkpointing": {
"partition_activations": true,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": true
}
Using /home/kingsoft/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00041604042053222656 seconds
[2021-12-09 10:11:50,222] [INFO] [engine.py:1464:_load_checkpoint] rank: 0 loading checkpoint: ./path_v2/to/CPM-distill/310000/mp_rank_00_model_states.pt
Traceback (most recent call last):
File "finetune_text_generation.py", line 324, in <module>
main()
File "finetune_text_generation.py", line 208, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/utils.py", line 510, in setup_model_and_optimizer
args.iteration = load_checkpoint(model, optimizer, lr_scheduler, args)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/utils.py", line 281, in load_checkpoint
checkpoint_name, sd = model.load_checkpoint(args.load, iteration, load_module_strict=False, load_optimizer_states=False, load_lr_scheduler_states=False)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1440, in load_checkpoint
load_lr_scheduler_states=load_lr_scheduler_states)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1472, in _load_checkpoint
strict=load_module_strict)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1373, in load_module_state_dict
self.module.load_state_dict(state_dict, strict=strict)
File "/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/model/distributed.py", line 90, in load_state_dict
self.module.load_state_dict(state_dict, strict=strict)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for GPT2Model:
size mismatch for word_embeddings.weight: copying a param with shape torch.Size([15000, 768]) from checkpoint, the shape in current model is torch.Size([30000, 2560]).
size mismatch for position_embeddings.weight: copying a param with shape torch.Size([1024, 768]) from checkpoint, the shape in current model is torch.Size([1024, 2560]).
size mismatch for transformer.layers.0.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.0.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.0.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.0.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.0.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.0.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.0.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.0.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.1.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.1.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.1.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.1.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.1.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.1.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.1.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.2.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.2.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.2.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.2.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.2.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.2.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.2.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.3.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.3.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.3.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.3.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.3.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.3.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.3.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.4.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.4.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.4.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.4.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.4.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.4.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.4.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.input_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.input_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.attention.query_key_value.weight: copying a param with shape torch.Size([1152, 768]) from checkpoint, the shape in current model is torch.Size([7680, 2560]).
size mismatch for transformer.layers.5.attention.query_key_value.bias: copying a param with shape torch.Size([1152]) from checkpoint, the shape in current model is torch.Size([7680]).
size mismatch for transformer.layers.5.attention.dense.weight: copying a param with shape torch.Size([768, 384]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for transformer.layers.5.attention.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.post_attention_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.post_attention_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.layers.5.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([1536, 768]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for transformer.layers.5.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([1536]) from checkpoint, the shape in current model is torch.Size([10240]).
size mismatch for transformer.layers.5.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([768, 1536]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for transformer.layers.5.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.final_layernorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
size mismatch for transformer.final_layernorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([2560]).
Traceback (most recent call last):
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/kingsoft/anaconda3/envs/liubiao2/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/kingsoft/anaconda3/envs/liubiao2/bin/python3', '-u', 'finetune_text_generation.py', '--local_rank=0', '--do_train', '--do_eval', '--data_dir', './data/novel/preprocessed_id/', '--model-parallel-size', '1', '--num-layers', '6', '--hidden-size', '2560', '--load', './path_v2/to/CPM-distill', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--tokenizer-type', 'GPT2BPETokenizer', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--lr', '0.00001', '--warmup', '0.1', '--batch-size', '1', '--deepspeed', '--deepspeed_config', '/home/kingsoft/liubiao2/smartWriter/CPM-1-Finetune/scripts/novel/../ds_config/ds_zero2_config_small.json', '--log-interval', '10', '--eval-interval', '50', '--seed', '23333', '--results_dir', 'results/', '--model_name', 'finetune-novel', '--epoch', '10', '--checkpoint-activations', '--deepspeed-activation-checkpointing']' returned non-zero exit status 1.
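For reference, the shapes reported above can be checked directly against the checkpoint file named in the log. A minimal sketch, assuming the usual DeepSpeed layout in which the module weights sit under the "module" key of mp_rank_00_model_states.pt:
import torch

# Minimal sketch: print the parameter shapes stored in the distilled checkpoint.
# Assumes the DeepSpeed model-states layout with weights under the "module" key.
ckpt_path = "./path_v2/to/CPM-distill/310000/mp_rank_00_model_states.pt"
state = torch.load(ckpt_path, map_location="cpu")
module = state.get("module", state) if isinstance(state, dict) else state

for name, value in module.items():
    if hasattr(value, "shape"):
        print(name, tuple(value.shape))

# word_embeddings.weight being (15000, 768) rather than (30000, 2560) points to
# a hidden size of 768 and a vocabulary split across model-parallel partitions,
# and counting the transformer.layers.* entries gives the true layer count.
Comparing these shapes against the --num-layers, --hidden-size and --vocab-size arguments in the launch command makes the mismatch explicit.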
Hello, it's not convenient for me to debug this right now. This error means the model size is not configured correctly (the number of layers and hidden size do not match the checkpoint). You can try the script below:
#!/bin/bash
DATA_DIR="./data/novel/preprocessed_id/"
CHECKPOINT_PATH="/home/kingsoft/liubiao2/smartWriter/CPM/model/CPM-distill"
RESULTS_DIR="results/"
MODEL_NAME="finetune-novel"
TOKENIZER_PATH="bpe_3w_new/"
MPSIZE=1
NLAYERS=12
NHIDDEN=768
NATT=12
MAXSEQLEN=1024
CUR_PATH=$(realpath $0)
CUR_DIR=$(dirname ${CUR_PATH})
DS_CONFIG="${CUR_DIR}/../ds_config/ds_finetune_large_fp32.json"
python3 -m torch.distributed.launch --master_port ${1-1122} --nproc_per_node 1 finetune_text_generation.py \
--do_train \
--do_eval \
--data_dir ${DATA_DIR} \
--model-parallel-size ${MPSIZE} \
--num-layers ${NLAYERS} \
--hidden-size ${NHIDDEN} \
--load ${CHECKPOINT_PATH} \
--num-attention-heads ${NATT} \
--seq-length ${MAXSEQLEN} \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--tokenizer-path ${TOKENIZER_PATH} \
--vocab-size 30000 \
--lr 0.00001 \
--warmup 0.1 \
--batch-size 1 \
--deepspeed \
--deepspeed_config ${DS_CONFIG} \
--log-interval 10 \
--eval-interval 50 \
--seed 23333 \
--results_dir ${RESULTS_DIR} \
--model_name ${MODEL_NAME} \
--epoch 10 \
--checkpoint-activations \
--deepspeed-activation-checkpointing
and configure the DeepSpeed file ds_finetune_large_fp32.json according to https://github.com/TsinghuaAI/CPM-1-Distill/blob/main/configs/deepspeed/ds_zero2_config_small.json — just modify the corresponding parameters in ds_finetune_large_fp32.json.
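A minimal sketch of that last step, assuming ds_finetune_large_fp32.json uses the same keys as the json dump printed in the log above; the values below are placeholders, and the real ones should be taken from the linked ds_zero2_config_small.json:
import json

# Hypothetical path, following the DS_CONFIG variable in the script above.
cfg_path = "scripts/ds_config/ds_finetune_large_fp32.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# Placeholder edits; replace with the values from ds_zero2_config_small.json.
cfg["train_micro_batch_size_per_gpu"] = 1
cfg["gradient_accumulation_steps"] = 2
cfg.setdefault("fp16", {})["enabled"] = False  # fp32 fine-tuning, per the script name

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)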
It still doesn't work. I have tried many parameter combinations and all of them fail. I'm planning to give up on CPM: the model is too large, and text generation is too slow.
Could you recommend a few models for novel text generation? I'm not very familiar with this area. Thanks.
-
Then it's possible that the computation in the distilled model was changed as well. You could run it with the code from the https://github.com/TsinghuaAI/CPM-1-Distill repo, and combine that with https://github.com/zhenhao-huang/CPM-1-Finetune-Text-Generation/blob/main/finetune_text_generation.py to modify the corresponding text generation template.
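As a purely hypothetical illustration of what adapting the text generation template could look like (the names below are not taken from finetune_text_generation.py):
# Hypothetical sketch only: how a prompt template for novel continuation might
# be assembled before tokenization; adapt it to the actual preprocessing code.
def build_novel_prompt(title: str, context: str) -> str:
    # Concatenate a simple conditioning prefix with the story context.
    return f"标题：{title}\n正文：{context}"

def encode_sample(tokenizer, title: str, context: str, max_length: int = 1024):
    prompt = build_novel_prompt(title, context)
    ids = tokenizer.encode(prompt)  # assumes the tokenizer exposes encode()
    return ids[:max_length]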
-
The repo https://github.com/Morizeyao/GPT2-Chinese can also be used for text generation, although the current trend is that larger models generally produce better results.
OK, thanks a lot.